Extraction of Meaningful Information from the Web: a Brief Survey

Santosh V. Chobe; Dr. Shirish S. Sane

doi:10.14419/ijet.v7i4.19.28283

Received date: March 10, 2019

Accepted date: March 10, 2019

Published date: November 27, 2018

Keywords:

Information Extraction, Web Mining, Wrapper Generation, Wrapper Induction.

Abstract

The explosive growth of information on the Internet makes extracting relevant data from its various sources a difficult task for users. Information Extraction (IE) systems are therefore needed to transform Web pages into databases. Using information extraction, relevant information in Web documents can be extracted and presented in a structured format.
Information extraction techniques can be applied to structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user's perspective.

