Automatic information extraction from large websites

Authors:

Valter Crescenzi,

Giansalvatore Mecca

Journal of the ACM (JACM), Volume 51, Issue 5

Pages 731 - 779

https://doi.org/10.1145/1017460.1017462

Published: 01 September 2004

Abstract

Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature.

We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks.

The main contributions of the article lie in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes.

A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach.
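To make the notion of a wrapper concrete, here is a minimal sketch of a hand-written one. It is not the article's learning algorithm, which would infer such a pattern automatically. The page template, the field names `title` and `items`, and the sample pages are all hypothetical, chosen only to show a regular expression acting as an extractor over pages that share a common mark-up structure.

```python
import re

# Hypothetical template: every page of an imaginary site uses the same
# mark-up, and only the text between the tags changes from page to page.
WRAPPER = re.compile(
    r"<html><body><h1>(?P<title>[^<]*)</h1>"
    r"<ul>(?P<items>(?:<li>[^<]*</li>)+)</ul></body></html>"
)
ITEM = re.compile(r"<li>([^<]*)</li>")


def extract(page):
    """Return the fields of a page that matches the template, else None."""
    match = WRAPPER.fullmatch(page)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        # The repeated <li> element plays the role of an iterated
        # sub-expression in the regular language describing the pages.
        "items": ITEM.findall(match.group("items")),
    }


# Two hypothetical pages produced by the same template.
pages = [
    "<html><body><h1>Databases</h1>"
    "<ul><li>Ullman</li><li>Widom</li></ul></body></html>",
    "<html><body><h1>Logic</h1><ul><li>Enderton</li></ul></body></html>",
]
records = [extract(page) for page in pages]
```

Writing `WRAPPER` by hand for every site, and rewriting it whenever the site changes its layout, is the manual effort that automatic wrapper generation aims to remove.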

Information

Published In

Journal of the ACM Volume 51, Issue 5

September 2004

151 pages

ISSN:0004-5411

EISSN:1557-735X

DOI:10.1145/1017460

Copyright © 2004 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States
