
Automatic information extraction from large websites

Published: 01 September 2004

Abstract

Information extraction from websites is now an important problem, and it is usually performed by software modules called wrappers. A key requirement is that the wrapper generation process be automated to the largest possible extent, so that large-scale extraction tasks remain feasible even when the underlying sites change. So far, however, only semi-automatic proposals have appeared in the literature.

We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, because they rely on some unrealistic assumptions about the input, these algorithms are not practically applicable to Web information extraction tasks.

The main contributions of the article are the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be used in practice for information extraction.

A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach.
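To make the notion of a wrapper concrete, the following minimal sketch treats a wrapper as a regular expression applied to pages that share a template. The template, the field names `title` and `items`, and the sample pages are illustrative assumptions, not taken from the article. The wrapper here is written by hand; the article's contribution is inferring such expressions automatically from the pages alone, which this sketch does not attempt.

```python
import re

# Hypothetical template shared by the sample pages (an assumption for
# illustration only): a bold title followed by a non-empty list of items.
WRAPPER = re.compile(
    r"<html><b>(?P<title>[^<]+)</b>"
    r"<ul>(?P<items>(?:<li>[^<]+</li>)+)</ul></html>"
)
ITEM = re.compile(r"<li>([^<]+)</li>")


def extract(page: str) -> dict:
    """Apply the hand-written wrapper to one page and return its fields."""
    match = WRAPPER.fullmatch(page)
    if match is None:
        raise ValueError("page does not match the template")
    return {
        "title": match.group("title"),
        # The repeated <li> group is unpacked with a second expression.
        "items": ITEM.findall(match.group("items")),
    }


# Two made-up pages generated from the same template with different data.
pages = [
    "<html><b>Databases</b><ul><li>Codd</li><li>Ullman</li></ul></html>",
    "<html><b>Logic</b><ul><li>Church</li></ul></html>",
]
records = [extract(p) for p in pages]
```

The repeated list item is the kind of structure (iteration inside a fixed mark-up skeleton) that a regular expression can capture, which is why regular languages are a natural setting for wrapper inference.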


Published In

Journal of the ACM  Volume 51, Issue 5
September 2004
151 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1017460

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Information extraction
  2. relational model
  3. wrappers

Qualifiers

  • Article
