Finding and Extracting Data Records from Web Pages

Journal of Signal Processing Systems

Abstract

Many HTML pages are generated by software that queries an underlying database and fills in a template with the results. In the process, the metainformation describing the data structure is lost, so automated programs cannot process the data as powerfully as they could process information from a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only a single input page. It first identifies the data region of interest in the page. It then partitions that region into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
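To make the record-partitioning step more concrete, the following is a minimal sketch of grouping similar sibling subtrees, not the authors' algorithm. It assumes that each subtree can be summarised by its pre-order sequence of tag names, that similarity is a normalised edit distance over those sequences, and that a greedy single-pass grouping with a threshold of 0.7 is adequate. The function names, the threshold, and the sample page are all hypothetical, and the multiple string alignment step is not covered.

```python
import xml.etree.ElementTree as ET


def tag_sequence(node):
    """Return the pre-order list of tag names in the subtree rooted at node."""
    return [el.tag for el in node.iter()]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x by y
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(nodes, threshold=0.7):
    """Greedy clustering: join the first cluster whose first member is similar enough.

    The threshold is an illustrative assumption, not a value from the article.
    """
    clusters = []
    for node in nodes:
        seq = tag_sequence(node)
        for cluster in clusters:
            if similarity(seq, tag_sequence(cluster[0])) >= threshold:
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters


# Hypothetical, well-formed data region with three record-like subtrees.
PAGE = """
<div>
  <h1>Results</h1>
  <div><b>Title A</b><span>10</span></div>
  <div><b>Title B</b><span>12</span><i>new</i></div>
  <div><b>Title C</b><span>9</span></div>
  <p>Footer</p>
</div>
""".strip()

region = ET.fromstring(PAGE)
clusters = cluster_subtrees(list(region))
sizes = [len(c) for c in clusters]
```

In this toy region the three inner `div` subtrees end up in one cluster even though the second one has an extra optional `i` element, while the heading and footer stay in clusters of their own. Real pages are rarely well-formed XML, so a practical version would need an HTML parser that builds a DOM tree from malformed markup.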

Notes

  1. http://www.tic.udc.es/~mad/resources/projects/dataextraction/testcollection_0507.htm

  2. http://www.dia.uniroma3.it/db/roadRunner/software.html

Acknowledgements

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Science.

Author information

Corresponding author

Correspondence to Manuel Álvarez.

About this article

Cite this article

Álvarez, M., Pan, A., Raposo, J. et al. Finding and Extracting Data Records from Web Pages. J Sign Process Syst Sign Image Video Technol 59, 123–137 (2010). https://doi.org/10.1007/s11265-008-0270-y

  • DOI: https://doi.org/10.1007/s11265-008-0270-y
