An XML Approach to Semantically Extract Data from HTML Tables

  • Conference paper
Database and Expert Systems Applications (DEXA 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3588)

Abstract

Data-intensive information is often published on the internet as HTML tables. Extracting the information that is of interest to users, especially when a large number of web pages must be accessed, is time-consuming. To automate information extraction, this paper proposes an XML-based way of semantically analyzing HTML tables for the data of interest. It first introduces a mini-language in XML syntax for specifying ontologies that represent the data of interest. It then defines algorithms that parse HTML tables into a specially defined type of XML tree. The XML trees are then compared with the ontologies to semantically analyze and locate the parts of a table, or of nested tables, that contain the data of interest. Finally, the data of interest, once identified, is output as XML documents.
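
The pipeline described in the abstract (an ontology written in XML, HTML tables parsed into XML trees, tree-to-ontology matching, XML output) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `<ontology>/<concept>/<label>` format, the `TableToXml` and `extract` names, the rule that a table matches when its first row covers every concept, and the sample data. The sketch also assumes well-formed HTML in which every table, row, and cell tag is explicitly closed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical ontology format: each concept lists the header labels that denote it.
ONTOLOGY = """
<ontology name="car">
  <concept name="model"><label>Model</label><label>Make</label></concept>
  <concept name="price"><label>Price</label><label>Cost</label></concept>
</ontology>
"""

# Invented sample page: a layout table wrapping a nested data table.
SAMPLE_HTML = """
<table><tr><td>Navigation</td><td>
  <table>
    <tr><th>Make</th><th>Year</th><th>Price</th></tr>
    <tr><td>Civic</td><td>2003</td><td>9500</td></tr>
    <tr><td>Golf</td><td>2001</td><td>7200</td></tr>
  </table>
</td></tr></table>
"""


class TableToXml(HTMLParser):
    """Build an XML tree that mirrors the table/tr/cell nesting of a page."""

    TAGS = {"table", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.root = ET.Element("tables")
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            name = "cell" if tag in ("td", "th") else tag
            self.stack.append(ET.SubElement(self.stack[-1], name))

    def handle_endtag(self, tag):
        if tag in self.TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        node = self.stack[-1]
        if text and node.tag == "cell":
            node.text = ((node.text or "") + " " + text).strip()


def extract(html, ontology_xml):
    """Return an XML element holding one record per matching table row."""
    parser = TableToXml()
    parser.feed(html)
    onto = ET.fromstring(ontology_xml)
    # Map each lower-cased header label to the concept it denotes.
    labels = {lab.text.lower(): c.get("name")
              for c in onto.findall("concept") for lab in c.findall("label")}
    out = ET.Element(onto.get("name") + "s")
    for table in parser.root.iter("table"):  # visits nested tables too
        rows = table.findall("tr")           # direct rows only
        if not rows:
            continue
        header = [(c.text or "").lower() for c in rows[0].findall("cell")]
        cols = {i: labels[h] for i, h in enumerate(header) if h in labels}
        if set(cols.values()) != set(labels.values()):
            continue  # this table does not cover every concept
        for row in rows[1:]:
            cells = row.findall("cell")
            record = ET.SubElement(out, onto.get("name"))
            for i, concept in cols.items():
                if i < len(cells):
                    ET.SubElement(record, concept).text = cells[i].text
    return out


print(ET.tostring(extract(SAMPLE_HTML, ONTOLOGY), encoding="unicode"))
```

In this sketch the outer layout table is skipped because its first row names none of the concepts, while the nested table matches on "Make" and "Price" and yields two `car` records; the unmatched "Year" column is ignored. Matching on exact header labels is the simplest possible comparison and stands in for whatever semantic analysis the paper actually defines.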

This research was supported by the international joint research grant of the IITA (Institute of Information Technology Assessment) foreign professor invitation program of the MIC (Ministry of Information and Communication), Korea.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, J., Ao, Z., Park, HH., Chen, Y. (2005). An XML Approach to Semantically Extract Data from HTML Tables. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546924_68

  • DOI: https://doi.org/10.1007/11546924_68

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28566-3

  • Online ISBN: 978-3-540-31729-6

  • eBook Packages: Computer Science, Computer Science (R0)