An XML Approach to Semantically Extract Data from HTML Tables

  • Conference paper
Database and Expert Systems Applications (DEXA 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3588)


Data-intensive information is often published on the internet in the form of HTML tables. Extracting the information that is of interest to users, especially when a large number of web pages must be accessed, is time-consuming. To automate information extraction, this paper proposes an XML-based way of semantically analyzing HTML tables for the data of interest. It first introduces a mini-language in XML syntax for specifying ontologies that represent the data of interest. It then defines algorithms that parse HTML tables into a specially defined type of XML tree. The XML trees are then compared with the ontologies to semantically analyze and locate the part of a table, or the nested tables, that contain the data of interest. Finally, the data of interest, once identified, is output as XML documents.
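The pipeline described in the abstract (ontology in XML, HTML tables parsed into XML trees, tree-to-ontology comparison, XML output) can be illustrated with a minimal sketch. The abstract does not give the paper's mini-language or its algorithms, so everything below is an assumption made for illustration: the `<ontology>`/`<concept>` format, the `keywords` attribute, the `<table>/<row>/<cell>` tree shape, and the header-matching rule are all hypothetical, and the input is assumed to be well-formed HTML whose first table row is a header.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical ontology format; NOT the mini-language defined in the paper.
ONTOLOGY = """
<ontology name="car">
  <concept name="model" keywords="model,name"/>
  <concept name="price" keywords="price,cost"/>
</ontology>
"""

# Sample page: a layout table wrapping a nested data table.
HTML = """
<table><tr><td>Navigation</td><td>
  <table>
    <tr><th>Model</th><th>Year</th><th>Price</th></tr>
    <tr><td>Civic</td><td>2004</td><td>15000</td></tr>
    <tr><td>Golf</td><td>2003</td><td>14000</td></tr>
  </table>
</td></tr></table>
"""


class TableToXml(HTMLParser):
    """Builds an XML tree of <table>/<row>/<cell> elements from HTML tables."""

    NAMES = {"table": "table", "tr": "row", "td": "cell", "th": "cell"}

    def __init__(self):
        super().__init__()
        self.root = ET.Element("tables")
        self.stack = [self.root]  # open XML elements, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in self.NAMES:
            self.stack.append(ET.SubElement(self.stack[-1], self.NAMES[tag]))

    def handle_endtag(self, tag):
        # Assumes well-formed markup: each end tag closes the innermost element.
        if tag in self.NAMES and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack[-1].tag == "cell":
            cell = self.stack[-1]
            cell.text = ((cell.text or "") + " " + text).strip()


def extract(html, ontology_xml):
    """Returns an XML element holding the rows of every matching table."""
    parser = TableToXml()
    parser.feed(html)
    parser.close()
    ontology = ET.fromstring(ontology_xml)
    keywords = {c.get("name"): c.get("keywords").split(",")
                for c in ontology.findall("concept")}
    result = ET.Element(ontology.get("name") + "s")
    # iter() visits nested tables too, so each table is checked on its own.
    for table in parser.root.iter("table"):
        rows = table.findall("row")
        if not rows:
            continue
        header = [(c.text or "").lower() for c in rows[0].findall("cell")]
        columns = {}  # column index -> concept name
        for i, label in enumerate(header):
            for concept, words in keywords.items():
                if any(w in label.split() for w in words):
                    columns[i] = concept
                    break
        if set(columns.values()) != set(keywords):
            continue  # this table does not cover every concept of interest
        for row in rows[1:]:
            cells = row.findall("cell")
            record = ET.SubElement(result, ontology.get("name"))
            for i, concept in columns.items():
                if i < len(cells):
                    ET.SubElement(record, concept).text = cells[i].text
    return result


if __name__ == "__main__":
    print(ET.tostring(extract(HTML, ONTOLOGY), encoding="unicode"))
```

In this sketch the outer layout table is skipped because its first row matches no concept, while the nested table is selected and its `Model` and `Price` columns are emitted as `<car>` records; the unmatched `Year` column is dropped.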

This research was supported by the international joint research grant of the IITA (Institute of Information Technology Assessment) foreign professor invitation program of the MIC (Ministry of Information and Communication), Korea.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, J., Ao, Z., Park, HH., Chen, Y. (2005). An XML Approach to Semantically Extract Data from HTML Tables. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg.

  • DOI:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28566-3

  • Online ISBN: 978-3-540-31729-6

  • eBook Packages: Computer Science (R0)
