An XML Approach to Semantically Extract Data from HTML Tables

Liu, Jixue; Ao, Zhuoyun; Park, Ho-Hyun; Chen, Yongfeng

doi:10.1007/11546924_68


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3588)

Included in the following conference series:

International Conference on Database and Expert Systems Applications


Abstract

Data-intensive information is often published on the internet as HTML tables. Extracting the information that interests users is time consuming, especially when a large number of web pages must be accessed. To automate information extraction, this paper proposes an XML-based way of semantically analyzing HTML tables for the data of interest. It first introduces a mini language in XML syntax for specifying ontologies that represent the data of interest. It then defines algorithms that parse HTML tables into a specially defined type of XML tree. The XML trees are then compared with the ontologies to semantically analyze and locate the parts of a table or of nested tables that contain the data of interest. Finally, the data of interest, once identified, is output as XML documents.
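
The pipeline outlined in the abstract (ontology in XML syntax, HTML table parsed into an XML tree, tree compared with the ontology, result output as XML) can be illustrated with a minimal Python sketch. It is not the paper's mini language or its algorithms: the ontology element names (`ontology`, `concept`, `keyword`), the `TableToXml` class, and the `extract` function are all hypothetical, and the matching rule (comparing first-row cell text with ontology keywords) is an assumed simplification. The sketch also assumes that every table tag in the input is explicitly closed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical ontology in XML syntax; NOT the mini language defined in the paper.
ONTOLOGY = """
<ontology name="car">
  <concept name="model"><keyword>model</keyword></concept>
  <concept name="price"><keyword>price</keyword><keyword>cost</keyword></concept>
</ontology>
"""


class TableToXml(HTMLParser):
    """Builds an XML tree that keeps only table, tr, and cell elements.

    Assumes every table/tr/td/th start tag has a matching end tag.
    """

    KEEP = {"table", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.root = ET.Element("tables")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            name = "cell" if tag in ("td", "th") else tag
            self.stack.append(ET.SubElement(self.stack[-1], name))

    def handle_endtag(self, tag):
        if tag in self.KEEP and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        node = self.stack[-1]
        if node.tag == "cell" and data.strip():
            node.text = ((node.text or "") + " " + data.strip()).strip()


def extract(html, ontology_xml):
    """Returns a <result> element holding one <record> per matching data row."""
    parser = TableToXml()
    parser.feed(html)
    # Map each lower-cased keyword to the name of the concept it belongs to.
    concepts = {
        kw.text.lower(): concept.get("name")
        for concept in ET.fromstring(ontology_xml.strip()).iter("concept")
        for kw in concept.iter("keyword")
    }
    result = ET.Element("result")
    # iter() also visits nested tables, so each one is checked on its own.
    for table in parser.root.iter("table"):
        rows = table.findall("tr")
        if not rows:
            continue
        # Assumed heuristic: the first row is a header naming the columns.
        header = [concepts.get((c.text or "").lower()) for c in rows[0].findall("cell")]
        if not any(header):
            continue  # No column matches the ontology; skip this table.
        for row in rows[1:]:
            record = ET.SubElement(result, "record")
            for name, cell in zip(header, row.findall("cell")):
                if name:
                    ET.SubElement(record, name).text = cell.text
    return result
```

Calling `extract(html, ONTOLOGY)` on a page and serializing the result with `ET.tostring(result, encoding="unicode")` yields an XML document containing only the columns whose headers match an ontology keyword; columns with unmatched headers are dropped.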

This research was supported by the international joint research grant of the IITA (Institute of Information Technology Assessment) foreign professor invitation program of the MIC (Ministry of Information and Communication), Korea.



Author information

Authors and Affiliations

School of Computer and Information Science, University of South Australia
Jixue Liu & Zhuoyun Ao
School of Electrical and Electronics Engineering, Chung-Ang University
Ho-Hyun Park
Faculty of Management, Xian University of Architecture and Technology
Yongfeng Chen


Editor information

Editors and Affiliations

Copenhagen Business School, Centre for Applied ICT, 60 Howitzvej, 2000, Frederiksberg, DK
Kim Viborg Andersen
University Of Technology Sydney, NSW 2007, Australia
John Debenham
University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Roland Wagner


About this paper

Cite this paper

Liu, J., Ao, Z., Park, HH., Chen, Y. (2005). An XML Approach to Semantically Extract Data from HTML Tables. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546924_68


DOI: https://doi.org/10.1007/11546924_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28566-3
Online ISBN: 978-3-540-31729-6
eBook Packages: Computer Science, Computer Science (R0)
