List Data Extraction in Semi-structured Document

Xu, Hui; Li, Juan-Zi; Xu, Peng

doi:10.1007/11581062_51

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

The number of semi-structured documents online is tremendous: business annual reports, online airport listings, catalogs, hotel directories, and so on. Lists, which have structured characteristics, are used to store highly structured, database-like information in many semi-structured documents. This paper addresses list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. List data extraction benefits text mining applications on semi-structured documents. Recently, several methods have been proposed to extract list data by utilizing word layout and arrangement information [1, 2]. However, to the best of our knowledge, few previous studies have sufficiently investigated making use of not only layout and arrangement information but also the semantic information of words. In this paper, we propose a clustering-based method that makes use of both the layout information and the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from the Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.
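The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It groups the cells of a plain-text list into columns with a simple greedy clustering whose distance combines a layout cue (the cell's start column) with a crude semantic cue (numeric versus textual content). All names and parameters here (`token_type`, `tokenize`, `extract_list`, `max_dist`, `type_penalty`, `SAMPLE`) are assumptions introduced for illustration.

```python
import re


def token_type(text):
    """Crude stand-in for semantic information: numeric vs. textual cell."""
    return "num" if re.fullmatch(r"[\d,.%-]+", text) else "text"


def tokenize(line):
    """Split a line into cells separated by two or more spaces.

    Returns (start_column, cell_text) pairs; single spaces stay inside a cell.
    """
    return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]


def extract_list(lines, max_dist=4.0, type_penalty=3.0):
    """Group cells into columns by greedy clustering, then rebuild rows.

    Distance = |start column - cluster mean position|, plus a penalty when
    the cell's type differs from the type of the cluster's first member.
    """
    clusters = []
    for row, line in enumerate(lines):
        for start, text in tokenize(line):
            kind = token_type(text)
            best, best_d = None, None
            for c in clusters:
                d = abs(start - c["pos"])
                if kind != c["type"]:
                    d += type_penalty
                if best_d is None or d < best_d:
                    best, best_d = c, d
            if best is not None and best_d <= max_dist:
                best["cells"].append((row, text))
                # Update the running mean of the cluster's start column.
                best["pos"] += (start - best["pos"]) / len(best["cells"])
            else:
                clusters.append(
                    {"pos": float(start), "type": kind, "cells": [(row, text)]}
                )
    clusters.sort(key=lambda c: c["pos"])
    table = [[""] * len(clusters) for _ in lines]
    for col, c in enumerate(clusters):
        for row, text in c["cells"]:
            # If two cells of one row land in the same column, join them.
            table[row][col] = (table[row][col] + " " + text).strip()
    return table


# Hypothetical sample: a label column and two right-aligned numeric columns.
SAMPLE = [
    f"{'Revenue':<17}{'1,200':>5}{'1,050':>8}",
    f"{'Net profit':<17}{'310':>5}{'275':>8}",
    f"{'Total assets':<17}{'9,840':>5}{'9,120':>8}",
]

if __name__ == "__main__":
    for table_row in extract_list(SAMPLE):
        print(table_row)
```

Because the numbers are right-aligned, their start columns differ from row to row. The tolerance `max_dist` absorbs that jitter, while `type_penalty` discourages merging a textual cell into a numeric column. A real system would need a much richer notion of word semantics than this two-way split.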

References

[1] Lerman, K., Knoblock, C., Minton, S.: Automatic Data Extraction from Lists and Tables in Web Sources. In: Proc. ATEM 2001, pp. 34–41. IEEE Press, USA (2001)
[2] Douglas, S., Hurst, M.: Layout and language: Lists and tables in technical documents. In: Proceedings of the ACL SIGPARSE Workshop on Punctuation in Computational Linguistics, pp. 19–24 (1996)
[3] Zeng, H.-J., He, Q.-C., Chen, Z., Ma, W.-Y., Ma, J.: Learning to Cluster Web Search Results. In: Proceedings of the 27th annual international conference on Research and development in information retrieval (2004)
[4] Soo, V.-W., Lee, C.-Y., Li, C.-C., Chen, S.L., Chen, C.-c.: Automated Semantic Annotation and Retrieval Based on Sharable Ontology and Case-based Learning Techniques. In: Proceedings of the 2003 Joint Conference on Digital Libraries. IEEE, Los Alamitos (2003)

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China
Hui Xu, Juan-Zi Li & Peng Xu

Editor information

Editors and Affiliations

Texas State University, San Marcos, TX
Anne H. H. Ngu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
University of Vienna, Vienna, Austria
Erich J. Neuhold
IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, 10598, New York, Yorktown Heights, USA
Jen-Yao Chung
School of Computer Science and Engineering, University of New South Wales, NSW 2052, Sydney, Australia
Quan Z. Sheng

About this paper

Cite this paper

Xu, H., Li, JZ., Xu, P. (2005). List Data Extraction in Semi-structured Document. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_51

DOI: https://doi.org/10.1007/11581062_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science, Computer Science (R0)
