Abstract
The amount of semi-structured documents is tremendous online, such as business annual reports, online airport listings, catalogs, hotel directories, etc. List, which has structured characteristics, is used to store highly structured and database-like information in many semi-structured documents. This paper is about list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. List data extraction is of benefit to text mining applications on semi-structured documents. Recently, several methods are proposed to extract list data by utilizing the word layout and arrangement information [1, 2]. However, in the research community, few previous studies has so far sufficiently investigated the problem of making use of not only layout and arrangement information, but also the semantic information of words, to the best of our knowledge. In this paper, we propose a clustering based method making use of both the layout information and the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Lerman, K., Knoblock, C., Minton, S.: Automatic Data Extraction from Lists and Tables in Web Sources. In: Proc. ATEM 2001, pp. 34–41. IEEE Press, USA (2001)
Douglas, S., Hurst, M.: Layout and language: Lists and tables in technical documents. In: Proceedings of the ACL SIGPARSE Workshop on Punctuation in Computational Linguistics, pp. 19–24 (1996)
Zeng, H.-J., He, Q.-C., Chen, Z., Ma, W.-Y., Ma, J.: Learning to Cluster Web Search Results. In: Proceedings of the 27th annual international conference on Research and development in information retrieval (2004)
Soo, V.-W., Lee, C.-Y., Li, C.-C., Chen, S.L., Chen, C.-c.: Automated Semantic Annotation and Retrieval Based on Sharable Ontology and Case-based Learning Techniques. In: Proceedings of the 2003 Joint Conference on Digital Libraries. IEEE, Los Alamitos (2003)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xu, H., Li, JZ., Xu, P. (2005). List Data Extraction in Semi-structured Document. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_51
Download citation
DOI: https://doi.org/10.1007/11581062_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer ScienceComputer Science (R0)