Abstract
Authors of HTML documents use various methods to convey their intention clearly. Among these methods, this paper pays special attention to tables, because tables are commonly used in many documents to make meaning clear and are easy to recognize, since web documents mark them with tags. On the Internet, tables are used both to structure knowledge and to lay out documents. We are therefore first interested in classifying tables into two types: meaningful tables and decorative tables. This is not easy, however, because HTML does not separate presentation from structure. This paper proposes a method of extracting meaningful tables using a modified k-means and compares it with other methods. The experimental results show that this classification of tables in web documents is promising.
This work was supported by the National Research Laboratory Program (Contract Number: M10203000028-02J0000-01510) of KISTEP.
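As a rough illustration of the clustering idea named in the abstract, the sketch below runs plain, unmodified k-means with two clusters over per-table feature vectors. The abstract does not describe the paper's modification to k-means or its features, so the three features used here (share of cells with short text or numbers, presence of header cells, share of cells holding images or nested tables) and all the sample values are assumptions made only for this example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain (unmodified) k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical feature vectors, one per <table> element:
# (share of cells with short text or numbers,
#  1.0 if the table has <th> cells else 0.0,
#  share of cells holding images or nested tables)
tables = [
    (0.90, 1.0, 0.00),
    (0.10, 0.0, 0.80),
    (0.85, 1.0, 0.05),
    (0.20, 0.0, 0.70),
    (0.95, 1.0, 0.00),
    (0.15, 0.0, 0.90),
]

labels, centroids = kmeans(tables, k=2)
print(labels)
```

k-means only groups the tables; it does not name the groups. Deciding which cluster corresponds to meaningful tables and which to decorative ones would still require inspecting the centroids or a few labelled examples.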
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jung, Sw., Han, Gd., Kwon, Hc. (2005). Mining Table Information on the Internet. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_62
DOI: https://doi.org/10.1007/978-3-540-30211-7_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science, Computer Science (R0)