Extraction of Meaningful Tables from the Internet Using Decision Trees

Jung, Sung-Won; Lee, Won-Hee; Park, Sang-Kyu; Kwon, Hyuk-Chul

doi:10.1007/3-540-45034-3_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2718)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

Information retrieval systems currently in use do not consider the structural information of documents; they rely instead on index terms extracted from the documents. Structural information such as font face, font size, indentation, and tables conveys the author's meaning and is a primary means of documentation. This paper pays special attention to tables because they are commonly used in documents to make meaning clear and because they are easy to recognize in web documents, which mark them with tags. On the Internet, tables are used both to structure knowledge and to lay out documents. This paper proposes a method for extracting meaningful tables using a decision tree and for constructing a dictionary of table indexes, so that the result can be applied to an information retrieval system to improve its accuracy.
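
As a rough illustration of the idea in the abstract, the sketch below separates "meaningful" tables from layout tables with a small decision tree. It is not the authors' implementation: the abstract does not list the features or the training data, so the four boolean features, the labels, the toy training fragments, and every function name here are assumptions made for the example. The tree is learned with a minimal ID3-style procedure (choosing the split that leaves the lowest entropy), using only the Python standard library.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags seen in one HTML table fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def features(html):
    """Map a table fragment to boolean features (hypothetical choices)."""
    parser = TagCounter()
    parser.feed(html)
    c = parser.counts
    return {
        "has_th": c["th"] > 0,
        "multi_row": c["tr"] >= 2,
        "nested_table": c["table"] > 1,
        "has_img_or_form": c["img"] + c["form"] + c["input"] > 0,
    }


def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def build_tree(rows, names):
    """ID3-style learner; rows are (feature_dict, label) pairs."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not names:
        return majority  # leaf node

    def remainder(name):
        # Weighted entropy left after splitting on this feature.
        parts = [[l for f, l in rows if f[name] == v] for v in (True, False)]
        return sum(len(p) / len(rows) * entropy(p) for p in parts if p)

    best = min(names, key=remainder)  # lowest remainder = highest gain
    rest = [n for n in names if n != best]
    branches = {}
    for value in (True, False):
        subset = [(f, l) for f, l in rows if f[best] == value]
        branches[value] = build_tree(subset, rest) if subset else majority
    return (best, branches)


def classify(tree, feats):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[feats[name]]
    return tree


# Toy, hand-labelled training fragments (invented for this example).
TRAINING = [
    ("<table><tr><th>City</th><th>Hotel</th></tr>"
     "<tr><td>Busan</td><td>A</td></tr></table>", "meaningful"),
    ("<table><tr><th>Year</th></tr><tr><td>2003</td></tr></table>", "meaningful"),
    ("<table><tr><td><img src='logo.gif'></td></tr></table>", "layout"),
    ("<table><tr><td><table><tr><td>menu</td></tr></table></td></tr></table>",
     "layout"),
]

rows = [(features(html), label) for html, label in TRAINING]
tree = build_tree(rows, list(rows[0][0]))
```

A new fragment can then be labelled with `classify(tree, features(fragment))`. With real data, the feature set and the training labels would come from an annotated corpus of web tables rather than from four hand-written fragments.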

This work was partially supported by Korean Science and Engineering Foundation (Contract Number: R01 - 2000 - 00275) and National Research Laboratory Program (Contract Number: M10203000028-02J0000-01510) of KISTEP.

Author information

Authors and Affiliations

AI Lab. Dept. of Computer Science, Pusan National University, San 30, Jang-geon Dong, 609-735, Busan, Korea
Sung-Won Jung, Won-Hee Lee & Hyuk-Chul Kwon
Electronic and Telecommunications Research Institute, 161 Gajeong Dong, Yuseong Gu, 305-350, Daejeon, Korea
Sang-Kyu Park

Editor information

Editors and Affiliations

Dept. of Computer Science, Loughborough University, Loughborough, LE11 3TU, England
Paul W. H. Chung & Chris Hinde
Dept. of Computer Science, Southwest Texas State University, 601 University Drive, San Marcos, TX, 78666, USA
Moonis Ali

About this paper

Cite this paper

Jung, SW., Lee, WH., Park, SK., Kwon, HC. (2003). Extraction of Meaningful Tables from the Internet Using Decision Trees. In: Chung, P.W.H., Hinde, C., Ali, M. (eds) Developments in Applied Artificial Intelligence. IEA/AIE 2003. Lecture Notes in Computer Science, vol 2718. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45034-3_18

DOI: https://doi.org/10.1007/3-540-45034-3_18
Published: 24 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40455-2
Online ISBN: 978-3-540-45034-4
eBook Packages: Springer Book Archive
