Identifying Web Tables: Supporting a Neglected Type of Content on the Web

Galkin, Mikhail; Mouromtsev, Dmitry; Auer, Sören

doi:10.1007/978-3-319-24543-0_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 518)

Included in the following conference series:

International Conference on Knowledge Engineering and the Semantic Web

Abstract

The abundance of data on the Internet drives the improvement of extraction and processing tools. The trend toward open data publishing encourages the adoption of structured formats such as CSV and RDF. However, a plethora of unstructured data remains on the Web, and we assume that it contains semantics. For this reason, we propose an approach to deriving semantics from web tables, which are still the most popular publishing tool on the Web. The paper also discusses methods and services for extracting and processing unstructured data, as well as machine learning techniques that enhance such a workflow. The result is a framework to process, publish, and visualize linked open data. The software extracts tables from various open data sources in HTML format and automatically exports them to RDF, making the data linked. The paper also evaluates machine learning techniques, used in conjunction with string similarity functions, for the table recognition task.
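
To make the HTML-to-RDF step concrete, the following is a minimal sketch and not the authors' framework. It assumes a flat (non-nested), well-formed table whose first row is the header, and it uses only the Python standard library. The names `TableExtractor` and `table_to_ntriples`, the base URI `http://example.org/`, and the sample table are hypothetical choices made for illustration; the table recognition step evaluated in the paper (machine learning with string similarity functions) is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in a document, row by row.

    Assumption: tables are flat (not nested) and the markup is well formed.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_ntriples(table, base="http://example.org/"):
    """Emit one N-Triples statement per data cell.

    Assumption: the first row holds column names; `base` is a placeholder URI.
    """
    header, *rows = table
    triples = []
    for i, row in enumerate(rows, start=1):
        subject = f"<{base}row/{i}>"
        for name, value in zip(header, row):
            predicate = f"<{base}prop/{quote(name)}>"
            # Escape backslashes, quotes, and newlines inside the literal.
            literal = (value.replace("\\", "\\\\")
                            .replace('"', '\\"')
                            .replace("\n", "\\n"))
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Hypothetical sample input.
sample = ("<table><tr><th>City</th><th>Country</th></tr>"
          "<tr><td>Bonn</td><td>Germany</td></tr></table>")
extractor = TableExtractor()
extractor.feed(sample)
for line in table_to_ntriples(extractor.tables[0]):
    print(line)
```

Every cell is emitted as a plain string literal. Linking values to existing resources or typing them would need an additional mapping step that this sketch leaves out.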

Author information

Authors and Affiliations

University of Bonn, Bonn, Germany
Mikhail Galkin & Sören Auer
ITMO University, Saint Petersburg, Russia
Mikhail Galkin & Dmitry Mouromtsev

Corresponding authors

Correspondence to Dmitry Mouromtsev or Sören Auer.

Editor information

Editors and Affiliations

Complexible Inc, Washington, District of Columbia, USA
Pavel Klinov
ITMO University, St. Petersburg, Russia
Dmitry Mouromtsev

About this paper

Cite this paper

Galkin, M., Mouromtsev, D., Auer, S. (2015). Identifying Web Tables: Supporting a Neglected Type of Content on the Web. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and Semantic Web. KESW 2015. Communications in Computer and Information Science, vol 518. Springer, Cham. https://doi.org/10.1007/978-3-319-24543-0_4

DOI: https://doi.org/10.1007/978-3-319-24543-0_4
Published: 30 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24542-3
Online ISBN: 978-3-319-24543-0
eBook Packages: Computer Science, Computer Science (R0)
