Factoring Web Tables

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6703)

Abstract

Automatic interpretation of web tables can enable database-like semantic search over the plethora of information stored in tables on the web. The table interpretation method presented here converts the two-dimensional hierarchy of table headers, which provides a visual means of assimilating complex data, into a set of strings that is more amenable to algorithmic analysis of table structure. We show that Header Paths, a new purely syntactic representation of visual tables, can be readily transformed (“factored”) into several existing representations of structured data, including category trees and relational tables. Detailed examination of over 100 tables reveals which table features require further work.
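The abstract does not spell out how Header Paths are encoded or factored, so the following is only a minimal sketch under stated assumptions. It assumes each column header path is a tuple of labels read from the top-level header down to the leaf header, and it treats "factoring" as grouping paths by shared prefix, in the spirit of rewriting a sum of products such as a*b + a*c as a*(b+c). The function names `factor_paths` and `to_expression`, the sample labels, and the `*` and `+` notation are hypothetical and are not taken from the paper.

```python
def factor_paths(paths):
    """Merge header paths that share a prefix into a nested dict (a category tree)."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            # Reuse the child for this label if it exists, otherwise create it.
            node = node.setdefault(label, {})
    return tree


def to_expression(tree):
    """Render the category tree as a factored sum-of-products string."""
    terms = []
    for label, children in tree.items():
        if children:
            terms.append(f"{label}*({to_expression(children)})")
        else:
            terms.append(label)
    return "+".join(terms)


# Hypothetical column header paths for a table with a two-level header
# nested under a single root label (illustrative data only).
column_paths = [
    ("Year", "2009", "Q1"),
    ("Year", "2009", "Q2"),
    ("Year", "2010", "Q1"),
    ("Year", "2010", "Q2"),
]

tree = factor_paths(column_paths)
expression = to_expression(tree)
print(expression)
```

Under these assumptions, the nested dict plays the role of a category tree, and the same list of paths could instead be emitted one tuple per row to approximate a relational view of the header.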

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Embley, D.W., Krishnamoorthy, M., Nagy, G., Seth, S. (2011). Factoring Web Tables. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds) Modern Approaches in Applied Intelligence. IEA/AIE 2011. Lecture Notes in Computer Science, vol 6703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21822-4_26

  • DOI: https://doi.org/10.1007/978-3-642-21822-4_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21821-7

  • Online ISBN: 978-3-642-21822-4

  • eBook Packages: Computer Science, Computer Science (R0)
