Design of an end-to-end method to extract information from tables

  • Original Paper
  • Published in: International Journal of Document Analysis and Recognition (IJDAR)

Abstract

This paper designs an end-to-end method for extracting information from tables embedded in documents. The input format is ASCII, to which any richer format can be converted while preserving all of the textual information and much of the layout information. We start by defining what a table is. We then describe the steps involved in extracting information from tables and analyse table-related research in order to situate the contributions of different authors, trace the paths research is following, and identify issues that remain unsolved. Next, we analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells, columns, and rows. We proceed to design our own end-to-end method, in which the different steps interact more closely; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally, we explore how interpreting the table not only allows the accuracy of the overall extraction process to be inferred but also helps improve its quality. To do so, we believe interpretation has to consider context-specific knowledge; we explore how this knowledge can be added in a plug-in/plug-out manner, so that the overall method remains operable in different contexts.

Author information

Corresponding author

Correspondence to Ana Costa e Silva.

Additional information

The opinions expressed in this article are the responsibility of the authors and do not necessarily reflect those of Banco de Portugal.

About this article

Cite this article

e Silva, A.C., Jorge, A.M. & Torgo, L. Design of an end-to-end method to extract information from tables. IJDAR 8, 144–171 (2006). https://doi.org/10.1007/s10032-005-0001-x

  • DOI: https://doi.org/10.1007/s10032-005-0001-x
