Biblio: automatic meta-data extraction

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We report results from experiments on recognizing document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor-quality scans of fax-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to that of hand-coded systems, and much worse performance on the legal documents.

Author information

Corresponding author

Correspondence to Carl Staelin.

About this article

Cite this article

Staelin, C., Elad, M., Greig, D. et al. Biblio: automatic meta-data extraction. IJDAR 10, 113–126 (2007). https://doi.org/10.1007/s10032-006-0032-y

  • DOI: https://doi.org/10.1007/s10032-006-0032-y