Abstract
Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from experiments on the recognition of document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor-quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to that of hand-coded systems, and much worse performance on the legal documents.
Cite this article
Staelin, C., Elad, M., Greig, D. et al. Biblio: automatic meta-data extraction. IJDAR 10, 113–126 (2007). https://doi.org/10.1007/s10032-006-0032-y
DOI: https://doi.org/10.1007/s10032-006-0032-y