Biblio: automatic meta-data extraction

Staelin, Carl; Elad, Michael; Greig, Darryl; Shmueli, Oded; Vans, Marie

doi:10.1007/s10032-006-0032-y

Abstract

Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from experiments on the recognition of document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor-quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.

Author information

Authors and Affiliations

Hewlett-Packard Laboratories, Technion City, Haifa, 32000, Israel
Carl Staelin & Marie Vans
Hewlett-Packard Laboratories, Filton Road, Bristol, Stoke Gifford, BS34 8QZ, UK
Darryl Greig
The Computer Science Department, Israel Institute of Technology, 516 Taub building, Haifa, 32000, Israel
Michael Elad & Oded Shmueli

Corresponding author

Correspondence to Carl Staelin.

About this article

Cite this article

Staelin, C., Elad, M., Greig, D. et al. Biblio: automatic meta-data extraction. IJDAR 10, 113–126 (2007). https://doi.org/10.1007/s10032-006-0032-y

Received: 17 November 2004
Revised: 04 May 2006
Accepted: 18 September 2006
Published: 07 November 2006
Issue Date: November 2007
DOI: https://doi.org/10.1007/s10032-006-0032-y
