ABSTRACT
Figures in digital documents contain important information, yet current digital libraries do not summarize or index that information for document retrieval. We present a system that automatically categorizes figures and extracts data from 2-D plots. A machine-learning method categorizes figures into a set of predefined types based on image features, and an automated algorithm extracts data values from solid-line curves in 2-D plots. The semantic type of each figure and the data values extracted from 2-D plots can be integrated with the textual information in a document to provide more effective retrieval services for digital library users. Experimental evaluation indicates that the system can produce results suitable for real-world use.
Index Terms
- Deriving knowledge from figures for digital libraries