ABSTRACT
In this paper we propose a knowledge-base approach to help extract the correct components of citations in any given format. Unlike related approaches that rely on manually built knowledge bases (KBs) to recognize the components of a citation, our approach constructs the KB automatically from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). It does not rely on patterns that encode the specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning method that requires a training phase. These features give our technique a high degree of automation and flexibility. To demonstrate the effectiveness and applicability of the approach, we ran experiments in which we applied it to extract information from citations in papers from two different domains. The results indicate precision and recall levels above 94% and perfect extraction for the large majority of the citations tested.
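The general idea of building a KB from sample metadata records and using it to label citation components can be illustrated with a minimal sketch. This is not the algorithm from the paper: the function names (`build_kb`, `label_block`), the term-overlap score, and the toy records are all assumptions made for illustration, and the sketch expects the citation to be split into blocks already.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_kb(records):
    """Build a toy KB: for each field, count the terms seen in sample records.

    No manual rules and no training phase: the KB is only a tally of
    which terms occurred in which metadata field.
    """
    kb = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            kb[field].update(tokenize(value))
    return kb


def label_block(block, kb):
    """Return the field whose known terms cover the largest share of the block.

    Hypothetical scoring rule (an assumption, not the paper's): the
    fraction of the block's terms that appear in the field's term counts.
    Returns None when no field knows any term of the block.
    """
    terms = tokenize(block)
    if not terms:
        return None
    scores = {
        field: sum(1 for t in terms if t in counts) / len(terms)
        for field, counts in kb.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy sample records (invented for this sketch).
records = [
    {"author": "Ana Souza", "title": "Wrapper induction for web pages",
     "venue": "Information Systems"},
    {"author": "Li Chen", "title": "Mining records with tree matching",
     "venue": "Machine Learning"},
]
kb = build_kb(records)

for block in ["Li Chen", "Wrapper induction", "Machine Learning"]:
    print(block, "->", label_block(block, kb))
```

Because the labels come from term overlap with the KB rather than from punctuation or ordering, the same KB can be applied to blocks from differently styled citations, which is the kind of flexibility the abstract describes.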
Index Terms
- FLUX-CIM: flexible unsupervised extraction of citation metadata