DOI: 10.1145/1255175.1255219
Article

FLUX-CIM: flexible unsupervised extraction of citation metadata

Published: 18 June 2007

ABSTRACT

In this paper we propose a knowledge-based approach to help extract the correct components of citations in any given format. Unlike related approaches that rely on manually built knowledge bases (KBs) for recognizing the components of a citation, our approach constructs the KB automatically from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). Our approach does not rely on patterns encoding the specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning method that requires a training phase. These features give our technique a high degree of automation and flexibility. To demonstrate the effectiveness and applicability of the proposed approach, we ran experiments in which we applied it to extract information from citations in papers from two different domains. The results indicate precision and recall levels above 94% and perfect extraction for the large majority of the citations tested.
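
The general idea of a KB built automatically from sample metadata records can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not detail. It is a minimal illustration under stated assumptions: the field names, the toy records, the scoring rule (the share of a term's KB occurrences that fall in a given field), and the helper names `build_kb`, `field_score`, and `label_block` are all hypothetical. For simplicity it also assumes the citation has already been split into blocks.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_kb(records):
    """Build a toy KB: for each field, count the terms seen in that field.

    `records` is a list of dicts mapping a field name to its text value.
    No labeled citations and no training phase are involved.
    """
    kb = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            kb[field].update(tokenize(value))
    return dict(kb)


def field_score(kb, field, terms):
    """Average, over the block's terms, of the share of each term's
    KB occurrences that fall in `field` (an assumed scoring rule)."""
    if not terms:
        return 0.0
    total = 0.0
    for term in terms:
        everywhere = sum(counts[term] for counts in kb.values())
        if everywhere:
            total += kb[field][term] / everywhere
    return total / len(terms)


def label_block(kb, block):
    """Return the best-scoring field for a text block, or None if no
    term of the block occurs anywhere in the KB."""
    terms = tokenize(block)
    scores = {field: field_score(kb, field, terms) for field in kb}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical sample metadata records (illustrative data only).
sample_records = [
    {"author": "J. Silva and M. Costa",
     "title": "Flexible extraction of metadata",
     "venue": "Digital Libraries Conference"},
    {"author": "R. Souza",
     "title": "Wrapper induction for web data",
     "venue": "Information Systems Journal"},
]

kb = build_kb(sample_records)
for block in ["M. Costa", "Wrapper induction", "Information Systems Journal"]:
    print(block, "->", label_block(kb, block))
```

Because the KB is just a set of term counts per field, moving to another subject area in this sketch only requires a different set of sample records; nothing in the code depends on the punctuation or ordering conventions of a particular citation style.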

        • Published in

          JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
          June 2007
          534 pages
          ISBN:9781595936448
          DOI:10.1145/1255175

          Copyright © 2007 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate: 415 of 1,482 submissions, 28%
