Article

FLUX-CIM: flexible unsupervised extraction of citation metadata

Authors:
Eli Cortez, Universidade Federal do Amazonas, Manaus, Brazil
Altigran S. da Silva, Universidade Federal do Amazonas, Manaus, Brazil
Marcos André Gonçalves, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Filipe Mesquita, Universidade Federal do Amazonas, Manaus, Brazil
Edleno S. de Moura, Universidade Federal do Amazonas, Manaus, Brazil

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 2007, Pages 215–224
https://doi.org/10.1145/1255175.1255219

Published: 18 June 2007

ABSTRACT

In this paper we propose a knowledge-base approach to help extract the correct components of citations in any given format. Unlike related approaches that rely on manually built knowledge bases (KBs) for recognizing the components of a citation, our KB is automatically constructed from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). Our approach does not rely on patterns encoding the specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning method that requires a training phase. These features give our technique a high degree of automation and flexibility. To demonstrate the effectiveness and applicability of the proposed approach, we ran experiments in which we applied it to extract information from citations in papers from two different domains. The results indicate precision and recall levels above 94% and perfect extraction for the large majority of the citations tested.
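
The general idea in the abstract (a KB built automatically from sample metadata records, then used to label the parts of a citation without style-specific patterns or a training phase) can be illustrated with a toy sketch. This is not the authors' algorithm: the sample records, the generic punctuation-based blocking, the term-overlap score, and all function names below are assumptions made only for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_kb(records):
    """Count, per field, how often each term occurs in the sample records."""
    kb = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            kb[field].update(tokenize(value))
    return kb

def score(block, field_terms):
    """Fraction of the block's terms that the KB has seen in this field."""
    terms = tokenize(block)
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in field_terms) / len(terms)

def label_blocks(citation, kb):
    """Split a citation into blocks and give each the best-scoring field.

    Blocks are cut at any '.', ',' or ';' that follows at least two word
    characters (so initials such as 'J.' stay attached). This generic rule
    is an assumption of the sketch, not a citation-style pattern.
    """
    pieces = re.split(r"(?<=\w\w)[.,;]\s+", citation)
    blocks = [p.strip(" .,;") for p in pieces if p.strip(" .,;")]
    labeled = []
    for block in blocks:
        best = max(kb, key=lambda f: score(block, kb[f]))
        if score(block, kb[best]) == 0.0:
            best = "unknown"  # no KB evidence for any field
        labeled.append((best, block))
    return labeled

# Invented sample metadata records (hypothetical, for illustration only).
SAMPLE_RECORDS = [
    {"author": "J. Doe",
     "title": "Automatic citation indexing for digital libraries",
     "venue": "Example Journal", "year": "1999"},
    {"author": "R. M. Roe and A. Poe",
     "title": "Unsupervised methods and metadata extraction",
     "venue": "Toy Workshop", "year": "2000"},
]

kb = build_kb(SAMPLE_RECORDS)
citation = ("A. Poe and J. Doe. Unsupervised citation extraction "
            "for digital libraries. Toy Workshop, 1999.")
for field, block in label_blocks(citation, kb):
    print(f"{field:8s} {block}")
```

Because the labels come from term overlap with the KB rather than from field positions, the same function also handles a reordered citation such as "1999; Toy Workshop; J. Doe". A block that shares no term with the KB is labeled "unknown" in this sketch.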


Published in
JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
June 2007
534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175
General Chair: Edie Rasmussen, University of British Columbia, Canada
Program Chairs: Ray R. Larson, University of California, Berkeley; Elaine Toms, Dalhousie University, Canada; Shigeo Sugimoto, University of Tsukuba, Japan
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
citation management
metadata extraction
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions, 28%