Abstract
Research data and publications are usually stored in separate and structurally distinct information systems. Often, links between these resources are not explicitly available which complicates the search for previous research. In this paper, we propose a pattern induction method for the detection of study references in full texts. Since these references are not specified in a standardized way and may occur inside a variety of different contexts – i.e., captions, footnotes, or continuous text – our algorithm is required to induce very flexible patterns. To overcome the sparse distribution of training instances, we induce patterns iteratively using a bootstrapping approach. We show that our method achieves promising results for the automatic identification of data references and is a first step towards building an integrated information system.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Afzal, M.T., Maurer, H., Balke, W.T., Kulathuramaiyer, N.: Rule based autonomous citation mining with tierl. Journal of Digital Information Management 8(3), 196–204 (2010)
Councill, I.G., Giles, C.L., Kan, M.Y.: Parscit: An open-source crf reference string parsing package. In: Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association (2008)
Gipp, B., Beel, J.: Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis. In: Proceedings of the 12th International Conference on Scientometrics and Informetrics, vol. 2, pp. 571–575 (2009)
Kubala, F., Schwartz, R., Stone, R., Weischedel, R.: Named entity extraction from speech. In: DARPA Workshop on Broadcast News Understanding Systems, pp. 287–292 (1998)
Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence 165, 91–134 (2005)
Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics, pp. 539–545. Association for Computational Linguistics, Stroudsburg (1992)
Berland, M., Charniak, E.: Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 57–64. Association for Computational Linguistics, Stroudsburg (1999)
Pennacchiotti, M., Pantel, P.: A bootstrapping algorithm for automatically harvesting semantic relations. In: Proceedings of the Inference in Computational Semantics, pp. 87–96 (2006)
Meusel, R., Niepert, M., Eckert, K., Stuckenschmidt, H.: Thesaurus Extension Using Web Search Engines. In: Chowdhury, G., Koo, C., Hunter, J. (eds.) ICADL 2010. LNCS, vol. 6102, pp. 198–207. Springer, Heidelberg (2010)
Thelen, M., Riloff, E.: A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In: Proceedings of the 2002 Conference on Empirical Methods in NLP, pp. 214–221. Association for Computational Linguistics, Stroudsburg (2002)
Xu, R., Morgan, A., Das, A.K., Garber, A.: Investigation of unsupervised pattern learning techniques for bootstrap construction of a medical treatment lexicon. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, BioNLP 2009, pp. 63–70. Association for Computational Linguistics, Stroudsburg (2009)
Altman, M., King, G.: A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine 13(3) (2007)
Green, T.: We need publishing standards for datasets and data tables. OECD Publishing White Paper (2009)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boland, K., Ritze, D., Eckert, K., Mathiak, B. (2012). Identifying References to Datasets in Publications. In: Zaphiris, P., Buchanan, G., Rasmussen, E., Loizides, F. (eds) Theory and Practice of Digital Libraries. TPDL 2012. Lecture Notes in Computer Science, vol 7489. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33290-6_17
Download citation
DOI: https://doi.org/10.1007/978-3-642-33290-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33289-0
Online ISBN: 978-3-642-33290-6
eBook Packages: Computer ScienceComputer Science (R0)