Creating an Annotated Corpus for Extracting Canonical Citations from Classics-Related Texts by Using Active Annotation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

This paper describes the creation of an annotated corpus that supports the task of extracting information, particularly canonical citations (that is, references to ancient sources), from Classics-related texts. The corpus is multilingual and contains approximately 30,000 tokens of POS-tagged, cleanly transcribed text drawn from L’Année Philologique. The named entities needed to capture such citations were annotated in the corpus using an annotation scheme devised specifically for this task.

The contribution of the paper is twofold. First, it describes how the corpus was created using Active Annotation, an approach that combines automatic and manual annotation to reduce the human effort required to create a corpus. Second, it evaluates the performance of an NER classifier based on Conditional Random Fields, using the created corpus as both training and test set; the results obtained with three different feature sets are compared and discussed.
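As a rough illustration of how automatic and manual annotation can be combined, the sketch below lets a tagger label a pool of unannotated sentences and routes the least confident ones to a human annotator. The abstract does not state the selection criterion used in the paper, so the min-token-probability score, the `toy_tagger` stand-in for a CRF, and the labels `REF` and `AUTHOR` are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]
# A tagger returns one (label, probability) pair per token.
Tagger = Callable[[Sentence], List[Tuple[str, float]]]


def sentence_confidence(tagged: Sequence[Tuple[str, float]]) -> float:
    """Score a sentence by the probability of its least certain token."""
    return min((prob for _, prob in tagged), default=1.0)


def select_for_annotation(
    pool: List[Sentence], tagger: Tagger, batch_size: int
) -> List[Sentence]:
    """Return the batch_size sentences the tagger is least confident about."""
    scored = [(sentence_confidence(tagger(s)), i) for i, s in enumerate(pool)]
    scored.sort()  # lowest confidence first; ties broken by pool order
    return [pool[i] for _, i in scored[:batch_size]]


def toy_tagger(sentence: Sentence) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for a trained CRF (labels are illustrative)."""
    tagged = []
    for token in sentence:
        if any(ch.isdigit() for ch in token):
            tagged.append(("REF", 0.55))     # numeric scope, e.g. "1.1"
        elif token.endswith("."):
            tagged.append(("AUTHOR", 0.70))  # abbreviation, e.g. "Hom."
        else:
            tagged.append(("O", 0.95))       # ordinary word
    return tagged


pool = [
    ["See", "Hom.", "Il.", "1.1"],
    ["The", "poem", "is", "long"],
    ["Cf.", "Thucydides"],
]
selected = select_for_annotation(pool, toy_tagger, batch_size=2)
for sentence in selected:
    print(" ".join(sentence))
```

In a full loop, the selected sentences would be corrected by hand, added to the training set, and the classifier retrained before the next selection round.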

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Romanello, M. (2013). Creating an Annotated Corpus for Extracting Canonical Citations from Classics-Related Texts by Using Active Annotation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_6

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science, Computer Science (R0)
