Automatic Pragmatic Text Segmentation of Historical Letters

Hendrickx, Iris; Généreux, Michel; Marquilhas, Rita

doi:10.1007/978-3-642-20227-8_8

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

In this investigation we aim to reduce the manual workload of pragmatic research by automatically processing a corpus of historical letters. We focus on two consecutive subtasks. The first is automatic text segmentation of the letters into formal and informal parts using a statistical n-gram based technique. The second is semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity, which stems from the small size of the data set and is aggravated by the spelling variation present in the historical letters. We try to address the latter problem with a text normalization step that combines dictionary lookup and edit distance. We achieve a micro-averaged F-score of 86% for the text segmentation task and 66.3% for the semantic labeling task. Even though these scores are not high enough to completely replace manual annotation with automatic annotation, our results are promising and demonstrate that an automatic approach based on such a small data set is feasible.
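
The normalization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_distance` threshold, the lowercasing, and the three-word toy lexicon are all assumptions made for illustration. The sketch first tries an exact dictionary lookup and otherwise falls back to the closest lexicon entry by Levenshtein edit distance.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(token: str, lexicon: Iterable[str], max_distance: int = 1) -> str:
    """Map a spelling variant to a lexicon entry (hypothetical helper).

    Exact lookup comes first; otherwise the nearest entry within
    max_distance edits is returned. Unknown tokens are left unchanged.
    """
    entries = set(lexicon)
    key = token.lower()
    if key in entries:
        return key
    best, best_dist = None, max_distance + 1
    for entry in sorted(entries):  # sorted for deterministic tie-breaking
        dist = edit_distance(key, entry)
        if dist < best_dist:
            best, best_dist = entry, dist
    return best if best is not None else token


# Toy lexicon, assumed for illustration only.
toy_lexicon = {"vossa", "senhor", "merce"}
print(normalize("vosa", toy_lexicon))
print(normalize("xyz", toy_lexicon))
```

Keeping the threshold small limits false matches, at the cost of leaving more distant variants unnormalized; a real system would likely tune it against annotated data.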

Acknowledgements

We would like to thank Mariana Gomes, Ana Rita Guilherme and Leonor Tavares for the manual annotation. We are grateful to João Paulo Silvestre for sharing his electronic version of the Bluteau Dictionary and frequency counts. This work is funded by the Portuguese Science Foundation, FCT (Fundação para a Ciência e a Tecnologia).

Author information

Authors and Affiliations

Centro de Linguística da Universidade de Lisboa, Av. Prof. Gama Pinto, 2, 1649-003, Lisboa, Portugal
Iris Hendrickx, Michel Généreux & Rita Marquilhas

Corresponding author

Correspondence to Iris Hendrickx.

Editor information

Editors and Affiliations

Computational Linguistics / MMCI, Saarland University, Saarbrücken, 66041, Germany
Caroline Sporleder
Fac. Humanities, Tilburg University, Tilburg, Netherlands
Antal van den Bosch
Tilburg School for Humanities, Tilburg Center for Cognition and Communi, University of Tilburg, Tilburg, 5000, Netherlands
Kalliopi Zervanou

About this paper

Cite this paper

Hendrickx, I., Généreux, M., Marquilhas, R. (2011). Automatic Pragmatic Text Segmentation of Historical Letters. In: Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20227-8_8

DOI: https://doi.org/10.1007/978-3-642-20227-8_8
Published: 26 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20226-1
Online ISBN: 978-3-642-20227-8
eBook Packages: Computer Science, Computer Science (R0)
