Abstract
In this investigation we aim to reduce the manual workload of pragmatic research by automatically processing a corpus of historical letters. We focus on two consecutive subtasks. The first is automatic text segmentation of the letters into formal and informal parts, using a statistical n-gram-based technique. The second is semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity, caused by the small size of the data set and aggravated by the spelling variation present in the historical letters. We try to address the latter problem with a text normalization step based on dictionary lookup and edit distance. We achieve a micro-averaged F-score of 86% on the text segmentation task and 66.3% on the semantic labeling task. Even though these scores are not high enough to replace manual annotation completely with automatic annotation, our results are promising and demonstrate that an automatic approach based on such a small data set is feasible.
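The normalization idea named in the abstract (dictionary lookup followed by an edit distance fallback) can be illustrated with a minimal sketch. The abstract does not describe the lexicon, the distance threshold, or how ties are resolved, so the function names `levenshtein` and `normalize`, the `max_distance` parameter, the alphabetical tie-break, and the toy lexicon are all assumptions made for illustration and should not be read as the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalize(token: str, lexicon: set, max_distance: int = 1) -> str:
    """Return a lexicon form for `token`, or `token` itself if none is close.

    Assumed policy: an exact dictionary hit wins; otherwise the nearest
    lexicon entry is used if it lies within `max_distance` edits, with
    ties broken alphabetically so the result is deterministic.
    """
    if token in lexicon or not lexicon:
        return token
    best = min(lexicon, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical toy lexicon of modern spellings, for illustration only.
toy_lexicon = {"muito", "senhor", "carta"}
normalized = [normalize(t, toy_lexicon) for t in ["muyto", "senhor", "xyz"]]
```

Mapping spelling variants onto a single form in this way reduces the number of distinct word types, which is the effect the abstract relies on to counter data sparsity. A small threshold keeps unrelated words from being merged, at the cost of leaving more distant variants untouched.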
Acknowledgements
We would like to thank Mariana Gomes, Ana Rita Guilherme and Leonor Tavares for the manual annotation. We are grateful to João Paulo Silvestre for sharing his electronic version of the Bluteau Dictionary and frequency counts. This work is funded by the Portuguese Science Foundation, FCT (Fundação para a Ciência e a Tecnologia).
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hendrickx, I., Généreux, M., Marquilhas, R. (2011). Automatic Pragmatic Text Segmentation of Historical Letters. In: Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20227-8_8
DOI: https://doi.org/10.1007/978-3-642-20227-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20226-1
Online ISBN: 978-3-642-20227-8
eBook Packages: Computer Science, Computer Science (R0)