On the applicability of word sense discrimination on 201 years of modern english

Tahmasebi, Nina; Niklas, Kai; Zenz, Gideon; Risse, Thomas

doi:10.1007/s00799-013-0105-8

Published: 16 March 2013

Volume 13, pages 135–153, (2013)

International Journal on Digital Libraries

Abstract

As language evolves over time, documents stored in long-term archives become inaccessible to users. Automatically detecting and handling language evolution will become a necessity to meet users' information needs. In this paper, we investigate the performance of modern tools and algorithms applied on modern English to find word senses that will later serve as a basis for finding evolution. We apply the curvature clustering algorithm on all nouns and noun phrases extracted from The Times Archive (1785–1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correct OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves: the number of clusters increases by 24 %. To verify our results, we use the New York Times corpus (1987–2007), a recent collection that is considered error free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied on The Times Archive is comparable to the ground truth.

Acknowledgments

We would like to thank Times Newspapers Limited for providing the archive of The Times, London for our research. A special thanks to Gertrud Erbach for her valuable contributions. This work is partly funded by the European Commission under LiWA (IST 216267) and ARCOMEM (IST 270239).

Author information

Authors and Affiliations

L3S Research Center, Hanover, Germany
Nina Tahmasebi, Kai Niklas, Gideon Zenz & Thomas Risse

Corresponding author

Correspondence to Nina Tahmasebi.

Appendices

Appendix A: Calculating BigramBoost

When calculating the final score (see Formula 3, Sect. 5.6) of a correction proposal for a given term $w$, we take the context of the term into consideration. This is done using the variable $bigramBoost$.

First, we form two bigrams: the left bigram consists of the term $w$ and the term preceding it in the text, and the right bigram consists of $w$ and the term following it. We permute the bigrams using the correction proposals for the terms in the left and the right bigrams and add up their frequencies to find $bigramBoost$ for $w$.

Let $w_i$ be the $i$-th word of the text. Furthermore, let $c_j^i$ be the $j$-th correction proposal of the word $w_i$ and $|c^i|$ the number of correction proposals corresponding to $w_i$. Then there are two ways to form a bigram including the word $w_i$:

1. The left bigrams $b^{i-}_{jk}$ are formed using the proposals of the words $w_{i-1}$ and $w_i$.
2. The right bigrams $b^{i+}_{jk}$ are formed using the proposals of the words $w_i$ and $w_{i+1}$.

The bigrams are defined as follows, using ␣ as white-space:

1. $b^{i-}_{jk} = \{ c_{j}^{i-1} \circ \textvisiblespace \circ c_{k}^{i}, \ 1 \le j \le \min (5, |c^{i-1}|), \ 1 \le k \le \min (5, |c^{i}|)\}$
2. $b^{i+}_{jk} = \{c_{k}^{i} \circ \textvisiblespace \circ c_{j}^{i+1}, \ 1 \le k \le \min (5, |c^{i}|), \ 1 \le j \le \min (5, |c^{i+1}|)\}$
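
The two definitions can be illustrated with a minimal Python sketch. The function names `left_bigrams` and `right_bigrams` and the constant `MAX_PROPOSALS` are hypothetical and chosen only for this illustration; the sketch assumes that each proposal list is already ordered by rank, so that the first five entries correspond to the bounds $\min(5, |c|)$.

```python
from itertools import product

# Cap taken from the min(5, |c|) bounds in the definitions.
MAX_PROPOSALS = 5


def left_bigrams(prev_proposals, cur_proposals):
    """Map (j, k) to the bigram c_j^{i-1} + ' ' + c_k^i (1-based indices)."""
    pairs = product(
        enumerate(prev_proposals[:MAX_PROPOSALS], start=1),
        enumerate(cur_proposals[:MAX_PROPOSALS], start=1),
    )
    return {(j, k): f"{prev} {cur}" for (j, prev), (k, cur) in pairs}


def right_bigrams(cur_proposals, next_proposals):
    """Map (j, k) to the bigram c_k^i + ' ' + c_j^{i+1} (1-based indices)."""
    pairs = product(
        enumerate(next_proposals[:MAX_PROPOSALS], start=1),
        enumerate(cur_proposals[:MAX_PROPOSALS], start=1),
    )
    return {(j, k): f"{cur} {nxt}" for (j, nxt), (k, cur) in pairs}
```

In both dictionaries the index $k$ refers to the proposal of the current word $w_i$ and $j$ to the proposal of its neighbour, mirroring the subscripts of $b^{i-}_{jk}$ and $b^{i+}_{jk}$.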

To respect the context, bigrams are only generated if they are not blocked by punctuation marks. If $w_{i-1}$ ends with a punctuation mark, the left bigrams do not contribute to the final score. Analogously, if $w_{i}$ ends with a punctuation mark, the right bigrams do not contribute. We consider all non-digit and non-alphabetic characters as punctuation marks.
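
The blocking rule can be sketched as follows. The helper names `ends_with_punctuation` and `allowed_sides` are hypothetical, and treating an empty token as not ending in punctuation is an assumption of this sketch, not something stated in the text.

```python
def ends_with_punctuation(token):
    """True if the last character is neither alphabetic nor a digit."""
    if not token:  # assumption: an empty token blocks nothing
        return False
    last = token[-1]
    return not (last.isalpha() or last.isdigit())


def allowed_sides(prev_token, cur_token):
    """Return (left_allowed, right_allowed) for the word cur_token.

    Left bigrams are blocked when the previous token ends with punctuation;
    right bigrams are blocked when the current token does.
    """
    return (
        not ends_with_punctuation(prev_token),
        not ends_with_punctuation(cur_token),
    )
```

For example, for the token sequence `end.` followed by `The`, only the right bigrams of `The` would contribute.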

Once the bigrams are created, they are queried against the Anagram Hash to retrieve their occurrence count. The frequency of the bigram $b$ in the Anagram Hash is denoted as $f(b)$. The resulting $bigramBoost^i_k$ for the correction proposal $c^i_k$ corresponding to the word $w_i$, is computed as follows:

$$\begin{aligned} bigramBoost^i_k&= \sum ^{\min (5, |c^{i-1}|)}_{j=1}{f(b^{i-}_{jk})} + \sum ^{\min (5, |c^{i+1}|)}_{j=1}{f(b^{i+}_{jk})}, \\&1 \le k \le \min (5, |c^i|) \end{aligned}$$
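
As a hedged illustration of this sum, the following Python sketch computes the boost for each of the top proposals of $w_i$. The function name `bigram_boost` is hypothetical, and the plain dictionary `freq` is only a stand-in for a lookup in the Anagram Hash, whose actual interface is not described here. A side that is blocked by punctuation can be represented by passing an empty proposal list for that neighbour.

```python
def bigram_boost(prev_proposals, cur_proposals, next_proposals, freq, top=5):
    """Return [bigramBoost_k for k = 1 .. min(top, len(cur_proposals))].

    freq maps a bigram string "a b" to its occurrence count; bigrams that
    are missing from the mapping count as frequency 0.
    """
    prev_top = prev_proposals[:top]
    next_top = next_proposals[:top]
    boosts = []
    for cur in cur_proposals[:top]:
        # Sum of f(b^{i-}_{jk}) over the proposals of the previous word.
        left = sum(freq.get(f"{prev} {cur}", 0) for prev in prev_top)
        # Sum of f(b^{i+}_{jk}) over the proposals of the next word.
        right = sum(freq.get(f"{cur} {nxt}", 0) for nxt in next_top)
        boosts.append(left + right)
    return boosts


# Toy frequencies, invented purely for illustration.
toy_freq = {"the cat": 10, "tho cat": 1, "cat sat": 4, "the cot": 2}
print(bigram_boost(["the", "tho"], ["cat", "cot"], ["sat"], toy_freq))
```

With the toy counts, the proposal `cat` receives 10 + 1 from its left bigrams and 4 from its right bigram, while `cot` is supported only by the left bigram `the cot`.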

The $bigramBoost$ value is used for computing the final score of a correction proposal for a term $w$ according to Formula 3.

When the final proposal list is obtained, the highest ranking term is chosen as an automatic replacement for the target term. The other terms in the list can be used for a semi-automatic correction strategy.

Appendix B: Example clusters

We present some sample clusters from The Times Archive as well as the NYT corpus. Because of repetitions, the clusters shown in this appendix are sampled from all clusters mentioning each term, and only a limited number of terms is shown for each cluster. In both cluster sets, we find that the number of terms in each cluster increases over time. Note that the clusters displayed here do not follow the evolution of each term as a whole, but only as it was mentioned in The Times Archive.

In Table 3, we see clusters for the term flight. The displayed clusters show that flight has several senses, which are mostly grouped together. Between 1826 and 1832 there are six clusters (only two of them displayed here) that all refer to the company Flight & Robson, which built church (finger) organs. Three decades later, between 1867 and 1894, there are five clusters that all refer to hurdle races. Between 1938 and 1957 the clusters refer to cricket, with the terms in the clusters referring to the ball. Starting from 1973, the clusters correspond to the modern sense of flight as a means of travel, especially for holidays. The introduction of terms such as pocket money, visa and accommodation differentiates the later clusters from the earlier ones.

Table 3 Selected clusters and cluster members for the term flight from The Times Archive after error correction

In Table 4, we see some selected clusters for the term computer. The first clusters reveal the computer as a tool for work, with terms like spreadsheet, database, printer and language translator. Over time the clusters reveal the computer as an everyday tool for entertainment, with terms like game, home shopping, commercial, movie and communication. We can also find terms that are now much less frequently used, like cdrom, vcr and qvc, and infer their meaning and context from the surrounding terms.

Table 4 Selected clusters and cluster members for the term computer from the NYT collection

In Table 5, we see some selected clusters corresponding to the term travel. We can see that the concept of travel changes over time. In the 19th century, it referred primarily to books and was not an everyday activity for ordinary people. In the early 20th century the concept changes and travel becomes more common. With the introduction of terms like sightseeing, full board, good hotel and fishing, as well as travel destinations, the concept of travel clearly becomes more concrete rather than something only available through books.

Table 5 Selected clusters and cluster members for the term travel from The Times Archive after correction

A similar shift in concept can also be seen in clusters concerning travellers. In Table 6, we see that the type of people who travelled changes. The first two clusters containing the term yellow admiral refer to the classic “The Wags, or the Camp of Pleasure” by Charles Dibdin. As with the senses of travel, the traveller transforms from being a salesman, clerk or merchant to being more concrete, with terms like visa, passport, ticket and commuter. In all our clusters businessmen seem to be highly represented.

Table 6 Selected clusters and cluster members for the term traveller from The Times Archive after correction

About this article

Cite this article

Tahmasebi, N., Niklas, K., Zenz, G. et al. On the applicability of word sense discrimination on 201 years of modern english. Int J Digit Libr 13, 135–153 (2013). https://doi.org/10.1007/s00799-013-0105-8

Received: 02 November 2011
Revised: 14 February 2013
Accepted: 19 February 2013
Published: 16 March 2013
Issue Date: September 2013
DOI: https://doi.org/10.1007/s00799-013-0105-8
