Abstract
This paper reports the preliminary results of a large-scale experiment on the extraction of PUs (phraseological units, also called idioms) from large web corpora in four languages (English, Spanish, French and Chinese). Three components, namely a new algorithm based on metric clustering techniques, optimized database storage, and interaction with users and researchers through a web application, made it possible to reach high precision scores for the most common PUs in the four languages. Further experimentation is still needed to establish recall levels for long n-grams. In the meantime, the freely accessible web application makes it possible to visualize the high proportion of phraseology in the broad sense (or of formulaic language): PUs made up about 30 to 60% of the text of the newspaper articles tested in the experiments. The most surprising results, however, came from Chinese. Because the algorithm had to be changed to take into account the associations between morphemes, the methodology made it possible to partly confirm, from a statistical point of view, one of the major claims of construction grammar: the existence of a probabilistic network of constructions, from morphemes to idiomatic phrases.
Notes
1. The web corpora used in the IdiomSearch experiment were assembled using the WebBootCat tool provided by the Sketch Engine (http://sketchengine.co.uk), on the basis of seed words and following the methodology described in [2].
2. For Chinese, the corpus consists not of 200 million Chinese characters (hans) but of 200 million Chinese words (tokens).
3. IdiomSearch is accessible on the web at: http://idiomsearch.LSTI.ucl.ac.be.
4. The Guardian, http://www.theguardian.com, 7 August 2017.
5. Athelstan Homepage, http://www.athel.com/cspatg.html, last accessed 2017/08/09.
6. The computational issue is well known: many web pages contain Unicode errors. The robot assumes that the downloaded web page is in Unicode, but the errors remain and appear in the web corpus.
7. shū zhōng zì yǒu huángjīn wū: "A book holds a house of gold."
8. According to Wikipedia, English accounts for 51.6% of all web pages, Spanish for 5.1%, French for 4.1%, and Chinese for 2.0%. Wikipedia homepage, https://en.wikipedia.org/wiki/Languages_used_on_the_Internet, last accessed 2017/08/17.
References
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press/Addison Wesley, New York (1999)
Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. J. Lang. Res. Eval. 43, 209–226 (2009)
Booij, G.: Morphology in construction grammar. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 255–273. Oxford University Press, Oxford/New York (2013)
Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N. (eds.): Phraseologie/Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung/An International Handbook of Contemporary Research. De Gruyter, Berlin/New York (2007)
Colson, J.-P.: The World Wide Web as a corpus for set phrases. In: Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N. (eds.) Phraseologie/Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung/An International Handbook of Contemporary Research, pp. 1071–1077. De Gruyter, Berlin/New York (2007)
Colson, J.-P.: Set phrases around globalization: an experiment in corpus-based computational phraseology. In: Alonso Almeida, F., Ortega Barrera, I., Quintana Toledo, E., Sanchez Cuervo, M.E. (eds.) Input a Word, Analyze the World. Selected Approaches to Corpus Linguistics, pp. 141–152. Cambridge Scholars Publishing, Newcastle (2016)
Croft, W.: Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press, Oxford (2001)
Croft, W.: Radical construction grammar. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 211–232. Oxford University Press, Oxford/New York (2013)
Fillmore, C.H.: The mechanisms of construction grammar. Berkeley Linguistic Soc. 14, 35–55 (1988)
Goldberg, A.: Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago (1995)
Goldberg, A.: Constructions: a new theoretical approach to language. Trends Cogn. Sci. 7(5), 219–224 (2003)
Goldberg, A.: Constructions at Work: The Nature of Generalization in Language. Oxford University Press, Oxford (2006)
Gries, S.: 50-something years of work on collocations. What is or should be next …. Int. J. Corpus Linguist. 18, 137–165 (2013)
Gries, S.: Data in construction grammar. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 93–108. Oxford University Press, Oxford/New York (2013)
Gries, S., Stefanowitsch, A.: Extending collostructional analysis: a corpus-based perspective on ‘Alternations’. Int. J. Corpus Linguist. 9(1), 97–129 (2004)
Henry, K.: Les chengyu du chinois: caractérisation de phrasèmes hors norme. Yearb. Phraseology 7, 99–126 (2016)
Hoffmann, T.H., Trousdale, G. (eds.): The Oxford Handbook of Construction Grammar. Oxford University Press, Oxford/New York (2013)
Manning, C.H., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009)
Moon, R.: Fixed Expressions and Idioms in English. Clarendon Press, Oxford (1998)
Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)
Stefanowitsch, A.: Collostructional analysis. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 290–306. Oxford University Press, Oxford/New York (2013)
Wray, A.: Formulaic Language: Pushing the Boundaries. Oxford University Press, Oxford (2008)
Wulff, S.: Words and idioms. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 274–289. Oxford University Press, Oxford/New York (2013)
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Colson, JP. (2017). The IdiomSearch Experiment: Extracting Phraseology from a Probabilistic Network of Constructions. In: Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_2
DOI: https://doi.org/10.1007/978-3-319-69805-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69804-5
Online ISBN: 978-3-319-69805-2
eBook Packages: Computer Science (R0)