Abstract
Multiword expressions (MWEs) appear frequently and ungrammatically in natural languages. Identifying MWEs in free text is a very challenging problem. This paper proposes a knowledge-free, unsupervised, and language-independent metric, the Multiword Expression Distance (MED). The new metric is derived from an accepted physical principle, measures the distance from an n-gram to its semantics, and outperforms other state-of-the-art methods for MWEs in two applications: question answering and named entity extraction.
Additional information
This work was supported mainly by Canada’s IDRC Research Chair in Information Technology Program, under Grant No. 104519-006. It was also supported by the National Natural Science Foundation of China under Grant No. 60973104, the National Basic Research 973 Program of China under Grant No. 2007CB311003, NSERC Grant OGP0046506, the Canada Research Chair Program, MITACS, an NSERC Collaborative Grant, and Ontario’s Premier’s Discovery Award.
About this article
Cite this article
Bu, F., Zhu, XY. & Li, M. A New Multiword Expression Metric and Its Applications. J. Comput. Sci. Technol. 26, 3–13 (2011). https://doi.org/10.1007/s11390-011-9410-0
DOI: https://doi.org/10.1007/s11390-011-9410-0