DOI: 10.1145/3290420.3290473
research-article

State-of-art: text similarity computing

Published: 02 November 2018

Abstract

In recent years, text similarity computing, an active and important technique in many NLP applications, has been studied extensively and has progressed rapidly. This paper first introduces the background, the basic computing process, the related resources, and the techniques of text similarity computing. By comparing several typical models, it then addresses three key issues in detail: the text representation model, the similarity calculation, and the quality evaluation. Typical applications of text similarity computing are also described. Finally, the difficulties of computing text similarity and a number of future research directions are discussed.
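
As a rough illustration of the first two issues named in the abstract (text representation and similarity calculation), the sketch below represents each text as a bag of words and compares the two with cosine similarity. This is a common vector-space baseline chosen here for illustration, not necessarily a method the paper itself proposes; the function names `bag_of_words`, `cosine_similarity`, and `text_similarity` are hypothetical, and the whitespace tokenizer is a simplifying assumption.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(text_a, text_b):
    """Similarity between two texts, from 0 (no shared terms) up to about 1."""
    return cosine_similarity(bag_of_words(text_a), bag_of_words(text_b))
```

For example, `text_similarity("the cat sat", "the cat ran")` shares two of three terms on each side and comes out at roughly 2/3, while two texts with no words in common score 0. A representation this simple ignores word order and meaning, which is the kind of limitation that semantic and graph-based representations aim to address.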



    Published In

    ICCIP '18: Proceedings of the 4th International Conference on Communication and Information Processing
    November 2018
    326 pages
    ISBN:9781450365345
    DOI:10.1145/3290420
    • Conference Chairs:
    • Jalel Ben-Othman,
    • Hui Yu,
    • Program Chairs:
    • Herwig Unger,
    • Masayuki Arai
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. knowledge acquisition
    2. natural language processing
    3. text representation
    4. text similarity computing

    Qualifiers

    • Research-article

    Conference

    ICCIP 2018

    Acceptance Rates

    Overall Acceptance Rate 61 of 301 submissions, 20%

