Exploration on efficient similar sentences extraction

World Wide Web

Abstract

Measuring the semantic similarity between sentences is essential for many applications, such as text summarization, Web page retrieval, question answering, and image extraction. Several studies have explored this issue using a range of techniques, e.g., knowledge-based, corpus-based, and hybrid strategies. Most of these studies focus on improving effectiveness. In this paper, we address the efficiency issue: given a sentence collection, how can the top-k sentences most semantically similar to a query be discovered efficiently? Previous methods cannot handle large data efficiently, because applying them directly requires testing every candidate sentence, which is time consuming. We propose efficient strategies to tackle this problem based on a general framework. The basic idea is to build, for each similarity metric, a corresponding index during preprocessing. Traversing these indices at query time avoids testing many candidates and thereby improves efficiency. Moreover, an optimal aggregation algorithm is introduced to combine these similarities. Our framework is general enough that many similarity metrics can be incorporated, as discussed in the paper. We conduct an extensive experimental evaluation on three real datasets to assess the efficiency of our proposal. In addition, we illustrate the trade-off between effectiveness and efficiency. The experimental results demonstrate that our proposal outperforms the state-of-the-art techniques in efficiency while maintaining the same high precision.
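The index-and-aggregate idea described above can be illustrated with a minimal sketch. It assumes a threshold-style aggregation in the spirit of the classic Threshold Algorithm of Fagin, Lotem, and Naor; the abstract does not specify the exact algorithm, so this is not the authors' implementation. The names `build_index` and `top_k`, the weighted-sum aggregation, and the toy scores are all hypothetical, and scores are assumed to be non-negative with non-negative weights.

```python
import heapq


def build_index(scores):
    """Preprocessing step (hypothetical): one index per similarity metric.

    Returns a list of (sentence_id, score) sorted by descending score for
    sorted access, plus a dict for random access.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, dict(scores)


def top_k(indices, weights, k):
    """Threshold-style top-k over several similarity indices.

    Assumes a monotone aggregation (weighted sum, non-negative weights and
    scores). Stops as soon as no unseen sentence can beat the current top-k,
    so many candidates are never scored.
    """
    ranked_lists = [ranked for ranked, _ in indices]
    lookups = [lookup for _, lookup in indices]
    seen = set()
    heap = []  # min-heap of (aggregate_score, sentence_id), size <= k
    max_depth = max(len(ranked) for ranked in ranked_lists)

    for depth in range(max_depth):
        last_seen = []
        for ranked in ranked_lists:
            if depth >= len(ranked):
                # List exhausted: any unseen sentence scores 0 on this metric.
                last_seen.append(0.0)
                continue
            sid, score = ranked[depth]
            last_seen.append(score)
            if sid in seen:
                continue
            seen.add(sid)
            # Random access: fetch this sentence's score in every index.
            total = sum(w * lookup.get(sid, 0.0)
                        for w, lookup in zip(weights, lookups))
            if len(heap) < k:
                heapq.heappush(heap, (total, sid))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, sid))

        # Upper bound on the aggregate score of any sentence not yet seen.
        threshold = sum(w * s for w, s in zip(weights, last_seen))
        if len(heap) == k and heap[0][0] >= threshold:
            break

    ordered = sorted(heap, key=lambda t: (-t[0], t[1]))
    return [(sid, total) for total, sid in ordered]


# Toy similarity scores of four candidate sentences to one query (made up).
string_sim = {"s1": 0.9, "s2": 0.4, "s3": 0.1, "s4": 0.3}
semantic_sim = {"s1": 0.8, "s2": 0.7, "s3": 0.2, "s4": 0.1}

indices = [build_index(string_sim), build_index(semantic_sim)]
result = top_k(indices, weights=[0.5, 0.5], k=2)
print([sid for sid, _ in result])
```

In this toy run the loop stops after reading two entries from each list, so `s3` and `s4` are never scored. This is the kind of candidate pruning the abstract refers to, although the savings on real data depend on the score distributions.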

Author information

Corresponding authors

Correspondence to Zhenglu Yang or Guandong Xu.

About this article

Cite this article

Gu, Y., Yang, Z., Xu, G. et al. Exploration on efficient similar sentences extraction. World Wide Web 17, 595–626 (2014). https://doi.org/10.1007/s11280-012-0195-z

  • DOI: https://doi.org/10.1007/s11280-012-0195-z