Exploration on efficient similar sentences extraction

Gu, Yanhui; Yang, Zhenglu; Xu, Guandong; Nakano, Miyuki; Toyoda, Masashi; Kitsuregawa, Masaru

doi:10.1007/s11280-012-0195-z


Published: 06 January 2013

World Wide Web, Volume 17, pages 595–626 (2014)



Abstract

Measuring the semantic similarity between sentences is essential to many applications, such as text summarization, Web page retrieval, question-answer models, and image extraction. A few studies have explored this problem using knowledge-based, corpus-based, and hybrid strategies, and most of them focus on improving effectiveness. In this paper, we address efficiency: given a sentence collection, how can the top-k sentences most semantically similar to a query be discovered efficiently? Previous methods cannot handle large data efficiently, because applying them directly requires testing every candidate sentence, which is time consuming. We propose efficient strategies to tackle this problem based on a general framework. The basic idea is to build a corresponding index for each similarity measure during preprocessing. Traversing these indices at query time avoids testing many candidates and thereby improves efficiency. Moreover, an optimal aggregation algorithm is introduced to assemble these similarities. Our framework is general enough to incorporate many similarity metrics, as discussed in the paper. We conduct an extensive experimental evaluation on three real datasets to assess the efficiency of our proposal, and we illustrate the trade-off between effectiveness and efficiency. The experimental results demonstrate that our proposal outperforms state-of-the-art techniques in efficiency while maintaining the same high precision.
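The index-and-aggregate idea in the abstract can be illustrated with a small sketch. The abstract does not specify the paper's actual algorithm, so the code below is an assumption: it uses a threshold-style top-k aggregation in the spirit of Fagin et al.'s threshold algorithm, with one precomputed score table per similarity measure and a weighted sum as the combined score. The function name `top_k_threshold`, the weights, and the toy scores are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical index type: one table per similarity measure,
# mapping a sentence id to its similarity to the query in [0, 1].
Index = Dict[str, float]


def top_k_threshold(
    indices: List[Index], weights: List[float], k: int
) -> Tuple[List[Tuple[str, float]], int]:
    """Threshold-style top-k aggregation (illustrative, not the paper's code).

    Returns the top-k (sentence id, combined score) pairs and the number
    of sentences that were actually scored. Weights must be non-negative
    so that the weighted sum is monotone in each similarity.
    """
    # Sorted access: each index is read in descending score order.
    sorted_lists = [
        sorted(idx.items(), key=lambda kv: -kv[1]) for idx in indices
    ]
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    scored: Dict[str, float] = {}
    top: List[Tuple[float, str]] = []  # min-heap holding the current best k

    for depth in range(max_depth):
        last_seen = []
        for lst in sorted_lists:
            if depth >= len(lst):
                # Exhausted list: unseen sentences score 0 here.
                last_seen.append(0.0)
                continue
            sid, score = lst[depth]
            last_seen.append(score)
            if sid in scored:
                continue
            # Random access: look the sentence up in every index.
            total = sum(
                w * idx.get(sid, 0.0) for w, idx in zip(weights, indices)
            )
            scored[sid] = total
            if len(top) < k:
                heapq.heappush(top, (total, sid))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, sid))
        # Upper bound on the combined score of any sentence not yet seen.
        threshold = sum(w * s for w, s in zip(weights, last_seen))
        if len(top) >= k and top[0][0] >= threshold:
            break  # no unseen sentence can enter the top-k

    ranked = sorted(((sid, s) for s, sid in top), key=lambda p: (-p[1], p[0]))
    return ranked, len(scored)


if __name__ == "__main__":
    # Toy scores for four candidate sentences under two similarity measures.
    string_sim = {"s1": 0.9, "s2": 0.8, "s3": 0.2, "s4": 0.1}
    semantic_sim = {"s1": 0.8, "s2": 0.9, "s3": 0.3, "s4": 0.1}
    result, n_scored = top_k_threshold([string_sim, semantic_sim], [0.5, 0.5], k=2)
    print(result, "sentences scored:", n_scored)
```

In this toy run the loop stops before the low-scoring sentences are ever examined, which is the efficiency gain the abstract describes: candidates that cannot reach the top-k are never tested. In a real system the sorted lists would be built once during preprocessing rather than sorted at query time.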



Author information

Authors and Affiliations

Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
Yanhui Gu, Zhenglu Yang, Miyuki Nakano, Masashi Toyoda & Masaru Kitsuregawa
Advanced Analytics Institute, University of Technology Sydney, Sydney, New South Wales, Australia
Guandong Xu


Corresponding authors

Correspondence to Zhenglu Yang or Guandong Xu.


About this article

Cite this article

Gu, Y., Yang, Z., Xu, G. et al. Exploration on efficient similar sentences extraction. World Wide Web 17, 595–626 (2014). https://doi.org/10.1007/s11280-012-0195-z


Received: 18 July 2012
Revised: 17 November 2012
Accepted: 21 November 2012
Published: 06 January 2013
Issue Date: July 2014
DOI: https://doi.org/10.1007/s11280-012-0195-z
