ETI: an efficient index for set similarity queries

Jia, Lianyin; Xi, Jianqing; Li, Mengjuan; Liu, Yong; Miao, Decheng

doi:10.1007/s11704-012-1237-5

Research Article
Published: 10 November 2012

Volume 6, pages 700–712 (2012)

Frontiers of Computer Science

Lianyin Jia¹,², Jianqing Xi¹, Mengjuan Li³, Yong Liu¹ & Decheng Miao¹

Abstract

Set queries are an important topic and have attracted a lot of attention. Earlier research mainly concentrated on set containment queries. In this paper we focus on the T-Overlap query, which is the foundation of the set similarity query. To address this issue, unlike traditional algorithms that are based on an inverted index, we design a new paradigm based on the prefix tree (trie), called the expanded trie index (ETI), which expands the trie node structure by adding some new properties. Based on ETI, we convert the T-Overlap problem to finding query nodes whose query depth equals T, and propose a new algorithm called T-Similarity to solve T-Overlap efficiently. We then present a three-step framework to extend T-Overlap to other similarity predicates. Extensive experiments compare T-Similarity with inverted-index-based algorithms with respect to query cardinality, overlap threshold, dataset size, the number of distinct elements, and so on. Results show that T-Similarity outperforms the state-of-the-art algorithms in many aspects.
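
To make the T-Overlap query concrete: given a query set Q and a threshold T, it returns every stored set S with |Q ∩ S| ≥ T. The following is a minimal sketch of answering it over a plain trie of sorted sets. It is not the ETI node structure or the T-Similarity algorithm, whose details are not given in the abstract. All names here (TrieNode, build_trie, t_overlap) are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Set, Tuple


class TrieNode:
    """Plain trie node (illustrative; not the expanded ETI node)."""

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.record_ids: List[int] = []  # records whose sorted elements end here


def build_trie(records: Iterable[Iterable[int]]) -> TrieNode:
    """Insert each record as its sorted, de-duplicated element sequence."""
    root = TrieNode()
    for rid, record in enumerate(records):
        node = root
        for element in sorted(set(record)):
            node = node.children.setdefault(element, TrieNode())
        node.record_ids.append(rid)
    return root


def collect_subtree(node: TrieNode, out: Set[int]) -> None:
    """Add every record stored at or below `node` to `out`."""
    stack = [node]
    while stack:
        current = stack.pop()
        out.update(current.record_ids)
        stack.extend(current.children.values())


def t_overlap(root: TrieNode, query: Iterable[int], t: int) -> Set[int]:
    """Return ids of records sharing at least `t` elements with `query`."""
    sorted_q = sorted(set(query))
    q = set(sorted_q)
    result: Set[int] = set()
    if t <= 0:  # every record trivially qualifies
        collect_subtree(root, result)
        return result
    stack: List[Tuple[TrieNode, int]] = [(root, 0)]
    while stack:
        node, matched = stack.pop()
        for element, child in node.children.items():
            m = matched + (element in q)
            if m >= t:
                # Threshold reached on this path: the whole subtree qualifies.
                collect_subtree(child, result)
                continue
            # Paths are sorted, so deeper elements are all greater than
            # `element`; prune if too few query elements remain to reach t.
            remaining = len(sorted_q) - bisect_right(sorted_q, element)
            if m + remaining >= t:
                stack.append((child, m))
    return result


records = [{1, 2, 3}, {2, 3, 4}, {5, 6}, {1, 5}]
index = build_trie(records)
print(sorted(t_overlap(index, {2, 3, 5}, 2)))  # prints [0, 1]
```

In this sketch, a path is abandoned as soon as the unmatched query elements can no longer lift the overlap to T, and a whole subtree is accepted once the overlap reaches T, which loosely mirrors the abstract's idea of locating nodes at which the matched depth equals T.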

Author information

Authors and Affiliations

School of Computer Science & Engineering, South China University of Technology, Guangzhou, 530641, China
Lianyin Jia, Jianqing Xi, Yong Liu & Decheng Miao
Department of Computer Science, Yunnan Agricultural University, Kunming, 650201, China
Lianyin Jia
Library, Yunnan Normal University, Kunming, 650092, China
Mengjuan Li

Additional information

Lianyin Jia is currently a PhD student in the School of Computer Science and Engineering at the South China University of Technology and is also a lecturer at Yunnan Agricultural University. His research activities cover databases, data mining, parallel computing, and web services.

Jianqing Xi is a professor of Computer Science and Engineering at the South China University of Technology. His current research interests include databases and data warehouses, P2P distributed systems, and software development techniques.

Mengjuan Li is a librarian at Yunnan Normal University. She obtained her master's degree in computer application technology in 2008 from Kunming University of Science and Technology. Her research interests focus on databases, information retrieval, and parallel computing.

Yong Liu is currently a PhD candidate in the School of Computer Science and Engineering, South China University of Technology, and is a lecturer at Guangxi University for Nationalities. His current research interests include databases and GPU-based parallel computing.

Decheng Miao is currently a PhD candidate in the School of Computer Science and Engineering, South China University of Technology. He is a lecturer in the School of Mathematics and Information Science, Shaoguan University. His current research interests include software system formal theory, formal semantics, databases, and network computing.

About this article

Cite this article

Jia, L., Xi, J., Li, M. et al. ETI: an efficient index for set similarity queries. Front. Comput. Sci. 6, 700–712 (2012). https://doi.org/10.1007/s11704-012-1237-5

Received: 22 October 2011
Accepted: 14 June 2012
Published: 10 November 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s11704-012-1237-5
