Squeezing Long Sequence Data for Efficient Similarity Search

Song, Guojie; Cui, Bin; Zheng, Baihua; Xie, Kunqing; Yang, Dongqing

doi:10.1007/978-3-540-78849-2_44


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4976)

Included in the following conference series:

Asia-Pacific Web Conference


Abstract

Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications. In this paper, we propose a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), to speed up searching over long sequences. The SEM-tree is a multi-level structure in which each level represents the sequence data as multisets at a different compression level; the multisets grow longer toward the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they need not keep the character order of the sequence, which gives a shorter representation, yet they preserve most of the distance information between sequences. Each level of the tree prunes the search space further, because the multisets make the outcome of a comparison predictable early, so the search can finish ahead of time and the computational cost is greatly reduced. A comprehensive set of experiments is conducted to evaluate the performance of the SEM-tree, and the results show that the proposed method is much more efficient than existing representative methods.
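
The filter-and-verify idea behind multiset pruning can be illustrated with a minimal sketch. It is not the SEM-tree and does not use the paper's sequence embedding algorithm: it assumes a single level, uses plain character-frequency multisets, and substitutes the well-known bag distance (a lower bound on edit distance) as the cheap filter. The names `bag_distance`, `edit_distance`, and `range_search` are hypothetical and chosen only for this example.

```python
from collections import Counter


def bag_distance(a: str, b: str) -> int:
    """Order-free distance between character multisets.

    Assumption: this stands in for the paper's embedding-based multisets.
    It never exceeds the edit distance, so it is safe to prune with.
    """
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts (multiset difference).
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def range_search(query: str, database: list[str], radius: int):
    """Return sequences within `radius` edits of `query`.

    Also returns how many candidates needed the expensive verification.
    """
    hits, verified = [], 0
    for s in database:
        # Cheap multiset filter: if the lower bound already exceeds the
        # radius, the exact distance must exceed it too, so skip s.
        if bag_distance(query, s) > radius:
            continue
        verified += 1
        if edit_distance(query, s) <= radius:
            hits.append(s)
    return hits, verified


if __name__ == "__main__":
    db = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGGACGT", "GGGGCCCC"]
    print(range_search("ACGTACGT", db, radius=1))
```

In this toy run only three of the five sequences reach the quadratic-time edit distance computation; the other two are discarded by the multiset comparison alone. A multi-level structure such as the one the abstract describes would apply filters of increasing precision before reaching the original sequences.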

Supported by the National Natural Science Foundation of China under Grants No. 60703066 and No. 60473051, and by the National High-Tech Research and Development Plan of China (863) under Grant No. 2006AA12Z217.



Author information

Authors and Affiliations

Key Laboratory of Machine Perception (Peking University), Ministry of Education, Beijing, China
Guojie Song & Kunqing Xie
School of Electronic Engineering and Computer Science, Peking University, Beijing, China
Bin Cui & Dongqing Yang
School of Information System, Singapore Management University, Singapore
Baihua Zheng


Editor information

Yanchun Zhang, Ge Yu, Elisa Bertino, Guandong Xu


Cite this paper

Song, G., Cui, B., Zheng, B., Xie, K., Yang, D. (2008). Squeezing Long Sequence Data for Efficient Similarity Search. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds) Progress in WWW Research and Development. APWeb 2008. Lecture Notes in Computer Science, vol 4976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78849-2_44


DOI: https://doi.org/10.1007/978-3-540-78849-2_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78848-5
Online ISBN: 978-3-540-78849-2
eBook Packages: Computer Science, Computer Science (R0)
