
DLA: a Distributed, Location-based and Apriori-based Algorithm for Biological Sequence Pattern Mining


Abstract:

With the rapid growth of genomic data, the need for scalable data mining algorithms has increased. Frequent contiguous sequence mining is a technique that can help biologists better understand the function and structure of DNA by capturing the common characteristics among related sequences. Many sequence mining algorithms have been developed over time. However, most of them suffer from scaling issues when dealing with big data, or give no guarantee that their results are complete. In this paper, we propose a distributed sequential pattern mining algorithm implemented on Apache Spark. Specifically, the algorithm exploits the Apriori property and information about each pattern's location within the original sequence to drastically reduce the number of candidates at each iteration. Experimental results on real-world datasets confirm our performance expectations, showing better scalability than other distributed solutions.
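
The idea of combining the Apriori property with pattern locations can be illustrated with a minimal single-machine sketch. This is not the paper's DLA implementation and does not use Apache Spark; the function name `frequent_contiguous_patterns`, the `min_support` parameter, and the choice to count support as the number of distinct sequences containing a pattern are all assumptions made for illustration. Each frequent pattern of length k is extended by one symbol only at the positions where it is already known to occur, and a candidate is kept only if its length-k suffix is also frequent.

```python
from collections import defaultdict


def frequent_contiguous_patterns(sequences, min_support):
    """Map each frequent contiguous pattern to the ids of the sequences containing it.

    Illustrative sketch only: support is assumed to be the number of distinct
    sequences in which a pattern occurs at least once.
    """

    def prune(candidates):
        # Keep candidates that occur in at least min_support distinct sequences.
        return {
            pattern: occurrences
            for pattern, occurrences in candidates.items()
            if len({sid for sid, _ in occurrences}) >= min_support
        }

    # Level 1: record the (sequence id, start offset) of every single symbol.
    locations = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, symbol in enumerate(seq):
            locations[symbol].append((sid, pos))

    current = prune(locations)
    result = {}
    length = 1
    while current:
        for pattern, occurrences in current.items():
            result[pattern] = {sid for sid, _ in occurrences}

        # Location-based extension: grow each pattern only where it occurs.
        extended = defaultdict(list)
        for pattern, occurrences in current.items():
            for sid, pos in occurrences:
                end = pos + length
                if end < len(sequences[sid]):
                    candidate = pattern + sequences[sid][end]
                    # Apriori check: the candidate's suffix must be frequent too.
                    if candidate[1:] in current:
                        extended[candidate].append((sid, pos))

        current = prune(extended)
        length += 1
    return result


if __name__ == "__main__":
    toy = ["ACGTAC", "TACGA", "GGACG"]  # made-up toy sequences
    for pattern, sids in sorted(frequent_contiguous_patterns(toy, 3).items()):
        print(pattern, sorted(sids))
```

Because candidates are generated from stored locations rather than by joining all pairs of frequent patterns, the candidate set at each level is limited to substrings that actually appear in the data. A distributed version would additionally have to partition the location lists across workers, which this sketch does not attempt.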

Published in: 2018 IEEE International Conference on Big Data (Big Data)

Date of Conference: 10-13 December 2018

Date Added to IEEE Xplore: 24 January 2019

DOI: 10.1109/BigData.2018.8622007

Conference Location: Seattle, WA, USA