Authors:
George Volis
;
Christos Makris
and
Andreas Kanavos
Affiliation:
University of Patras, Greece
Keyword(s):
Searching and Browsing, Web Information Filtering and Retrieval, Text Mining, Indexing Structures, Inverted Files, Index Compression-Gram Indexing, Sequence Analysis and Assembly.
Related
Ontology
Subjects/Areas/Topics:
Searching and Browsing
;
Web Information Systems and Technologies
;
Web Interfaces and Applications
Abstract:
The number and size of genomic databases have grown rapidly the last years. Consequently, the number of Internet-accessible databases has been rapidly growing .Therefore there is a need for satisfactory methods for managing this growing information. A lot of effort has been put to this direction. Contributing to this effort this paper presents two algorithms which can eliminate the amount of space for storing genomic information. Our first algorithm is based on the classic n-grams/2L technique for indexing a DNA sequence and it can convert the Inverted Index of this classic algorithm to a more compressed format. Researchers have revealed the existence of repeated and palindrome patterns in DNA of living organisms. The main motivation of this technique is based on this remark and proposes an alternative data structure for handling these sequences. Our experimental results show that our algorithm can achieve a more efficient index than the n-grams/2L algorithm and can be adapted by any
algorithm that is based to n-grams/2L The second algorithm is based on the n-grams technique. Perceiving the four symbols of DNA alphabet as vertex of a square scheme imprint a DNA sequence as a relation between vertices, sides and diagonals of a square. The experimental results shows that this second idea succeed even more successfully compression of our index structure.
(More)