Authors: Petr Procházka and Jan Holub

Affiliation: Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University, Thákurova 2700/9, Prague 6, Czech Republic

Keyword(s): Consensus Nucleotide Sequences, Genomic Sequences, Degenerate Pattern Matching, q-gram Inverted Index.

Abstract: We propose a novel pattern matching algorithm for consensus nucleotide sequences over the IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). Consensus nucleotide sequences represent a consensus obtained by sequencing a population of the same species, and they are treated as so-called degenerate strings. BADPM works at the level of single bytes and achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs an O(m + mα² log m)-space data structure and O(mα²) time for preprocessing, where m is the length of the pattern and α is the maximum number of variants implied by a 4-gram over the IUPAC alphabet. The worst-case locate time of BADPM is bounded by O(nm²α⁴), where n is the length of the input text. However, experiments performed on real genomic data showed sublinear search time in practice. BADPM can easily cooperate with a block q-gram inverted index and so achieve even better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as baselines: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). PNS in particular proved efficient regardless of the length m of the searched pattern. BADPM proved clearly superior for medium-length and long patterns.
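To make the notion of degenerate matching concrete, the following is a minimal Python sketch of a naive search over IUPAC symbols. It is not BADPM, BMH, or PNS as implemented by the authors. It assumes the standard IUPAC nucleotide codes encoded as 4-bit masks, and it assumes that two symbols match when their nucleotide sets intersect; the paper may define matching differently. The names `IUPAC` and `naive_degenerate_search` are hypothetical.

```python
# Standard IUPAC nucleotide codes as 4-bit masks (A=1, C=2, G=4, T=8).
IUPAC = {
    "A": 1, "C": 2, "G": 4, "T": 8,
    "R": 1 | 4, "Y": 2 | 8, "S": 2 | 4, "W": 1 | 8,
    "K": 4 | 8, "M": 1 | 2,
    "B": 2 | 4 | 8, "D": 1 | 4 | 8, "H": 1 | 2 | 8, "V": 1 | 2 | 4,
    "N": 1 | 2 | 4 | 8,
}


def naive_degenerate_search(text, pattern):
    """Return all start positions where pattern matches text.

    Assumption: two IUPAC symbols match if they share at least one
    nucleotide, i.e. the bitwise AND of their masks is non-zero.
    Runs in O(n * m) time in the worst case.
    """
    t = [IUPAC[c] for c in text.upper()]
    p = [IUPAC[c] for c in pattern.upper()]
    hits = []
    for i in range(len(t) - len(p) + 1):
        # Every aligned pair of symbols must share a nucleotide.
        if all(t[i + j] & p[j] for j in range(len(p))):
            hits.append(i)
    return hits
```

For example, searching for `ACG` in `ACGTRCGT` reports positions 0 and 4 under this assumption, because `R` stands for A or G and therefore matches `A`. A quadratic scan like this is the kind of baseline that sublinear algorithms aim to beat.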

CC BY-NC-ND 4.0

Paper citation in several formats:
Procházka, P. and Holub, J. (2019). On-line Searching in IUPAC Nucleotide Sequences. In Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019) - BIOINFORMATICS; ISBN 978-989-758-353-7; ISSN 2184-4305, SciTePress, pages 66-77. DOI: 10.5220/0007382900660077

@conference{bioinformatics19,
author={Petr Procházka and Jan Holub},
title={On-line Searching in IUPAC Nucleotide Sequences},
booktitle={Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019) - BIOINFORMATICS},
year={2019},
pages={66-77},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007382900660077},
isbn={978-989-758-353-7},
issn={2184-4305},
}

TY - CONF

JO - Proceedings of the 12th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2019) - BIOINFORMATICS
TI - On-line Searching in IUPAC Nucleotide Sequences
SN - 978-989-758-353-7
IS - 2184-4305
AU - Procházka, P.
AU - Holub, J.
PY - 2019
SP - 66
EP - 77
DO - 10.5220/0007382900660077
PB - SciTePress