In high-variation genomics datasets, such as those found in metagenomics or complex polyploid genome analysis, error detection and variant calling are impeded by the difficulty of distinguishing sequencing errors from actual biological variation. Accepting only base candidates with a high frequency of occurrence is no longer a reliable criterion, because natural variation means that rare bases may be genuine rather than erroneous.
This work employs machine learning models to classify bases as either sequencing errors or rare variants. Potential error candidates are first preselected with a weighted frequency measure, which uses inter-sequence pairwise similarity to focus on unexpected variations. Different similarity measures are used to accommodate different types of datasets, and four machine learning models are tested.
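The preselection idea can be illustrated with a minimal sketch. It is not the measure used in this work, whose exact weighting and similarity functions are not given here. The sketch assumes gap-free, equal-length aligned sequences, uses fractional identity as the pairwise similarity, and applies an arbitrary threshold of 0.2; the function names `identity`, `weighted_frequency`, and `error_candidates` are hypothetical.

```python
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences agree.

    Assumed similarity measure; other measures could be substituted.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_frequency(seqs: List[str], i: int, col: int) -> float:
    """Similarity-weighted support for the base of sequence i at column col.

    Each other sequence votes with a weight equal to its similarity to
    sequence i, so disagreement with close relatives counts more than
    disagreement with distant sequences.
    """
    base = seqs[i][col]
    total = 0.0
    support = 0.0
    for k, other in enumerate(seqs):
        if k == i:
            continue
        weight = identity(seqs[i], other)
        total += weight
        if other[col] == base:
            support += weight
    return support / total if total > 0 else 0.0


def error_candidates(seqs: List[str], threshold: float = 0.2) -> List[Tuple[int, int]]:
    """Return (sequence index, column) pairs whose weighted frequency is below
    the threshold. The threshold value is an illustrative assumption."""
    return [
        (i, col)
        for i in range(len(seqs))
        for col in range(len(seqs[0]))
        if weighted_frequency(seqs, i, col) < threshold
    ]


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ACGA", "ACGT"]
    print(error_candidates(reads))
```

In this sketch, a base that is unsupported by sequences highly similar to its own is flagged as unexpected, whereas a base that is rare overall but shared with close relatives receives a high weighted frequency and is not flagged. The flagged positions would then be passed to a classifier.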