short-paper

Mining massive SNP data for identifying associated SNPs and uncovering gene relationships

Authors:
Amy Webb (The Ohio State University, Columbus, OH)
Aaron Albin (The Ohio State University, Columbus, OH)
Zhan Ye (Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, WI)
Majid Rastegar-Mojarad (Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, WI)
Kun Huang (The Ohio State University, Columbus, OH)
Jeffrey Parvin (The Ohio State University, Columbus, OH)
Wolfgang Sadee (The Ohio State University, Columbus, OH)
Lang Li (Indiana University, Indianapolis, IN)
Simon Lin (Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, WI)
Yang Xiang (The Ohio State University, Columbus, OH)

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2014
Pages 304–313
https://doi.org/10.1145/2649387.2649395

Published: 20 September 2014

ABSTRACT

Studies of SNP correlations have focused on SNPs located on the same chromosome, since SNPs on different chromosomes are expected to segregate randomly. However, previous studies suggest that SNPs can be associated with each other over long distances and even across different chromosomes. To facilitate the study of SNP associations, our goal is to find SNPs that coexist in a significant number of samples regardless of their genomic distance, and subsequently to study the relationships among these associated SNPs and their corresponding genes. Mining co-occurring SNP associations is computationally challenging, which motivates us to design an efficient data mining algorithm, FCIRC, to mine SNP associations from massive SNP data. By applying our method to the original SNP data and to random chromosome permutation data, we demonstrate that it is able to find non-random SNP associations across multiple chromosomes. Many of the large number of associated SNPs identified by our method involve multiple chromosomes. Some SNP associations also suggest novel relationships among the corresponding genes, and some may imply biological and disease mechanisms related to those genes.
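
The abstract does not describe how FCIRC works, so the following is only a minimal sketch of the underlying problem, not the authors' algorithm. It counts SNP pairs that co-occur in at least a given number of samples by brute force, and builds a permuted control under the assumption that "random chromosome permutation" means shuffling, independently for each chromosome, which sample carries that chromosome's set of SNPs. The function names, the `chrom_of` mapping, and the toy SNP identifiers are all hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def cooccurring_snp_pairs(samples, min_support):
    """Return SNP pairs found together in at least min_support samples.

    Brute-force pair counting for illustration only; it does not scale
    to massive SNP data and is not the FCIRC algorithm.
    """
    counts = Counter()
    for snps in samples:
        for pair in combinations(sorted(snps), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def cross_chromosome_pairs(pairs, chrom_of):
    """Keep only pairs whose two SNPs lie on different chromosomes."""
    return {p: n for p, n in pairs.items() if chrom_of[p[0]] != chrom_of[p[1]]}


def permute_chromosomes(samples, chrom_of, seed=0):
    """Shuffle each chromosome's SNP block across samples (assumed null model).

    Within-chromosome co-occurrence is preserved because blocks stay intact;
    cross-chromosome co-occurrence is randomized.
    """
    rng = random.Random(seed)
    chroms = sorted({chrom_of[s] for snps in samples for s in snps})
    permuted = [set() for _ in samples]
    for c in chroms:
        blocks = [{s for s in snps if chrom_of[s] == c} for snps in samples]
        rng.shuffle(blocks)
        for target, block in zip(permuted, blocks):
            target |= block
    return permuted


# Toy data: each sample is the set of SNPs it carries (hypothetical IDs).
samples = [
    {"rs1", "rs2", "rs3"},
    {"rs1", "rs2"},
    {"rs1", "rs2", "rs3"},
    {"rs3"},
]
chrom_of = {"rs1": "chr1", "rs2": "chr7", "rs3": "chr1"}

frequent = cooccurring_snp_pairs(samples, min_support=3)
across = cross_chromosome_pairs(frequent, chrom_of)
control = permute_chromosomes(samples, chrom_of, seed=1)
```

Comparing the cross-chromosome pairs found in `samples` with those found in `control` gives a rough sense of which associations exceed what chance alone would produce, which is the role the permuted data plays in the abstract.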

Index Terms

Mining massive SNP data for identifying associated SNPs and uncovering gene relationships
1. Applied computing
  1. Life and medical sciences
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2014
851 pages
ISBN: 9781450328944
DOI: 10.1145/2649387
General Chairs: Pierre Baldi (University of California, Irvine), Wei Wang (University of California, Los Angeles)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 254 of 885 submissions, 29%