Abstract
Classifying organisms is a routine task in biology and in other fields. It is usually done by comparing a set of descriptors associated with each object. However, general-purpose statistical packages offer only a limited number of methods for such comparisons, so specific tools are required for each concrete problem. Weka is a freely available framework that supports both supervised and unsupervised machine-learning algorithms. Here, we present WekaBioSimilarity, an extension of Weka that implements several resemblance measures for comparing different kinds of descriptors: binary, multi-value, string, numerical, and heterogeneous data. Together with Weka, WekaBioSimilarity makes it possible to classify objects using different resemblance measures in combination with clustering and classification algorithms. The two systems can be used together as a standalone application or incorporated into the workflow of other software systems that require a classification step. WekaBioSimilarity is available at http://wekabiosimilarity.sourceforge.net.
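To make the idea of a resemblance measure on binary descriptors concrete, the following is a minimal Python sketch of two widely used coefficients, Jaccard and simple matching. It is an illustration only and not WekaBioSimilarity's code or API: the function names (`binary_counts`, `jaccard`, `simple_matching`), the sample vectors, and the convention of returning 1.0 for two all-absent vectors are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def binary_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two presence/absence vectors.

    a: present in both, b: present only in x,
    c: present only in y, d: absent in both.
    """
    if len(x) != len(y):
        raise ValueError("descriptor vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient a / (a + b + c); shared absences are ignored."""
    a, b, c, _ = binary_counts(x, y)
    total = a + b + c
    # Assumed convention: two vectors with no presences count as identical.
    return a / total if total else 1.0


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching coefficient (a + d) / (a + b + c + d)."""
    a, b, c, d = binary_counts(x, y)
    n = a + b + c + d
    # Assumed convention: two empty vectors count as identical.
    return (a + d) / n if n else 1.0


# Hypothetical presence/absence descriptors for two objects.
sample_a = [1, 1, 0, 1, 0, 0]
sample_b = [1, 0, 0, 1, 1, 0]

print(jaccard(sample_a, sample_b))
print(simple_matching(sample_a, sample_b))
```

The two coefficients differ only in whether shared absences count as evidence of similarity, which is one reason the choice of measure can change the resulting classification.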
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Domínguez, C., Heras, J., Mata, E., Pascual, V. (2016). WekaBioSimilarity—Extending Weka with Resemblance Measures. In: Luaces, O., et al. Advances in Artificial Intelligence. CAEPIA 2016. Lecture Notes in Computer Science, vol 9868. Springer, Cham. https://doi.org/10.1007/978-3-319-44636-3_9
DOI: https://doi.org/10.1007/978-3-319-44636-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44635-6
Online ISBN: 978-3-319-44636-3
eBook Packages: Computer Science (R0)