Categorical Features Transformation with Compact One-Hot Encoder for Fraud Detection in Distributed Environment

Ul Haq, Ikram; Gondal, Iqbal; Vamplew, Peter; Brown, Simon

doi:10.1007/978-981-13-6661-1_6

Conference paper
First Online: 16 February 2019

Part of the book series: Communications in Computer and Information Science (CCIS, volume 996)

Abstract

Fraud detection for online banking is an important research area, but one of its challenges is the heterogeneous nature of transaction data, i.e. a combination of numeric and mixed attributes. Classification, regression and clustering algorithms usually perform better on numeric data. However, many machine learning problems involve categorical (nominal) features rather than numeric features only. In addition, some machine learning platforms, such as Apache Spark, accept numeric data only. One-hot encoding (OHE) is a widely used approach for transforming categorical features into numeric features in traditional data mining tasks. The one-hot approach has its own challenges: the transformed data are sparse, and the distinct values of an attribute are not always known in advance. Besides model accuracy, the compactness of machine learning models is equally important because of growing memory and storage needs. This paper presents a technique that transforms categorical features into numeric features by compacting sparse data, even when not all distinct values are known. The transformed data can be used to develop fraud detection systems. The accuracy of the results has been validated on a multi-node data cluster using synthetic and real bank fraud data and the publicly available KDD-99 anomaly detection dataset.
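
The two challenges of one-hot encoding named in the abstract can be made concrete with a minimal sketch in Python. This is plain one-hot encoding, not the compact encoder proposed in the paper, whose details are not given on this page. The sample values ("web", "mobile", "branch", "atm") and the helper names `build_index`, `one_hot_dense` and `one_hot_sparse` are hypothetical and chosen only for illustration.

```python
def build_index(values):
    """Map each distinct category to a column position, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


def one_hot_dense(value, index):
    """Dense one-hot vector: one slot per known category, at most one slot set."""
    vector = [0] * len(index)
    position = index.get(value)
    if position is not None:  # a value unseen during build_index stays all zeros
        vector[position] = 1
    return vector


def one_hot_sparse(value, index):
    """Sparse form: (vector length, list of positions that hold a 1)."""
    position = index.get(value)
    return (len(index), [] if position is None else [position])


# Hypothetical categorical attribute of a transaction.
training_values = ["web", "mobile", "branch", "web"]
index = build_index(training_values)

dense = one_hot_dense("mobile", index)    # [0, 1, 0]
sparse = one_hot_sparse("mobile", index)  # (3, [1])
unseen = one_hot_dense("atm", index)      # [0, 0, 0]
```

The dense vector grows with the number of distinct values while holding at most a single 1, which is the sparseness the abstract refers to. The value "atm" was not present when the index was built, so it cannot be told apart from any other unseen value; this is the second challenge, that distinct values are not always known in advance.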

Author information

Authors and Affiliations

ICSL, School of Science, Engineering and Information Technology, PO Box 663, Ballarat, VIC, 3353, Australia
Ikram Ul Haq, Iqbal Gondal & Peter Vamplew
Westpac Bank, Melbourne, Australia
Simon Brown

Authors

Ikram Ul Haq
Iqbal Gondal
Peter Vamplew
Simon Brown

Corresponding author

Correspondence to Ikram Ul Haq.

Editor information

Editors and Affiliations

School of Computing and Mathematics, Charles Sturt University, Albury, NSW, Australia
Rafiqul Islam
University of Auckland, Auckland, New Zealand
Yun Sing Koh
CSIRO Scientific Computing, Canberra, Australia
Yanchang Zhao
Data Science and Engineering, Australian Taxation Office, Canberra, Australia
Graco Warwick
Department of Information Technology, University of Wollongong, Wollongong, NSW, Australia
David Stirling
School of Computing and Mathematics, Charles Sturt University, Wagga Wagga, Australia
Chang-Tsun Li
School of Computing and Mathematics, Charles Sturt University, Bathurst, Australia
Zahidul Islam

About this paper

Cite this paper

Ul Haq, I., Gondal, I., Vamplew, P., Brown, S. (2019). Categorical Features Transformation with Compact One-Hot Encoder for Fraud Detection in Distributed Environment. In: Islam, R., et al. Data Mining. AusDM 2018. Communications in Computer and Information Science, vol 996. Springer, Singapore. https://doi.org/10.1007/978-981-13-6661-1_6

DOI: https://doi.org/10.1007/978-981-13-6661-1_6
Published: 16 February 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6660-4
Online ISBN: 978-981-13-6661-1
eBook Packages: Computer Science; Computer Science (R0)
