Sammon Mapping-Based Gradient Boosted Trees for Tax Crime Prediction in the City of São Paulo

Ippolito, André; Lozano, Augusto Cezar Garcia

doi:10.1007/978-3-030-75418-1_14

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 417)

Included in the following conference series: International Conference on Enterprise Information Systems

Abstract

With the vast volume of data now available, many institutions, including those in the public sector, use information to improve decision-making. Machine Learning enhances data-driven decision-making with its predictive power. Our principal motivation in this work was to apply Machine Learning to improve fiscal audit planning for the municipality of São Paulo. To that end, we predicted crimes against São Paulo’s service tax system. Our methodology comprised the following steps: data extraction, data preparation, dimensionality reduction, model training and testing, model evaluation, and model selection. Our experimental findings revealed that Sammon Mapping (SM) combined with Gradient Boosted Trees (GBT) outperformed other state-of-the-art approaches, classifiers, and dimensionality reduction techniques in classification performance. We believe that GBT’s ensemble of classifiers, combined with SM’s ability to identify relevant dimensions in the data, contributed to the higher prediction scores. These scores enable São Paulo’s tax administration to rank fiscal audits by the probability of tax crime occurrence, which in turn helps to increase tax revenue.
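
As a rough illustration of how the dimensionality reduction, training, and ranking steps could fit together, the following Python sketch applies a small hand-written Sammon Mapping to synthetic data and then trains gradient boosted trees on the reduced features. It is not the authors' implementation: the synthetic dataset, the PCA initialisation, the adaptive step size, the choice of 5 output dimensions, and all function names (`euclidean_distance_matrix`, `sammon_stress`, `sammon_mapping`) are assumptions made for this example. Because Sammon Mapping has no out-of-sample transform, the sketch maps all rows (features only, no labels) before splitting into training and test sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def euclidean_distance_matrix(X):
    """Return the matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sammon_stress(D, Y, eps=1e-12):
    """Sammon stress: distance mismatch, weighted by 1 / input distance."""
    d = euclidean_distance_matrix(Y)
    iu = np.triu_indices_from(D, k=1)  # each pair i < j counted once
    return ((D[iu] - d[iu]) ** 2 / (D[iu] + eps)).sum() / D[iu].sum()


def sammon_mapping(X, n_components=2, n_iter=200, eps=1e-12):
    """Minimise Sammon stress by gradient descent (assumes no duplicate rows)."""
    D = euclidean_distance_matrix(X)
    c = D[np.triu_indices_from(D, k=1)].sum()

    # Assumption: start from a PCA projection rather than a random layout.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:n_components].T
    stress = sammon_stress(D, Y)

    step = 1.0
    for _ in range(n_iter):
        d = euclidean_distance_matrix(Y)
        W = (D - d) / (D * d + eps)
        np.fill_diagonal(W, 0.0)
        # Gradient of the stress with respect to each low-dimensional point.
        grad = (-2.0 / c) * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
        Y_new = Y - step * grad
        new_stress = sammon_stress(D, Y_new)
        if new_stress < stress:
            # Accept the move and try a larger step next time.
            Y, stress = Y_new, new_stress
            step *= 2.0
        else:
            # Reject the move and shrink the step.
            step *= 0.5
    return Y, stress


# Synthetic, imbalanced stand-in for taxpayer data (hypothetical).
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)

# Dimensionality reduction on features only, then a stratified split.
Y, stress = sammon_mapping(X, n_components=5)
Y_train, Y_test, y_train, y_test = train_test_split(
    Y, y, test_size=0.3, stratify=y, random_state=0)

# Gradient boosted trees on the reduced features.
gbt = GradientBoostingClassifier(random_state=0).fit(Y_train, y_train)

# Rank test cases from highest to lowest predicted probability of class 1.
proba = gbt.predict_proba(Y_test)[:, 1]
ranking = np.argsort(-proba)
```

Here `ranking` orders the held-out cases by decreasing predicted probability, mirroring the idea of prioritising audits by predicted risk. In a production setting, new taxpayers would need either a re-run of the mapping or an approximate out-of-sample extension, which this sketch does not cover.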

Supported by São Paulo City Hall.

Author information

Authors and Affiliations

Tax Intelligence Office, Under-secretariat of Municipal Revenue, Secretariat of Finance, São Paulo City Hall, São Paulo, Brazil
André Ippolito & Augusto Cezar Garcia Lozano

Corresponding author

Correspondence to André Ippolito.

Editor information

Editors and Affiliations

Polytechnic Institute of Setúbal/INSTICC, Setúbal, Portugal
Joaquim Filipe
Warsaw University of Technology, Warsaw, Poland
Michał Śmiałek
George Mason University, Fairfax, VA, USA
Alexander Brodsky
MODESTE/ESEO, Angers, France
Slimane Hammoudi

Cite this paper

Ippolito, A., Lozano, A.C.G. (2021). Sammon Mapping-Based Gradient Boosted Trees for Tax Crime Prediction in the City of São Paulo. In: Filipe, J., Śmiałek, M., Brodsky, A., Hammoudi, S. (eds) Enterprise Information Systems. ICEIS 2020. Lecture Notes in Business Information Processing, vol 417. Springer, Cham. https://doi.org/10.1007/978-3-030-75418-1_14

DOI: https://doi.org/10.1007/978-3-030-75418-1_14
Published: 01 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75417-4
Online ISBN: 978-3-030-75418-1
eBook Packages: Computer Science, Computer Science (R0)
