Abstract
With the vast volume of data now available, many institutions, including those in the public sector, use information to improve decision-making. Machine Learning enhances data-driven decision-making with its predictive power. Our principal motivation in this work was to apply Machine Learning to improve fiscal audit planning for the municipality of São Paulo. To that end, we predicted crimes against São Paulo’s service tax system. Our methodology comprised the following steps: data extraction; data preparation; dimensionality reduction; model training and testing; model evaluation; and model selection. Our experimental findings revealed that Sammon Mapping (SM) combined with Gradient Boosted Trees (GBT) outperformed other state-of-the-art works, classifiers, and dimensionality reduction techniques in classification performance. We believe that the ensemble of classifiers in GBT, combined with SM’s ability to identify relevant dimensions in the data, contributed to the higher prediction scores. These scores enable São Paulo’s tax administration to rank fiscal audits by the predicted probability of tax crime occurrence, which can help increase tax revenue.
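The steps listed above can be illustrated with a minimal sketch. It is not our implementation: the data are synthetic (generated with `make_classification`), the component count and split sizes are arbitrary assumptions, and PCA stands in for Sammon Mapping, which to our knowledge scikit-learn does not provide. The sketch only shows how a reduced feature space feeds a gradient boosted classifier whose predicted probabilities are then used to order candidate audits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for taxpayer records (hypothetical; not the paper's data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Data preparation -> dimensionality reduction -> classifier.
# PCA is an assumed stand-in for Sammon Mapping in this sketch.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=8, random_state=0),
    GradientBoostingClassifier(random_state=0),
)
model.fit(X_train, y_train)

# Model evaluation on held-out data.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)

# Rank candidate audits from highest to lowest predicted probability.
audit_ranking = np.argsort(-scores)
top_10 = audit_ranking[:10]
print(f"ROC AUC: {auc:.3f}")
print("Top 10 test rows to audit first:", top_10.tolist())
```

Ranking by `predict_proba` rather than by the hard class label preserves the ordering among cases, which is what an audit plan with limited capacity needs.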
Supported by São Paulo City Hall.
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ippolito, A., Lozano, A.C.G. (2021). Sammon Mapping-Based Gradient Boosted Trees for Tax Crime Prediction in the City of São Paulo. In: Filipe, J., Śmiałek, M., Brodsky, A., Hammoudi, S. (eds) Enterprise Information Systems. ICEIS 2020. Lecture Notes in Business Information Processing, vol 417. Springer, Cham. https://doi.org/10.1007/978-3-030-75418-1_14
DOI: https://doi.org/10.1007/978-3-030-75418-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75417-4
Online ISBN: 978-3-030-75418-1
eBook Packages: Computer Science, Computer Science (R0)