Sammon Mapping-Based Gradient Boosted Trees for Tax Crime Prediction in the City of São Paulo

  • Conference paper
Enterprise Information Systems (ICEIS 2020)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 417)

Abstract

With the vast volume of data now available, many institutions, including those in the public sector, use information to improve decision-making, and Machine Learning strengthens data-driven decisions with its predictive power. Our principal motivation in this work was to apply Machine Learning to improve fiscal audit planning for the municipality of São Paulo. To that end, we predicted crimes against the city’s service tax system. Our methodology comprised the following steps: data extraction; data preparation; dimensionality reduction; model training and testing; model evaluation; and model selection. Our experiments showed that Sammon Mapping (SM) combined with Gradient Boosted Trees (GBT) outperformed other state-of-the-art works, classifiers, and dimensionality reduction techniques in classification performance. We believe that GBT’s ensemble of classifiers, combined with SM’s ability to identify relevant dimensions in the data, contributed to the higher prediction scores. These scores enable São Paulo’s tax administration to rank fiscal audits by the probability of tax crime occurrence, which helps increase tax revenue.
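
The abstract does not spell out how Sammon Mapping judges a projection. As a minimal sketch, the snippet below computes Sammon's stress, the standard criterion from Sammon's original formulation: the squared difference between each pairwise distance before and after projection, weighted by the inverse of the original distance and normalized by the sum of original distances. The function names and the toy points are assumptions made for illustration; this is not the authors' implementation, and it only scores a given projection rather than optimizing one.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points, keyed by index pair."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def sammon_stress(original, projected):
    """Sammon's stress between a dataset and its lower-dimensional projection.

    Assumes no two original points coincide (their distance would be zero).
    """
    d_orig = pairwise_distances(original)
    d_proj = pairwise_distances(projected)
    scale = sum(d_orig.values())
    error = sum((d_orig[k] - d_proj[k]) ** 2 / d_orig[k] for k in d_orig)
    return error / scale


# Toy data (illustrative only): three 3-D points that all lie in the z = 0 plane.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faithful = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # drops z, keeps all distances
collapsed = [(0.0,), (1.0,), (0.0,)]             # drops y too, distorts distances

print(sammon_stress(original, faithful))
print(sammon_stress(original, collapsed))
```

The first projection preserves every pairwise distance, so its stress is zero up to floating-point rounding; the second distorts two of the three distances and scores higher. A full Sammon Mapping would search for the low-dimensional coordinates that minimize this value before passing them to a classifier such as GBT.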

Supported by São Paulo City Hall.


Author information

Corresponding author

Correspondence to André Ippolito.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ippolito, A., Lozano, A.C.G. (2021). Sammon Mapping-Based Gradient Boosted Trees for Tax Crime Prediction in the City of São Paulo. In: Filipe, J., Śmiałek, M., Brodsky, A., Hammoudi, S. (eds) Enterprise Information Systems. ICEIS 2020. Lecture Notes in Business Information Processing, vol 417. Springer, Cham. https://doi.org/10.1007/978-3-030-75418-1_14

  • DOI: https://doi.org/10.1007/978-3-030-75418-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-75417-4

  • Online ISBN: 978-3-030-75418-1

  • eBook Packages: Computer Science (R0)
