Fraud Detection in Online Market Research

  • Conference paper
Intelligent Systems and Applications (IntelliSys 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 295)

Abstract

The distinguishing approach of this paper is to apply random sampling to the training set and to solve, on each sample, an optimization problem that maximizes fraud coverage while minimizing the false positive rate. We experimented with a variety of optimal solutions, each of which uncovers a different segment of bad actors. Working in an industry setting with only partially labeled data, we combined those labels with a self-developed set of SQL-based rules to obtain, in a timely manner, just enough labeled data for a supervised learning model to detect fraudulent users before they negatively impact our business. At DISQO, a market research firm that provides raw data to its partners and clients and operates a reputable panel where consumers share feedback on a variety of brands and products, we faced challenges related to noisy labeled data. We therefore developed a set of rules that assigns every user a fraud grade: A (red, very suspicious), B (yellow), or C (green). We started with this simple grading system. Then, after formulating an optimization problem that maximizes fraud detection on a randomly sampled training set, we solved it on each sample, collected the solutions, and averaged them to design our final solution, which detects fraud with better precision and improves recall from \(26\%\) to \(52\%\). Lastly, we developed a methodology for combining these optimal coefficient solutions into a well-generalized fraud detection model by averaging the coefficients alongside the dynamic labels via Logistic Regression. However, we achieved the best results when we solved for the optimal fraud coverage segment and trained a hand-picked set of classifiers to learn the separation in the data between bad and good actors.
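One possible reading of the averaging step described above can be sketched in Python. The abstract does not specify the objective, the sample sizes, or the features, so everything here is an assumption: the plain gradient-descent fit stands in for whatever solver was actually used, and the names `fit_logistic` and `averaged_coefficients` and all parameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent.

    Returns one weight per feature followed by the bias term.
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            z = weights[-1] + sum(w * v for w, v in zip(weights, x))
            err = sigmoid(z) - y
            for j, v in enumerate(x):
                grad[j] += err * v
            grad[-1] += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def averaged_coefficients(rows, labels, n_samples=10, sample_size=None, seed=0):
    """Fit on several random subsamples and average the coefficients.

    The subsample count and size are illustrative assumptions.
    """
    rng = random.Random(seed)
    size = sample_size or max(1, len(rows) // 2)
    fits = []
    for _ in range(n_samples):
        idx = rng.sample(range(len(rows)), size)
        fits.append(fit_logistic([rows[i] for i in idx],
                                 [labels[i] for i in idx]))
    # Average each coefficient position across all fitted models.
    return [sum(col) / len(fits) for col in zip(*fits)]


if __name__ == "__main__":
    # Toy data: a single feature that grows with fraud likelihood.
    toy_rows = [[i / 20.0] for i in range(20)]
    toy_labels = [1 if i >= 10 else 0 for i in range(20)]
    print(averaged_coefficients(toy_rows, toy_labels, n_samples=5, seed=1))
```

Averaging across subsamples is meant to reduce the influence of any one noisy sample on the final coefficients; whether the authors weight or filter the individual solutions before averaging is not stated in the abstract.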
We then created a 5-dimensional fraud vector built from the probabilities produced by hand-picked classifiers based on the optimal solutions (we had 3 fraud segments retrieved from the optimal solutions): one dimension contains the CNN probability, two others contain the XGBoost and Logistic Regression probabilities, and another holds the auto-encoder reconstruction error. Finally, we compare the magnitude of the fraud vector across users to quickly assess overall fraud risk, using every classifier probability and the auto-encoder reconstruction error as fraud dimensions.
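The fraud vector comparison can be illustrated with a minimal Python sketch. It assumes the magnitude is the Euclidean norm and that the reconstruction error is rescaled to the range 0 to 1 before being combined with the probabilities; neither detail is stated in the abstract. The function names and the `error_scale` parameter are hypothetical.

```python
import math


def fraud_vector(probabilities, reconstruction_error, error_scale=1.0):
    """Build a fraud vector from classifier probabilities plus a scaled
    auto-encoder reconstruction error.

    Dividing by error_scale and capping at 1.0 is an assumption made so
    that the error is comparable to the probability dimensions.
    """
    if not all(0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    if reconstruction_error < 0 or error_scale <= 0:
        raise ValueError("error must be >= 0 and error_scale must be > 0")
    scaled_error = min(reconstruction_error / error_scale, 1.0)
    return [*probabilities, scaled_error]


def magnitude(vector):
    """Euclidean norm of the fraud vector (assumed risk score)."""
    return math.sqrt(sum(component * component for component in vector))


def rank_users(vectors_by_user):
    """Return user ids ordered from highest to lowest fraud magnitude."""
    return sorted(vectors_by_user,
                  key=lambda user: magnitude(vectors_by_user[user]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical users with four classifier probabilities each.
    users = {
        "user_a": fraud_vector([0.9, 0.8, 0.7, 0.6], 2.0, error_scale=4.0),
        "user_b": fraud_vector([0.1, 0.2, 0.1, 0.0], 0.4, error_scale=4.0),
    }
    for user in rank_users(users):
        print(user, round(magnitude(users[user]), 3))
```

A single scalar per user makes it cheap to sort or threshold the whole panel, at the cost of hiding which dimension drives the score; keeping the full vector alongside the magnitude preserves that detail.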

Supported by DISQO.

Author information

Corresponding author

Correspondence to Vera Kalinichenko.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kalinichenko, V., Atashian, G., Abgaryan, D., Wijaya, N. (2022). Fraud Detection in Online Market Research. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 295. Springer, Cham. https://doi.org/10.1007/978-3-030-82196-8_33
