ABSTRACT
Data providers contribute substantially to fields such as finance, economics, and academia by offering both web-based and API-based query services for specialized data. Some data users, however, are data resellers who abuse the query APIs to retrieve the data and resell it for profit, which harms the data provider's interests and constitutes copyright infringement. In this work, we define the "anti-data-reselling" problem and propose a systematic method that combines feature engineering with machine learning models to address it. We apply our method to a real query log of over 9,000 users, with limited labels, provided by a large financial data provider, and obtain reasonable results, insightful observations, and real deployments.
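To make the idea of retrieval-pattern analysis concrete, the following is a minimal sketch and not the method of the paper. The abstract does not specify the log schema, the features, or the models, so everything here is an assumption made for illustration: the `(user_id, timestamp, item_id)` record layout, the feature names `n_queries`, `coverage`, and `median_gap`, and the median/MAD outlier score, which stands in for a trained machine learning model.

```python
from collections import defaultdict
from statistics import median


def extract_features(log):
    """Aggregate a query log into per-user features.

    `log` is an iterable of (user_id, timestamp_seconds, item_id) tuples.
    This schema is hypothetical; the abstract does not describe the real one.
    """
    by_user = defaultdict(list)
    for user, ts, item in log:
        by_user[user].append((ts, item))

    features = {}
    for user, events in by_user.items():
        events.sort()
        times = [t for t, _ in events]
        gaps = [b - a for a, b in zip(times, times[1:])]
        features[user] = {
            # Total number of queries issued by the user.
            "n_queries": len(events),
            # Share of queries that hit a distinct item (1.0 = no repeats).
            "coverage": len({i for _, i in events}) / len(events),
            # Typical time between consecutive queries, in seconds.
            "median_gap": median(gaps) if gaps else float("inf"),
        }
    return features


def robust_scores(features, key):
    """Score each user by |x - median| / MAD on a single feature.

    A simple unsupervised stand-in for a learned model (assumption).
    """
    values = [f[key] for f in features.values()]
    med = median(values)
    mad = median([abs(v - med) for v in values]) or 1.0  # avoid division by zero
    return {user: abs(f[key] - med) / mad for user, f in features.items()}
```

Under these assumptions, a user who sweeps through many distinct items at short, regular intervals would receive a much higher `n_queries` score than users who occasionally revisit a few items. In practice, such features would more plausibly feed a supervised or semi-supervised model trained on the limited labels the abstract mentions.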
Index Terms
- Do Not Pull My Data for Resale: Protecting Data Providers Using Data Retrieval Pattern Analysis