DOI: 10.1145/3209978.3210158
short-paper

Do Not Pull My Data for Resale: Protecting Data Providers Using Data Retrieval Pattern Analysis

Published: 27 June 2018

ABSTRACT

Data providers contribute substantially to fields such as finance, economics, and academia by offering both web-based and API-based query services for specialized data. Among their users are data resellers, who abuse the query APIs to retrieve data and resell it for profit, which harms the data provider's interests and infringes its copyright. In this work, we define the "anti-data-reselling" problem and propose a systematic method that combines feature engineering with machine learning models to address it. We apply the method to a real query log covering over 9,000 users, with limited labels, provided by a large financial data provider; it yields reasonable results and insightful observations and has been deployed in practice.
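
The abstract does not list the features or models used, so the following is only a minimal sketch of the general idea of retrieval-pattern analysis, under stated assumptions. The log schema (`user_id`, timestamp, `item_id`), the three per-user features, and the z-score ranking are all hypothetical illustrations, not the authors' method; in particular, the z-score stands in for the machine learning models the paper refers to.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log records: (user_id, unix_timestamp_seconds, item_id).
# The schema and values are invented for illustration only.
LOG = [
    ("alice", 0, "AAPL"), ("alice", 3600, "AAPL"), ("alice", 7200, "MSFT"),
    ("bob", 0, "AAPL"), ("bob", 5000, "GOOG"),
    ("carol", 100, "MSFT"), ("carol", 9000, "MSFT"), ("carol", 20000, "AAPL"),
    # "mallory" sweeps many distinct items at a short, fixed interval.
    *[("mallory", 10 * i, f"ITEM{i}") for i in range(50)],
]


def extract_features(log):
    """Aggregate a raw query log into one feature dict per user."""
    by_user = defaultdict(list)
    for user, ts, item in log:
        by_user[user].append((ts, item))

    features = {}
    for user, rows in by_user.items():
        rows.sort()  # order each user's queries by timestamp
        times = [ts for ts, _ in rows]
        gaps = [b - a for a, b in zip(times, times[1:])]
        features[user] = {
            "n_queries": len(rows),
            "n_distinct_items": len({item for _, item in rows}),
            "mean_gap_s": mean(gaps) if gaps else 0.0,
        }
    return features


def suspicion_scores(features, key="n_distinct_items"):
    """Z-score of one feature across users (a stand-in for a trained model)."""
    values = [f[key] for f in features.values()]
    mu, sigma = mean(values), pstdev(values)
    return {
        user: ((f[key] - mu) / sigma if sigma else 0.0)
        for user, f in features.items()
    }


features = extract_features(LOG)
scores = suspicion_scores(features)
top_user = max(scores, key=scores.get)  # user with the most unusual breadth
```

In this toy log, the user who pulls many distinct items at a regular cadence receives the highest score. A real system would presumably use richer features and a learned model trained on the available labels rather than a single-feature z-score.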

Published in

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN: 9781450356572
DOI: 10.1145/3209978

Copyright © 2018 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '18 Paper Acceptance Rate: 86 of 409 submissions, 21%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%