short-paper

Do Not Pull My Data for Resale: Protecting Data Providers Using Data Retrieval Pattern Analysis

Authors:

Wei Xu

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Pages 1053 - 1056

https://doi.org/10.1145/3209978.3210158

Published: 27 June 2018

Abstract

Data providers make a profound contribution to many fields, such as finance, the economy, and academia, by offering both web-based and API-based query services for specialized data. Among the data users are data resellers who abuse the query APIs to retrieve and resell the data for profit, which harms the data provider's interests and infringes copyright. In this work, we define the "anti-data-reselling" problem and propose a new systematic method that combines feature engineering and machine learning models to address it. We apply our method to a real query log of over 9,000 users, with limited labels, provided by a large financial data provider, and obtain reasonable results, insightful observations, and real deployments.
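
The abstract does not say which features or models the method uses, so the following is only a minimal sketch of the general idea (per-user features derived from a query log, followed by a simple outlier rule), not the paper's method. The log schema `(user_id, timestamp_seconds, item_id)`, the three features, and the helper names `user_features` and `flag_heavy_users` are all assumptions made for illustration. The outlier rule is a generic modified z-score based on the median absolute deviation, standing in for whatever model the authors actually trained.

```python
from collections import defaultdict
from statistics import median


def user_features(log):
    """Aggregate a query log into per-user retrieval-pattern features.

    `log` is an iterable of (user_id, timestamp_seconds, item_id) tuples.
    This schema is hypothetical; the real log format is not described.
    """
    by_user = defaultdict(list)
    for user, ts, item in log:
        by_user[user].append((ts, item))

    feats = {}
    for user, events in by_user.items():
        times = sorted(ts for ts, _ in events)
        gaps = [b - a for a, b in zip(times, times[1:])]
        feats[user] = {
            "n_queries": len(events),
            "n_distinct_items": len({item for _, item in events}),
            # Median time between consecutive queries; inf for a single query.
            "median_gap": median(gaps) if gaps else float("inf"),
        }
    return feats


def flag_heavy_users(feats, key="n_distinct_items", threshold=3.5):
    """Flag users whose feature lies far above the population median.

    Uses the modified z-score 0.6745 * (x - median) / MAD. This is an
    illustrative stand-in for a trained model, not the authors' classifier.
    """
    values = [f[key] for f in feats.values()]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return set()  # no spread in the data, nothing can be flagged
    return {
        user
        for user, f in feats.items()
        if 0.6745 * (f[key] - med) / mad > threshold
    }


# Synthetic example: five ordinary users and one bulk downloader.
log = []
for k, n_items in enumerate([2, 3, 3, 4, 5]):
    for j in range(n_items):
        log.append((f"u{k}", 3600 * j, f"item{j}"))  # one query per hour
for j in range(50):
    log.append(("bulk", j, f"item{j}"))  # one query per second

feats = user_features(log)
flagged = flag_heavy_users(feats)
print(sorted(flagged))
```

In this synthetic log only the `bulk` user is flagged. A real deployment would need richer features and, where labels exist, a supervised model; the threshold of 3.5 is a common convention for modified z-scores rather than a value taken from the paper.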

Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

June 2018

1509 pages

ISBN: 9781450356572

DOI: 10.1145/3209978

General Chairs: Kevyn Collins-Thompson (University of Michigan, United States), Qiaozhu Mei (University of Michigan, United States)

Program Chairs: Brian Davison (Lehigh University, United States), Yiqun Liu (Tsinghua University, China), Emine Yilmaz (University College London, United Kingdom)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Huawei Technologies
Tsinghua Initiative Research Program
MOE Online Education Research Center (Quantong Fund)
National Natural Science Foundation of China
Ant Financial

Conference

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Sponsor: SIGIR

July 8 - 12, 2018

Ann Arbor, MI, USA

Acceptance Rates

SIGIR '18 Paper Acceptance Rate 86 of 409 submissions, 21%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Total Citations: 0
Total Downloads: 182
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025
