research-article

Topic modeling of freelance job postings to monitor web service abuse

Authors:
Do-kyum Kim, University of California, San Diego, La Jolla, CA, USA
Marti Motoyama, University of California, San Diego, La Jolla, CA, USA
Geoffrey M. Voelker, University of California, San Diego, La Jolla, CA, USA
Lawrence K. Saul, University of California, San Diego, La Jolla, CA, USA

AISec '11: Proceedings of the 4th ACM workshop on Security and artificial intelligence, October 2011, Pages 11–20
https://doi.org/10.1145/2046684.2046687

Published: 21 October 2011

ABSTRACT

Web services such as Google, Facebook, and Twitter are recurring victims of abuse, and their plight will only worsen as more attackers are drawn to their large user bases. Many attackers hire cheap human labor to carry out their schemes, connecting with potential workers via crowdsourcing and freelancing sites such as Mechanical Turk and Freelancer.com. To identify solicitations for abuse jobs, these sites need ways to distinguish such tasks from ordinary jobs. In this paper, we show how to discover clusters of abuse tasks using latent Dirichlet allocation (LDA), an unsupervised method for topic modeling in large corpora of text. Applying LDA to hundreds of thousands of unlabeled job postings from Freelancer.com, we find that it discovers clusters of related abuse jobs and identifies the prevalent words that distinguish them. Finally, we use the clusters from LDA to profile the population of workers who bid on abuse jobs and the population of buyers who post their project descriptions.
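To make the clustering step concrete, here is a minimal sketch of LDA applied to a handful of job postings. The abstract does not say which inference procedure the authors use, so this sketch assumes collapsed Gibbs sampling as a simple stand-in. The toy postings, the function names `fit_lda` and `top_words`, and the hyperparameters `alpha`, `beta`, and `iters` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy postings; the real corpus is far larger.
POSTINGS = [
    "solve captcha images fast captcha typing team",
    "need captcha typing workers solve images",
    "design logo for bakery website",
    "website logo design and banner design",
    "captcha solve team typing images daily",
    "bakery website banner and logo",
]


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with a collapsed Gibbs sampler.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts taken from the final sample.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Start from a uniformly random topic for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in topic_word]


if __name__ == "__main__":
    docs = [p.split() for p in POSTINGS]
    doc_topic, topic_word = fit_lda(docs, num_topics=2)
    for k, words in enumerate(top_words(topic_word)):
        print("topic", k, words)
```

In this sketch, a posting's dominant topic is the index of the largest entry in its `doc_topic` row, and the words returned by `top_words` play the role of the prevalent words that characterize a cluster. For a corpus of hundreds of thousands of postings, a maintained LDA implementation would likely be a better choice than this pure-Python loop.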



Published in
AISec '11: Proceedings of the 4th ACM workshop on Security and artificial intelligence
October 2011
124 pages
ISBN: 9781450310031
DOI: 10.1145/2046684
General Chair: Yan Chen (Northwestern University, USA)
Program Chairs: Alvaro A. Cárdenas (Fujitsu Laboratories of America, USA), Rachel Greenstadt (Drexel University, USA), Ben Rubinstein (Microsoft Research, USA)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
crowdsourcing
latent dirichlet allocation
Acceptance Rates
Overall Acceptance Rate: 94 of 231 submissions, 41%