research-article

Finding Camouflaged Needle in a Haystack?: Pornographic Products Detection via Berrypicking Tree Model

Authors:

Luo SiAuthors Info & Claims

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 365 - 374

https://doi.org/10.1145/3331184.3331197

Published: 18 July 2019 Publication History

Abstract

It is an important and urgent research problem for decentralized eCommerce services, e.g., eBay, eBid, and Taobao, to detect illegal products, e.g., unclassified pornographic products. However, it is a challenging task as some sellers may utilize and change camouflaged text to deceive the current detection algorithms. In this study, we propose a novel task to dynamically locate the pornographic products from very large product collections. Unlike prior product classification efforts focusing on textual information, the proposed model, BerryPIcking TRee MoDel (BIRD), utilizes both product textual content and buyers' seeking behavior information as berrypicking trees. In particular, the BIRD encodes both semantic information with respect to all branches sequence and the overall latent buyer intent during the whole seeking process. An extensive set of experiments have been conducted to demonstrate the advantage of the proposed model against alternative solutions. To facilitate further research of this practical and important problem, the codes and buyers' seeking behavior data have been made publicly available1.

Supplementary Material

MP4 File (cite2-12h00-d2.mp4)

Download
482.67 MB

References

[1]

Prudhvi Ratna Badri Satya, Kyumin Lee, Dongwon Lee, Thanh Tran, and Jason Jiasheng Zhang. 2016. Uncovering fake likers in online social networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2365--2370.

Digital Library

[2]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (2015), 1--15.

[3]

Marcia J. Bates. 1989. The design of browsing and berrypicking techniques for the online search interface. Online review, Vol. 13, 5 (1989), 407--424.

[4]

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, Vol. 3, Feb (2003), 1137--1155.

Digital Library

[5]

Cheng Cao, James Caverlee, Kyumin Lee, Hancheng Ge, and Jinwook Chung. 2015. Organic or organized? Exploring url sharing behavior. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 513--522.

Digital Library

[6]

Elfreda A. Chatman. 1999. A theory of life in the round. Journal of the American Society for information Science, Vol. 50, 3 (1999), 207--217.

Digital Library

[7]

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), 1724--1734.

[8]

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).

Digital Library

[9]

Brenda Dervin. 1998. Sense-making theory and practice: an overview of user interests in knowledge seeking and use. Journal of knowledge management, Vol. 2, 2 (1998), 36--46.

[10]

Carsten Eickhoff, Jaime Teevan, Ryen White, and Susan Dumais. 2014. Lessons from the journey: a query log analysis of within-session learning. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 223--232.

Digital Library

[11]

Song Feng, Longfei Xing, Anupam Gogar, and Yejin Choi. 2012. Distributional Footprints of Deceptive Product Reviews. ICWSM, Vol. 12 (2012), 98--105.

[12]

David Mandell Freeman. 2017. Can you spot the fakes? On the limitations of user feedback in online social networks. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1093--1102.

Digital Library

[13]

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1243--1252.

Digital Library

[14]

Guoxiu He and Wei Lu. 2018. Entire Information Attentive GRU for Text Representation. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR '18). ACM, 163--166.

Digital Library

[15]

Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their applications, Vol. 13, 4 (1998), 18--28.

Digital Library

[16]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, Vol. 9, 8 (1997), 1735--1780.

Digital Library

[17]

Ramon Ferrer i Cancho and Ricard V Solé. 2003. Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, Vol. 100, 3 (2003), 788--791.

[18]

Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), Vol. 20, 4 (2002), 422--446.

Digital Library

[19]

Zhuoren Jiang, Liangcai Gao, Ke Yuan, Zheng Gao, Zhi Tang, and Xiaozhong Liu. 2018. Mathematics Content Understanding for Cyberlearning via Formula Evolution Map. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 37--46.

Digital Library

[20]

Zhuoren Jiang, Yue Yin, Liangcai Gao, Yao Lu, and Xiaozhong Liu. 2018. Cross-language Citation Recommendation via Hierarchical Representation Learning on Heterogeneous Graph. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 635--644.

Digital Library

[21]

Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 562--570.

[22]

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, Volume 1: Long Papers (2014), 655--665.

[23]

Mahmood Khosrowjerdi. 2016. A review of theory-driven models of trust in the online health context. IFLA journal, Vol. 42, 3 (2016), 189--206.

[24]

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014. 1746--1751.

[25]

Diederik P. Kingma and Jimmy Ba. {n. d.}. Adam: A method for stochastic optimization. In International Conference for Learning Representations. 1--15.

[26]

James Krikelas. 1983. Information-seeking behavior: Patterns and concepts. Drexel library quarterly, Vol. 19, 2 (1983), 5--20.

[27]

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature, Vol. 521, 7553 (2015), 436.

[28]

Kyumin Lee, James Caverlee, Zhiyuan Cheng, and Daniel Z. Sui. 2013. Campaign extraction from social media. ACM Transactions on Intelligent Systems and Technology (TIST), Vol. 5, 1 (2013), 9.

Digital Library

[29]

Kyumin Lee, Brian David Eoff, and James Caverlee. 2011. Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media. 185--192.

[30]

Tao Lei, Yu Zhang, Sida I Wang, Hui Dai, and Yoav Artzi. 2018. Simple Recurrent Units for Highly Parallelizable Recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 4470--4481.

[31]

Yuqing Lu, Lei Zhang, Yudong Xiao, and Yangguang Li. 2013. Simultaneously detecting fake reviews and review spammers using factor graph model. In Proceedings of the 5th annual ACM web science conference. ACM, 225--233.

Digital Library

[32]

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. Computer Science (2013).

[33]

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, Vol. 2. 3.

[34]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111--3119.

Digital Library

[35]

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Aistats, Vol. 5. Citeseer, 246--252.

[36]

Myle Ott, Claire Cardie, and Jeff Hancock. 2012. Estimating the prevalence of deception in online review communities. In Proceedings of the 21st international conference on World Wide Web. ACM, 201--210.

Digital Library

[37]

Sherif Saad, Issa Traore, Ali Ghorbani, Bassam Sayed, David Zhao, Wei Lu, John Felix, and Payman Hakimian. 2011. Detecting P2P botnets through network behavior analysis and machine learning. In Privacy, Security and Trust (PST), 2011 Ninth Annual International Conference on. IEEE, 174--180.

[38]

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Volume 1: Long Papers. 440--450.

[39]

Ning Su, Yiqun Liu, Zhao Li, Yuli Liu, Min Zhang, and Shaoping Ma. 2018. Detecting Crowdturfing "Add to Favorites" Activities in Online Shopping. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1673--1682.

Digital Library

[40]

Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 (2015).

[41]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.

Digital Library

[42]

Bingning Wang, Kang Liu, and Jun Zhao. 2016. Inner attention based recurrent neural networks for answer selection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1288--1297.

[43]

Chenglong Wang, Feijun Jiang, and Hongxia Yang. 2017. A hybrid framework for text modeling with convolutional RNN. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2061--2069.

Digital Library

[44]

Ryen W. White, Gary Marchionini, and Gheorghe Muresan. 2008. Evaluating exploratory search systems. Information Processing and Management, Vol. 44, 2 (2008), 433.

Digital Library

[45]

Chang Xu and Jie Zhang. 2015. Towards collusive fraud detection in online reviews. In 2015 IEEE International Conference on Data Mining (ICDM). IEEE, 1051--1056.

Digital Library

[46]

Chang Xu, Jie Zhang, Kuiyu Chang, and Chong Long. 2013. Uncovering collusive spammers in Chinese review websites. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 979--988.

Digital Library

[47]

Junting Ye and Leman Akoglu. 2015. Discovering opinion spammer groups by network footprints. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 267--282.

Digital Library

[48]

Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 901--911.

[49]

Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630 (2015).

Cited By

Huang LYuan BZhang RLu QHuang JChang YCheng XKamps JMurdock VWen JLiu Y(2020)Towards Linking Camouflaged Descriptions to Implicit Products in E-commerceProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3397271.3401067(901-910)Online publication date: 25-Jul-2020
https://dl.acm.org/doi/10.1145/3397271.3401067

Index Terms

Finding Camouflaged Needle in a Haystack?: Pornographic Products Detection via Berrypicking Tree Model
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query log analysis
  2. World Wide Web
    1. Web searching and information discovery
      1. Web search engines
        Spam detection

Recommendations

Implicit Products in the Decentralized eCommerce Ecosystems
JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020

Detecting dark businesses in a decentralized eCommerce ecosystem (e.g. eBay, eBid, and Taobao) is a critical research problem. In this paper, we investigate the characteristics of dark implicit products, the associated buyer seeking behaviors, and ...
Needle in a Haystack: Tracking Down Elite Phishing Domains in the Wild
IMC '18: Proceedings of the Internet Measurement Conference 2018

Today's phishing websites are constantly evolving to deceive users and evade the detection. In this paper, we perform a measurement study on squatting phishing domains where the websites impersonate trusted entities not only at the page content level ...
Poster: CUD: crowdsourcing for URL spam detection
CCS '11: Proceedings of the 18th ACM conference on Computer and communications security

The prevalence of spam URLs in Internet services, such as email, social networks, blogs and online forums has become a serious problem. These spam URLs host spam advertisements, phishing attempts, and malwares, which are harmful for normal users. ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2019

1512 pages

ISBN:9781450361729

DOI:10.1145/3331184

General Chairs:
Benjamin Piwowarski
CNRS - Sorbonne Universite, France
,
Max Chevalier
Universite de Toulouse, CNRS, France
,
Eric Gaussier
Universite Grenoble Alpes, CNRS, France
,
Program Chairs:
Yoelle Maarek
Amazon Research, Israel
,
Jian-Yun Nie
University of Montreal, Canada
,
Falk Scholer
RMIT University, Australia

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 July 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Fundamental Research Funds for the Central Universities

Conference

SIGIR '19

Sponsor:

SIGIR

SIGIR '19: The 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 21 - 25, 2019

Paris, France

Acceptance Rates

SIGIR'19 Paper Acceptance Rate 84 of 426 submissions, 20%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
485
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)0

Reflects downloads up to 30 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Huang LYuan BZhang RLu QHuang JChang YCheng XKamps JMurdock VWen JLiu Y(2020)Towards Linking Camouflaged Descriptions to Implicit Products in E-commerceProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3397271.3401067(901-910)Online publication date: 25-Jul-2020
https://dl.acm.org/doi/10.1145/3397271.3401067

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten