research-article

AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications

Authors:
Yuanfei Luo, 4Paradigm Inc., Beijing, China
Mengshuo Wang, 4Paradigm Inc., Beijing, China
Hao Zhou, 4Paradigm Inc., Beijing, China
Quanming Yao, 4Paradigm Inc., Beijing, China
Wei-Wei Tu, 4Paradigm Inc., Beijing, China
Yuqiang Chen, 4Paradigm Inc., Beijing, China
Wenyuan Dai, 4Paradigm Inc., Beijing, China
Qiang Yang, Hong Kong University of Science and Technology, Hong Kong, Hong Kong

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
Pages 1936–1945
https://doi.org/10.1145/3292500.3330679

Published: 25 July 2019

ABSTRACT

Feature crossing captures interactions among categorical features and is useful for enhancing learning from tabular data in real-world businesses. In this paper, we present AutoCross, an automatic feature crossing tool provided by 4Paradigm to its customers, ranging from banks and hospitals to Internet corporations. By performing beam search in a tree-structured space, AutoCross enables efficient generation of high-order cross features, which existing work has not yet explored. Additionally, we propose successive mini-batch gradient descent and multi-granularity discretization to further improve efficiency and effectiveness, while keeping the tool simple enough that no machine learning expertise or tedious hyper-parameter tuning is required. Furthermore, the algorithms are designed to reduce the computational, transmission, and storage costs involved in distributed computing. Experimental results on both benchmark and real-world business datasets demonstrate the effectiveness and efficiency of AutoCross, and show that it can significantly enhance the performance of both linear and deep models.
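As a rough illustration of the idea in the abstract, the following Python sketch crosses categorical columns and grows the feature set greedily, adding one cross per step. It is not the AutoCross implementation. The function names, the beam width of 1, the rule that each step crosses two features already in the set, and the toy scoring function (training accuracy of a per-category majority vote) are all assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cross(columns):
    """Cross categorical columns row by row: each new value joins the originals."""
    return ["&".join(values) for values in zip(*columns)]


def majority_accuracy(column, labels):
    """Toy score (an assumption, not the paper's evaluation): training accuracy
    of predicting the majority label within each category of the column."""
    by_value = defaultdict(Counter)
    for value, label in zip(column, labels):
        by_value[value][label] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(labels)


def beam_search_crosses(base, labels, steps):
    """Beam search with width 1 (assumed): each step adds the best-scoring
    cross of two features already in the set, so crosses can stack into
    higher orders. Features are keyed by the sorted tuple of base names."""
    selected = {(name,): column for name, column in base.items()}
    for _ in range(steps):
        best = None  # (score, key, column)
        for key_a, key_b in combinations(list(selected), 2):
            key = tuple(sorted(set(key_a) | set(key_b)))
            if key in selected:
                continue  # this combination of base features already exists
            column = cross([base[name] for name in key])
            score = majority_accuracy(column, labels)
            if best is None or score > best[0]:
                best = (score, key, column)
        if best is None:
            break  # nothing new to add
        selected[best[1]] = best[2]
    return selected


# Toy data: the label is the XOR of "a" and "b"; "c" is unrelated to the label.
base = {
    "a": ["0", "0", "1", "1", "0", "0", "1", "1"],
    "b": ["0", "1", "0", "1", "0", "1", "0", "1"],
    "c": ["0", "0", "0", "0", "1", "1", "1", "1"],
}
labels = [0, 1, 1, 0, 0, 1, 1, 0]
selected = beam_search_crosses(base, labels, steps=2)
```

On this toy data no single column predicts the label, but the cross of `a` and `b` does, so it is added first. The second step then stacks `c` onto it and produces a third-order cross, even though `c` carries no information about the label. The toy score cannot tell the difference because it is computed on the training rows, which is one reason a practical system would score candidates on held-out data.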


Index Terms

1. Computing methodologies
  1. Machine learning

Published in
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500
General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)
Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automl
feature crossing
tabular data
Acceptance Rates
KDD '19 paper acceptance rate: 110 of 1,200 submissions, 9%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%