ABSTRACT
We present the ideas and methodologies we used to address the KDD Cup 2013 challenge on author-paper identification. We first formulate the problem as a personalized ranking task and then propose to solve it within a supervised learning framework. The key point is to eliminate the papers that have been incorrectly assigned to a given author, based on existing records. We choose Gradient Boosted Trees as our main classifier. Our exploration leads us to conclude that effective feature engineering is the most critical factor in achieving our results. In this paper, we formulate this process as a unified framework that constructs features from contextual information and combines machine learning techniques with human intelligence. In addition, we suggest several strategies for parsing authors' names, which improve the prediction results significantly. Divide-and-conquer model building and model averaging also improve prediction precision.
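As a rough illustration of how the personalized ranking formulation and model averaging fit together, the sketch below averages per-model scores for each (author, paper) pair and then ranks each author's papers by the averaged score. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rank_papers_per_author`, the input format, the unweighted mean, and the hand-written scores are all hypothetical stand-ins for the outputs of trained classifiers such as gradient boosted trees.

```python
from collections import defaultdict
from statistics import mean


def rank_papers_per_author(scores_by_model):
    """Average per-model scores and rank each author's papers.

    scores_by_model is a list with one dict per model, each mapping
    (author_id, paper_id) -> confidence that the paper belongs to the
    author. This input format is an assumption made for illustration.
    """
    # Collect every model's score for each (author, paper) pair.
    collected = defaultdict(list)
    for model_scores in scores_by_model:
        for pair, score in model_scores.items():
            collected[pair].append(score)

    # Model averaging: an unweighted mean over the models (assumed scheme).
    averaged = {pair: mean(values) for pair, values in collected.items()}

    # Personalized ranking: group the pairs by author.
    per_author = defaultdict(list)
    for (author_id, paper_id), score in averaged.items():
        per_author[author_id].append((paper_id, score))

    # Sort each author's papers by descending score; ties break on paper id.
    return {
        author_id: [
            paper_id
            for paper_id, _ in sorted(items, key=lambda item: (-item[1], item[0]))
        ]
        for author_id, items in per_author.items()
    }


# Hypothetical scores from two models for two authors.
model_a = {(1, 10): 0.9, (1, 11): 0.2, (1, 12): 0.5, (2, 10): 0.4}
model_b = {(1, 10): 0.7, (1, 11): 0.4, (1, 12): 0.9, (2, 10): 0.6}
ranking = rank_papers_per_author([model_a, model_b])
print(ranking)
```

Papers near the bottom of an author's list are the candidates for incorrect assignment. A weighted average or a different tie-breaking rule would fit the same structure.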