DOI: 10.1145/2517288.2517293
research-article

Contextual rule-based feature engineering for author-paper identification

Published: 11 August 2013

ABSTRACT

We present the ideas and methodologies that we used to address the KDD Cup 2013 challenge on author-paper identification. We first formulate the problem as a personalized ranking task and then solve it within a supervised learning framework. The key point is to eliminate the papers incorrectly assigned to a given author, based on existing records. We choose gradient boosted trees as our main classifier. Our exploration indicates that effective feature engineering is the most critical factor in achieving our results. In this paper, we formulate this process as a unified framework that constructs features from contextual information and combines machine learning techniques with human intelligence. In addition, we suggest several strategies for parsing authors' names, which improve the prediction results significantly. Divide-and-conquer model building and model averaging also improve prediction precision.
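
The abstract does not give implementation details, so the following is only a minimal sketch of two of the ideas it names: a name-parsing feature and model averaging for per-author ranking. The function names (`normalize_name`, `name_match_score`, `rank_papers`), the specific matching rule, and the score values are all assumptions made for illustration and are not taken from the paper. In a full pipeline, the per-paper scores would presumably come from trained classifiers such as gradient boosted trees; here they are supplied by hand.

```python
import re
import unicodedata
from statistics import mean
from typing import Dict, List, Tuple


def normalize_name(name: str) -> Tuple[str, ...]:
    """Strip accents and punctuation, lowercase, and split a name into tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return tuple(re.findall(r"[a-z]+", ascii_name.lower()))


def name_match_score(profile_name: str, paper_name: str) -> float:
    """Hypothetical rule-based feature (an assumption, not the paper's rule).

    1.0 when all tokens agree, 0.5 when the last token and the first
    initial agree, and 0.0 otherwise.
    """
    a, b = normalize_name(profile_name), normalize_name(paper_name)
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    if a[-1] == b[-1] and a[0][0] == b[0][0]:
        return 0.5
    return 0.0


def rank_papers(scores_by_model: List[Dict[int, float]]) -> List[int]:
    """Average each paper's score across models, then sort best first.

    Every dict maps paper id -> score for the same author; ties are
    broken by paper id so the ordering is deterministic.
    """
    paper_ids = scores_by_model[0].keys()
    averaged = {
        pid: mean(model[pid] for model in scores_by_model)
        for pid in paper_ids
    }
    return sorted(averaged, key=lambda pid: (-averaged[pid], pid))
```

For one author, `rank_papers` takes one score dictionary per model and returns the candidate paper ids ordered from most to least likely to be correctly assigned, which is the shape of output a personalized ranking task expects.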


  • Published in

    KDD Cup '13: Proceedings of the 2013 KDD Cup 2013 Workshop
    August 2013
    69 pages
    ISBN:9781450324953
    DOI:10.1145/2517288

    Copyright © 2013 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States
