ABSTRACT
This paper studies the entity-centric document filtering task: given an entity represented by its identification page (e.g., a Wikipedia page), how can we correctly identify the documents relevant to it? In particular, we are interested in learning an entity-centric document filter from a small number of training entities, such that the filter can predict document relevance for a large set of unseen entities at query time. To characterize the relevance of a document, the problem boils down to learning keyword importance for the query entities. Since the same keyword can have very different importance for different entities, we cast entity-centric document filtering as a transfer learning problem, in which the challenge is how to appropriately transfer the keyword importance learned from training entities to query entities. Based on the insight that keywords sharing similar "properties" should have similar importance for their respective entities, we propose the novel concept of a meta-feature to map keywords across different entities. To realize meta-feature-based feature mapping, we develop and contrast two models, LinearMapping and BoostMapping. Experiments on three different datasets confirm the effectiveness of the proposed models, which show significant improvement over four state-of-the-art baseline methods.
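The idea of transferring keyword importance through meta-features can be illustrated with a small sketch. This is not the paper's LinearMapping or BoostMapping model: the two meta-features (`is_rare`, `is_common`), the toy entities, the logistic loss, and every function name below are assumptions chosen for illustration. Keyword importance is modeled as a linear function of the keyword's meta-features, so weights learned on a training entity can score documents for an entity whose keywords were never seen in training.

```python
import math

def aggregate(entity_keywords, doc_tokens):
    """Sum the meta-feature vectors of the entity keywords found in the document."""
    present = set(doc_tokens)
    n_meta = len(next(iter(entity_keywords.values())))
    agg = [0.0] * n_meta
    for keyword, meta in entity_keywords.items():
        if keyword in present:
            for j, value in enumerate(meta):
                agg[j] += value
    return agg

def score(beta, entity_keywords, doc_tokens):
    """Document score: sum over matched keywords of (beta . meta-features)."""
    agg = aggregate(entity_keywords, doc_tokens)
    return sum(b * a for b, a in zip(beta, agg))

def train(examples, n_meta, lr=0.1, epochs=200):
    """Fit beta with plain stochastic gradient steps on the logistic loss.

    examples: list of (entity_keywords, doc_tokens, label), label in {0, 1}.
    """
    beta = [0.0] * n_meta
    for _ in range(epochs):
        for entity_keywords, doc_tokens, label in examples:
            agg = aggregate(entity_keywords, doc_tokens)
            s = sum(b * a for b, a in zip(beta, agg))
            p = 1.0 / (1.0 + math.exp(-s))
            for j in range(n_meta):
                beta[j] += lr * (label - p) * agg[j]
    return beta

# Hypothetical meta-features per keyword: (is_rare, is_common).
training_entity = {
    "gates": (1.0, 0.0),
    "microsoft": (1.0, 0.0),
    "year": (0.0, 1.0),
    "people": (0.0, 1.0),
}
train_examples = [
    (training_entity, ["gates", "microsoft", "year"], 1),
    (training_entity, ["year", "people"], 0),
    (training_entity, ["microsoft"], 1),
    (training_entity, ["people"], 0),
]
beta = train(train_examples, n_meta=2)

# An unseen query entity: none of its keywords appeared during training,
# but they are described by the same meta-features.
unseen = {
    "turing": (1.0, 0.0),
    "enigma": (1.0, 0.0),
    "time": (0.0, 1.0),
}
relevant_score = score(beta, unseen, ["enigma", "time"])
irrelevant_score = score(beta, unseen, ["time"])
print(beta, relevant_score, irrelevant_score)
```

Because the learned weights attach to meta-features rather than to the keywords themselves, the document mentioning "enigma" is ranked above the one containing only a common word, even though neither keyword occurred in the training data.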
Index Terms
- Entity-centric document filtering: boosting feature mapping through meta-features