DOI: 10.1145/2396761.2396859
Research article

TCSST: transfer classification of short & sparse text using external data

Published: 29 October 2012

Abstract

Short & sparse text, such as search snippets, micro-blogs and product reviews, is becoming more prevalent on the web. Accurately classifying short & sparse text has emerged as an important yet challenging task. Existing work has used external data (e.g. Wikipedia) to alleviate data sparseness by appending topics detected from the external data as new features. However, training a classifier on features concatenated from different spaces is difficult, because the features have different physical meanings and different significance to the classification task. Concatenation also exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, that exploits external data to tackle the data sparsity issue. The transfer classifier is learned in the original feature space. Because labels for the external data may not be readily available or sufficient, TCSST further exploits unlabeled external data to aid the transfer classification. We develop novel strategies that allow TCSST to iteratively select high-quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark and real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods.
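
The abstract does not spell out how TCSST selects unlabeled external data, so the following is only a rough illustration of the general idea of iterative selection in the original word feature space. It swaps in a generic technique, confidence-thresholded self-training with a nearest-centroid classifier, and is not the authors' algorithm. All names (`self_train`, `min_margin`, `ext_weight`) and the default values are assumptions made for this sketch.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words counts in the original word feature space."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroids(examples):
    """examples: list of (vector, label, weight) -> {label: weighted sum vector}."""
    cents = {}
    for vec, label, weight in examples:
        c = cents.setdefault(label, Counter())
        for k, v in vec.items():
            c[k] += weight * v
    return cents

def predict(cents, vec):
    """Return (best_label, margin between best and second-best similarity)."""
    scores = sorted(((cosine(vec, c), label) for label, c in cents.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up

def self_train(labeled_short, unlabeled_external,
               rounds=3, min_margin=0.2, ext_weight=0.5):
    """Hypothetical selection loop (an assumption, not TCSST itself).

    Each round pseudo-labels the external documents the current model is
    most confident about and adds them to training with a reduced weight.
    """
    train = [(vectorize(t), y, 1.0) for t, y in labeled_short]
    pool = [vectorize(t) for t in unlabeled_external]
    for _ in range(rounds):
        cents = centroids(train)
        keep, rest = [], []
        for vec in pool:
            label, margin = predict(cents, vec)
            (keep if margin >= min_margin else rest).append((vec, label))
        if not keep:  # nothing confident enough: stop early
            break
        train += [(vec, label, ext_weight) for vec, label in keep]
        pool = [vec for vec, _ in rest]
    return centroids(train)
```

Calling `self_train` with a handful of labeled short texts and a list of longer unlabeled documents returns per-class centroids that can be passed to `predict`. Because the external documents contribute words that never occur in the short training texts, the classifier can match short inputs that share no vocabulary with the original labeled data, while no new feature space is appended.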



    Published In

    CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
    October 2012
    2840 pages
    ISBN:9781450311564
    DOI:10.1145/2396761

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. classification
    2. external data
    3. short & sparse text mining
    4. transfer learning
    5. wikipedia

    Qualifiers

    • Research-article

    Conference

    CIKM'12

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


    Cited By

    • (2023) Topic modeling methods for short texts. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:2 (1971-1990). DOI: 10.3233/JIFS-223834. Online publication date: 1-Jan-2023
    • (2020) Identification of Cognitive Learning Complexity of Assessment Questions Using Multi-class Text Classification. Contemporary Educational Technology 12:2 (ep275). DOI: 10.30935/cedtech/8341. Online publication date: 2020
    • (2020) Interpretable Time-series Classification on Few-shot Samples. 2020 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI: 10.1109/IJCNN48605.2020.9206860. Online publication date: Jul-2020
    • (2019) Filtering and Classifying Relevant Short Text with a Few Seed Words. Data and Information Management. DOI: 10.2478/dim-2019-0011. Online publication date: 28-Sep-2019
    • (2018) Review on Recent Advances in Information Mining From Big Consumer Opinion Data for Product Design. Journal of Computing and Information Science in Engineering 19:1 (010801). DOI: 10.1115/1.4041087. Online publication date: 17-Sep-2018
    • (2018) Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Systems with Applications: An International Journal 42:3 (1684-1698). DOI: 10.1016/j.eswa.2014.09.031. Online publication date: 29-Dec-2018
    • (2015) Comparing Tweet Classifications by Authors' Hashtags, Machine Learning, and Human Annotators. Proceedings of the 2015 IEEE / WIC / ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) - Volume 01 (67-74). DOI: 10.1109/WI-IAT.2015.69. Online publication date: 6-Dec-2015
    • (2015) A context-aware approach to detection of short irrelevant texts. 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (1-10). DOI: 10.1109/DSAA.2015.7344831. Online publication date: Oct-2015
    • (2015) An effective and economic bi-level approach to ranking and rating spam detection. 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (1-10). DOI: 10.1109/DSAA.2015.7344794. Online publication date: Oct-2015
    • (2014) Hierarchical multi-label classification of social text streams. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (213-222). DOI: 10.1145/2600428.2609595. Online publication date: 3-Jul-2014
