Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model

Yan, Yingying; Huang, Ruizhang; Ma, Can; Xu, Liyang; Ding, Zhiyuan; Wang, Rui; Huang, Ting; Liu, Bowei

doi:10.1007/978-3-319-63579-8_47

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10366)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data

Abstract

Document clustering for short texts has received considerable interest. Traditional document clustering approaches are designed for long documents and perform poorly on short texts because their representations are sparse. To better understand short texts, we observe that words that appear in long documents can enrich the context of short texts and improve their clustering performance. In this paper, we propose a novel model, namely DDMAfs, which (1) improves the clustering performance of short texts by transferring structural knowledge from long documents to short texts; (2) automatically identifies the number of clusters; and (3) separates discriminative words from irrelevant words in long documents to obtain high-quality structural knowledge. Our experiments indicate that the DDMAfs model performs well on both a synthetic dataset and real datasets. Comparisons between the DDMAfs model and state-of-the-art short text clustering approaches show that the DDMAfs model is effective.
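The abstract does not spell out how the number of clusters is identified, so the following is only a minimal sketch of a generic Dirichlet multinomial allocation prior, not the DDMAfs model itself. It assumes an upper bound `k_max` on the number of clusters and a concentration parameter `alpha`; both names, along with the function names, are hypothetical and chosen for illustration. Cluster weights are drawn from a symmetric Dirichlet with parameter `alpha / k_max`, and each document then picks a cluster. With a small `alpha`, most of the `k_max` candidate clusters typically stay empty, which is the sense in which such a prior lets the data decide how many clusters are used.

```python
import random


def sample_dma_assignments(n_docs, k_max, alpha, seed=0):
    """Draw cluster labels from a Dirichlet multinomial allocation prior.

    Assumed generative sketch (not taken from the paper):
        pi  ~ Dirichlet(alpha / k_max, ..., alpha / k_max)
        z_i ~ Categorical(pi)   for each of the n_docs documents
    """
    rng = random.Random(seed)
    # A Dirichlet draw is a vector of independent Gamma draws, normalised.
    # rng.choices normalises the weights itself, so the raw Gamma draws
    # can be passed directly.
    gammas = [rng.gammavariate(alpha / k_max, 1.0) for _ in range(k_max)]
    return rng.choices(range(k_max), weights=gammas, k=n_docs)


def occupied_clusters(labels):
    """Count how many distinct clusters received at least one document."""
    return len(set(labels))


if __name__ == "__main__":
    labels = sample_dma_assignments(n_docs=200, k_max=50, alpha=1.0)
    print("clusters in use:", occupied_clusters(labels), "of 50")
```

This only samples from the prior. A full clustering model would combine it with a likelihood over words and an inference procedure such as Gibbs sampling, neither of which is shown here.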

Acknowledgments

This work is supported by the National Science Foundation of China (Nos. 61462011, 61202089), the Introduced Talents Science Projects of Guizhou University (No. 2016050), the Major Applied Basic Research Program of Guizhou Province (Grant No. JZ20142001), and the Graduate Innovated Foundation of Guizhou University Project (Nos. 2011015, 2016051).

Author information

Authors and Affiliations

College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou, China
Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang & Bowei Liu
Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang, Guizhou, China
Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang & Bowei Liu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, People’s Republic of China
Ruizhang Huang

Corresponding author

Correspondence to Ruizhang Huang.

Editor information

Editors and Affiliations

Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Lei Chen
Computer Science, Aarhus University, Aarhus N, Denmark
Christian S. Jensen
Computer Science, University of Southern California, Los Angeles, California, USA
Cyrus Shahabi
Northeastern University, Shenyang, China
Xiaochun Yang
Kent State University, Kent, Ohio, USA
Xiang Lian

About this paper

Cite this paper

Yan, Y. et al. (2017). Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model. In: Chen, L., Jensen, C., Shahabi, C., Yang, X., Lian, X. (eds) Web and Big Data. APWeb-WAIM 2017. Lecture Notes in Computer Science, vol 10366. Springer, Cham. https://doi.org/10.1007/978-3-319-63579-8_47

DOI: https://doi.org/10.1007/978-3-319-63579-8_47
Published: 03 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63578-1
Online ISBN: 978-3-319-63579-8
eBook Packages: Computer Science, Computer Science (R0)
