Topic Discovery and Topic-Driven Clustering for Audit Method Datasets

Zhao, Ying; Fu, Wanyu; Huang, Shaobin

doi:10.1007/978-3-642-25856-5_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7121)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

With the promotion of China’s Golden Auditing Project and the rapid growth of online auditing, thousands of new computer audit methods emerge every year to meet the various needs of audit practice. How to organize these existing computer audit methods and use them intelligently has become a fundamental and challenging problem. In this paper, we propose using topic-driven clustering methods to organize computer audit methods according to the system of computer audit methods issued by the National Audit Office of China. We also apply latent Dirichlet allocation (LDA) analysis to audit method datasets at different levels of granularity. Our experimental results on social insurance computer audit methods show that the topic-driven clustering scheme with topics created by domain experts is the best scheme overall, achieving an average purity of 0.862 across the datasets. Topics discovered by LDA were consistent with the classes defined in the taxonomy for four out of five datasets, and they were effective when used in the topic-driven clustering scheme.
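
The purity figure quoted above can be made concrete with a small sketch. It assumes the standard definition of clustering purity (each cluster is credited with its most frequent class, and the credited counts are divided by the total number of items); the preview does not show the paper's exact formula, so this definition is an assumption. The function name `purity` and the toy cluster assignments and class names are hypothetical and are not taken from the paper's datasets.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster.

    Assumes the standard purity definition; the paper's exact formula
    is not shown in the preview.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each class occurs inside each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: 8 audit methods, 3 clusters, 3 made-up classes.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes = [
    "pension", "pension", "medical",
    "medical", "medical",
    "unemployment", "unemployment", "pension",
]
score = purity(clusters, classes)
print(score)
```

In this toy case each of the three clusters contributes two correctly grouped items, so the purity is 6/8 = 0.75. A value of 1.0 would mean that every cluster contains items from a single class only.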

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China, 100084
Ying Zhao & Wanyu Fu
College of Computer Science and Technology, Harbin Engineering University, Harbin, China, 150001
Shaobin Huang

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jie Tang & Jianyong Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China
Irwin King
Faculty of Engineering and Information Technology, University of Technology, 2007, Sydney, NSW, Australia
Ling Chen

About this paper

Cite this paper

Zhao, Y., Fu, W., Huang, S. (2011). Topic Discovery and Topic-Driven Clustering for Audit Method Datasets. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25856-5_26

DOI: https://doi.org/10.1007/978-3-642-25856-5_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25855-8
Online ISBN: 978-3-642-25856-5
eBook Packages: Computer Science; Computer Science (R0)