Abstract
Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which considers only two levels of knowledge granularity (document and term) and ignores the paragraph granularity that bridges them. This two-level granularity may lead to unsatisfactory clustering results with “false correlation”. To address this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
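The zero-valued similarity problem, and the general idea behind a tolerance-rough-set remedy, can be illustrated with a minimal sketch. It assumes binary term weights, cosine similarity, and the commonly used tolerance rough set construction in which two terms are tolerant of each other if they co-occur in at least theta paragraphs. The function names, the threshold, and the toy paragraphs are hypothetical and are not taken from the paper, so this should not be read as the HRMM procedure itself.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def binary_vector(terms):
    """Binary term weights (an assumption made for this sketch)."""
    return {t: 1.0 for t in terms}


def tolerance_classes(paragraphs, theta):
    """Map each term to the terms co-occurring with it in >= theta paragraphs."""
    cooccurrence = Counter()
    vocabulary = set()
    for paragraph in paragraphs:
        terms = set(paragraph)
        vocabulary |= terms
        for u, v in combinations(sorted(terms), 2):
            cooccurrence[(u, v)] += 1
    classes = {t: {t} for t in vocabulary}  # every term tolerates itself
    for (u, v), count in cooccurrence.items():
        if count >= theta:
            classes[u].add(v)
            classes[v].add(u)
    return classes


def upper_approximation(paragraph, classes):
    """All terms whose tolerance class overlaps the paragraph's terms."""
    terms = set(paragraph)
    return {t for t, cls in classes.items() if cls & terms}


# Toy corpus of paragraphs (hypothetical data for illustration only).
paragraphs = [
    ["cluster", "document"],
    ["cluster", "document", "web"],
    ["granule", "rough"],
    ["granule", "rough", "web"],
    ["cluster"],
    ["document"],
]

classes = tolerance_classes(paragraphs, theta=2)

# Paragraphs 4 and 5 share no term, so their raw similarity is zero.
raw_sim = cosine(binary_vector(paragraphs[4]), binary_vector(paragraphs[5]))

# After expansion both contain "cluster" and "document", so similarity is positive.
expanded_sim = cosine(
    binary_vector(upper_approximation(paragraphs[4], classes)),
    binary_vector(upper_approximation(paragraphs[5], classes)),
)

# Expansion does not link unrelated paragraphs: "cluster" vs "granule" stays at zero.
unrelated_sim = cosine(
    binary_vector(upper_approximation(paragraphs[4], classes)),
    binary_vector(upper_approximation(["granule"], classes)),
)
```

In this toy setting the two single-term paragraphs become comparable only because "cluster" and "document" co-occur often enough elsewhere in the corpus; raising theta makes the expansion more conservative.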
Cite this article
Huang, F., Zhang, S., He, M. et al. Clustering web documents using hierarchical representation with multi-granularity. World Wide Web 17, 105–126 (2014). https://doi.org/10.1007/s11280-012-0197-x