Abstract
The number of Tibetan web pages is growing rapidly, and information processing technology is needed to extract useful knowledge from them. A Tibetan semantic ontology enriches Tibetan digital resources and helps improve information processing performance. This paper studies the semantic classification of a Tibetan web corpus. First, Tibetan web pages are collected. Second, the pages are preprocessed to extract their useful content. Third, word segmentation and text representation are introduced. Finally, a text-similarity classification algorithm is proposed to classify the texts. In the experiments, semantic classification is compared with non-semantic classification. The results show that semantic classification clearly outperforms non-semantic classification, which indicates that making full use of the semantic relationships in the ontology can greatly improve classification accuracy. The research is useful for the study of Tibetan semantic information processing.
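
The contrast between semantic and non-semantic similarity classification can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, which the abstract does not specify. It assumes bag-of-words vectors, cosine similarity, and nearest-document classification, and it uses English tokens as stand-ins for segmented Tibetan words. The names `TOY_ONTOLOGY`, `to_vector`, `cosine`, and `classify` are hypothetical, as are the example documents.

```python
import math
from collections import Counter

# Hypothetical toy ontology: maps a surface term to a shared concept label.
# English tokens stand in for segmented Tibetan words.
TOY_ONTOLOGY = {
    "doctor": "medicine",
    "hospital": "medicine",
    "clinic": "medicine",
    "teacher": "education",
    "school": "education",
    "pupil": "education",
}


def to_vector(tokens, ontology=None):
    """Build term counts; with an ontology, known terms become their concept."""
    if ontology is None:
        return Counter(tokens)
    return Counter(ontology.get(t, t) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, labelled_docs, ontology=None):
    """Return (label, score) of the most similar labelled document.

    The label is None when no document has a positive similarity.
    """
    query = to_vector(tokens, ontology)
    best_label, best_score = None, 0.0
    for label, doc_tokens in labelled_docs:
        score = cosine(query, to_vector(doc_tokens, ontology))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Assumed labelled examples, invented for illustration only.
LABELLED = [
    ("health", ["hospital", "doctor", "patient"]),
    ("schooling", ["school", "teacher", "lesson"]),
]

query = ["clinic", "nurse"]
plain = classify(query, LABELLED)                   # no shared surface terms
semantic = classify(query, LABELLED, TOY_ONTOLOGY)  # "clinic" maps to "medicine"
print(plain, semantic)
```

In this toy setting the query shares no surface term with either labelled document, so the non-semantic variant finds no match, while mapping terms to ontology concepts lets the query match the "health" document. The sketch shows only why semantic relationships can help. It says nothing about the size of the improvement reported in the paper.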




Cite this article
Xu, GX., Wang, CZ., Wang, LH. et al. Semantic classification method for network Tibetan corpus. Cluster Comput 20, 155–165 (2017). https://doi.org/10.1007/s10586-017-0742-6
DOI: https://doi.org/10.1007/s10586-017-0742-6