Semantic classification method for network Tibetan corpus

Cluster Computing

Abstract

The number of Tibetan web pages is growing rapidly, and information processing technology can help extract useful knowledge from them. A Tibetan semantic ontology enriches Tibetan digital resources and helps improve information processing performance. This paper studies the semantic classification of a Tibetan web corpus. First, Tibetan web pages are collected. Second, the pages are preprocessed to extract the useful information. Third, word segmentation and text representation are introduced. Finally, a text similarity classification algorithm is proposed to classify the texts. In the experiments, semantic classification is compared with non-semantic classification. The results show that semantic classification clearly outperforms non-semantic classification, which indicates that making full use of ontology semantic relationships can greatly improve classification accuracy. The research is useful for the study of Tibetan semantic information processing.
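The contrast between semantic and non-semantic similarity classification can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details are not given in this abstract. It assumes the text is already segmented into words, uses plain term counts with cosine similarity, and assigns the label of the most similar labelled document. The `CONCEPTS` table is a hypothetical toy ontology, and the English words are placeholders for segmented Tibetan words.

```python
import math
from collections import Counter

# Hypothetical toy ontology: maps surface words to a shared concept label.
# A real system would draw on a Tibetan semantic ontology; these entries are invented.
CONCEPTS = {
    "doctor": "medicine", "hospital": "medicine", "clinic": "medicine",
    "teacher": "education", "school": "education", "pupil": "education",
}

def to_vector(tokens, use_semantics):
    """Build a term-count vector; optionally replace each word with its concept."""
    if use_semantics:
        tokens = [CONCEPTS.get(t, t) for t in tokens]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(tokens, labelled_docs, use_semantics):
    """Return the label of the most similar labelled document, or None if no overlap."""
    query = to_vector(tokens, use_semantics)
    best_label, best_score = None, 0.0
    for label, doc_tokens in labelled_docs:
        score = cosine(query, to_vector(doc_tokens, use_semantics))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented example data: two labelled documents and one unseen query.
TRAIN = [("health", ["doctor", "hospital"]), ("education", ["teacher", "school"])]
QUERY = ["clinic", "visit"]

print(classify(QUERY, TRAIN, use_semantics=False))
print(classify(QUERY, TRAIN, use_semantics=True))
```

In this toy setup the query shares no surface word with either labelled document, so the non-semantic run finds no match. Once words are mapped to concepts, "clinic" and "doctor" both become "medicine" and the query is assigned to the health class. This is one simple way ontology relations can help a similarity classifier.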

Author information

Corresponding author

Correspondence to Gui-Xian Xu.

About this article

Cite this article

Xu, GX., Wang, CZ., Wang, LH. et al. Semantic classification method for network Tibetan corpus. Cluster Comput 20, 155–165 (2017). https://doi.org/10.1007/s10586-017-0742-6

  • DOI: https://doi.org/10.1007/s10586-017-0742-6
