Abstract
Semantic entities carry the most important semantics of text data. However, traditional approaches such as named entity recognition and new word identification can detect only certain types of entities. In addition, they generally adopt sequence annotation algorithms such as the Hidden Markov Model (HMM) and Conditional Random Field (CRF), which can utilize only limited context information. As a result, they are ineffective at extracting semantic entities that never appeared in the training data. In this paper, we propose a strategy to extract unknown semantic entities from text by integrating statistical features with Decision Tree (DT) and Support Vector Machine (SVM) algorithms. With the proposed statistical features and novel classification approach, our strategy can detect more semantic entities than traditional approaches such as CRF and Bootstrapping-SVM methods. It is also highly sensitive to new entities that have only just appeared in fresh data. Our experimental results show that the precision, recall, and F1 score of our strategy are, on average, about 23.6%, 21.5%, and 25.8% higher, respectively, than those of the representative approaches.
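The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes that extracted and reference entities are compared as exact-match sets of strings, which the abstract does not confirm. The function name and the sample entities are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of entity strings.

    Assumption: an extracted entity counts as correct only if it exactly
    matches an entity in the gold set.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data, not results from the paper.
gold_entities = {"entity_a", "entity_b", "entity_c", "entity_d"}
extracted_entities = {"entity_a", "entity_b", "entity_x"}

p, r, f1 = precision_recall_f1(extracted_entities, gold_entities)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it rewards a method only when both measures are high at the same time.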
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, D., Liu, X., Luo, H., Fan, J. (2013). Semantic Entity Identification in Large Scale Data via Statistical Features and DT-SVM. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_30
DOI: https://doi.org/10.1007/978-3-642-41230-1_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science, Computer Science (R0)