Semantic Entity Identification in Large Scale Data via Statistical Features and DT-SVM

  • Conference paper
Web Information Systems Engineering – WISE 2013 (WISE 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8180)

Abstract

Semantic entities carry the most important semantics of text data. However, traditional approaches such as named entity recognition and new word identification can detect only certain specific types of entities. In addition, they generally adopt sequence annotation algorithms such as the Hidden Markov Model (HMM) and the Conditional Random Field (CRF), which can utilize only limited context information. As a result, they are ineffective at extracting semantic entities that never appeared in the training data. In this paper we propose a strategy to extract unknown semantic entities from text by integrating statistical features with the Decision Tree (DT) and Support Vector Machine (SVM) algorithms. With the proposed statistical features and a novel classification approach, our strategy can detect more semantic entities than traditional approaches such as CRF and Bootstrapping-SVM methods. It is also highly sensitive to new entities that have only just appeared in fresh data. Our experimental results show that the precision, recall, and F1 score of our strategy are on average about 23.6%, 21.5%, and 25.8% higher, respectively, than those of the representative approaches.
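
For readers unfamiliar with the three metrics reported in the abstract, the following is a minimal Python sketch of how precision, recall, and F1 score are conventionally computed for an entity extraction task. It is not the authors' evaluation code: the function name `precision_recall_f1`, the exact-match comparison of entity strings, and the toy entity sets are all assumptions made for illustration, and the paper's actual evaluation protocol is not visible in this preview.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for extracted entities.

    Assumption: an extracted entity counts as correct only if it
    exactly matches an entity in the gold (reference) set.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Precision: fraction of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold entities that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not taken from the paper.
gold = {"WISE 2013", "Springer", "Decision Tree", "SVM"}
predicted = {"WISE 2013", "Springer", "SVM", "large scale"}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

With these toy sets, three of the four extracted entities are correct and three of the four gold entities are found, so all three metrics come out to 0.75.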

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, D., Liu, X., Luo, H., Fan, J. (2013). Semantic Entity Identification in Large Scale Data via Statistical Features and DT-SVM. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_30

  • DOI: https://doi.org/10.1007/978-3-642-41230-1_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41229-5

  • Online ISBN: 978-3-642-41230-1

  • eBook Packages: Computer Science, Computer Science (R0)
