Abstract
The performance of classification tasks depends heavily on data quality, yet in real-world data label noise inevitably exists because of data entry errors, transmission errors, and the subjectivity of taggers. Different methods have been proposed to deal with label imperfection, including robust algorithms that avoid overfitting, filtering mechanisms that remove noisy instances, and correction mechanisms that revise noisy labels. In this paper, we propose a knowledge graph based approach to detect and correct label errors in training data. Experiments on a medical Q&A data set show that our approach can improve both classification performance and data quality. The results also show that our approach can work at a relatively high noise level and can be applied to other data mining tasks that demand deep understanding.
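To make the distinction between filtering and correction concrete, the following is a minimal sketch and not the method of this paper. It assumes a toy term-to-concept map standing in for a knowledge graph; the map entries, the function names `suggest_label`, `filter_noise`, and `correct_noise`, and the sample questions are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy "knowledge graph": term -> concept (category) edges.
# These entries are illustrative assumptions, not data from the paper.
TERM_TO_CONCEPT = {
    "cough": "respiratory",
    "asthma": "respiratory",
    "wheezing": "respiratory",
    "rash": "dermatology",
    "eczema": "dermatology",
    "itching": "dermatology",
}


def suggest_label(text):
    """Return the concept most supported by the terms in `text`, or None."""
    votes = Counter(
        TERM_TO_CONCEPT[token]
        for token in text.lower().split()
        if token in TERM_TO_CONCEPT
    )
    if not votes:
        return None
    ranked = votes.most_common(2)
    # Abstain on ties so that ambiguous questions are left untouched.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def filter_noise(samples):
    """Filtering: drop samples whose label contradicts the suggestion."""
    return [
        (text, label)
        for text, label in samples
        if suggest_label(text) in (None, label)
    ]


def correct_noise(samples):
    """Correction: keep every sample but overwrite contradicted labels."""
    return [(text, suggest_label(text) or label) for text, label in samples]


samples = [
    ("persistent cough and wheezing at night", "respiratory"),
    ("itching rash on both arms", "respiratory"),  # mislabeled on purpose
    ("what should I eat after surgery", "dermatology"),  # no graph evidence
]

print(filter_noise(samples))
print(correct_noise(samples))
```

In this sketch, filtering discards the mislabeled second question, whereas correction keeps it and relabels it. Questions with no supporting terms, or with tied evidence, are left unchanged in both cases.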
Acknowledgements
This work was supported by the NSFC (No. 61272099, 61261160502 and 61202025), Shanghai Excellent Academic Leaders Plan (No. 11XD1402900), the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT1158, PCSIRT), the Scientific Innovation Act of STCSM (No. 13511504200), Singapore NRF (CREATE E2S2), and the EU FP7 CLIMBER project (No. PIRSES-GA-2012-318939).
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Liu, Y., Li, H., Chen, Y. (2014). Using Knowledge Graph to Handle Label Imperfection. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_32
DOI: https://doi.org/10.1007/978-3-319-13186-3_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science (R0)