A new co-training-style random forest for computer aided diagnosis

Published in: Journal of Intelligent Information Systems

Abstract

Machine learning techniques used in computer-aided diagnosis (CAD) systems learn a hypothesis that helps medical experts make diagnoses in the future. Learning a well-performing hypothesis requires a large number of expert-diagnosed examples, which places a heavy burden on the experts. By exploiting large numbers of undiagnosed examples together with the power of ensemble learning, the co-training-style random forest (Co-Forest) relieves this burden and produces well-performing hypotheses. However, Co-Forest may suffer from a problem common to other co-training-style algorithms: some unlabeled examples may be assigned wrong labels, and these mislabeled examples then accumulate during the training process. This is because the limited number of originally labeled examples usually produces poor component classifiers that lack both diversity and accuracy. In this paper, a new Co-Forest algorithm named Co-Forest with Adaptive Data Editing (ADE-Co-Forest) is proposed. It not only exploits a specific data-editing technique to identify and discard possibly mislabeled examples throughout the co-labeling iterations, but also employs an adaptive strategy to decide, case by case, whether to trigger the editing operation. The adaptive strategy combines five preconditional theorems, each of which ensures an iterative reduction of classification error and an increase in the size of the new training set under PAC learning theory. Experiments on UCI datasets and an application to the detection of small pulmonary nodules in chest CT images show that ADE-Co-Forest enhances the performance of a learned hypothesis more effectively than Co-Forest and DE-Co-Forest (Co-Forest with data editing but without the adaptive strategy).
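To make the procedure concrete, the sketch below illustrates one co-labeling round in the spirit of ADE-Co-Forest: each tree's concomitant ensemble (all other trees) labels the unlabeled pool, confidently labeled examples are kept, and an editing step discards newly labeled points whose nearest originally-labeled neighbors disagree with the assigned label. This is a minimal illustration, not the paper's algorithm: the names (CONF_THRESHOLD, edit_new_examples, co_forest_round) are hypothetical, binary 0/1 labels and an ENN-style rule stand in for the paper's specific editing technique, and the five preconditional theorems that govern when editing is triggered are reduced to a single boolean flag.

```python
# Minimal sketch of one co-labeling round in the spirit of ADE-Co-Forest.
# All names are hypothetical; binary 0/1 labels are assumed for brevity, and
# the paper's adaptive editing trigger is reduced to a `use_editing` flag.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

CONF_THRESHOLD = 0.75  # assumed confidence threshold (theta in Co-Forest)

def concomitant_vote(trees, skip, X):
    """Label X with the concomitant ensemble of tree `skip` (all other trees);
    confidence is the fraction of those trees agreeing with the majority label."""
    votes = np.array([t.predict(X) for j, t in enumerate(trees) if j != skip])
    labels = np.round(votes.mean(axis=0)).astype(int)  # majority for 0/1 labels
    confidence = (votes == labels).mean(axis=0)
    return labels, confidence

def edit_new_examples(X_l, y_l, X_new, y_new, k=3):
    """ENN-style data editing: keep a newly labeled example only if its k
    nearest originally-labeled neighbors agree with its assigned label."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_l, y_l)
    keep = knn.predict(X_new) == y_new
    return X_new[keep], y_new[keep]

def co_forest_round(trees, X_l, y_l, X_u, use_editing):
    """One co-labeling iteration; `use_editing` stands in for the adaptive
    strategy that decides whether the editing operation is triggered."""
    new_trees = []
    for i in range(len(trees)):
        labels, conf = concomitant_vote(trees, i, X_u)
        mask = conf >= CONF_THRESHOLD            # confidently labeled examples
        X_new, y_new = X_u[mask], labels[mask]
        if use_editing and len(X_new):
            X_new, y_new = edit_new_examples(X_l, y_l, X_new, y_new)
        # retrain this component tree on labeled + (edited) newly labeled data
        X_i, y_i = np.vstack([X_l, X_new]), np.hstack([y_l, y_new])
        new_trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X_i, y_i))
    return new_trees
```

A full implementation would additionally maintain per-tree error estimates and accept a round only when the estimated error times the weight of the enlarged training set decreases, which is what the error-reduction preconditions require; the flag above glosses over that bookkeeping.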


Acknowledgements

This work is supported by the National Science Foundation of China under Grant Nos. 60702033, 60772076, and 2007307000189; the National High-Tech Research and Development Plan of China under Grant No. 2007AA01Z171; the Heilongjiang Science Foundation Key Project under Grant No. ZJG0705; and the Science Foundation for Distinguished Young Scholars of Heilongjiang Province, China, under Grant No. JC200611. The authors thank their partners at the 2nd Affiliated Hospital of Harbin Medical University for collecting and labeling the CT images.

Corresponding author

Correspondence to Chao Deng.


Cite this article

Deng, C., Guo, M.Z. A new co-training-style random forest for computer aided diagnosis. J Intell Inf Syst 36, 253–281 (2011). https://doi.org/10.1007/s10844-009-0105-8
