Abstract
We describe an ensemble approach to learning salient regions from arbitrarily partitioned data. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting in order to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and from a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that the use of data-driven ordering methods provides a statistically significant improvement over the use of the default, natural time-step ordering. This saves the end user significant time by focusing attention on areas of interest without the need to conventionally search all of the data.
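To make the voting and ordering steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes each partition's classifier returns a per-class probability dictionary that may omit classes it never saw in training. The `weights` argument is a hypothetical stand-in for the scaling the abstract mentions, whose exact form is not given here, and the function names `combine_votes`, `rank_regions`, and `precision_at_k` are illustrative only.

```python
from typing import Dict, List, Optional, Set


def combine_votes(
    partition_probs: List[Dict[str, float]],
    classes: List[str],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Sum weighted class probabilities across partition classifiers.

    A classifier that never saw a class contributes 0 for that class.
    The weights are an assumed scaling, one per classifier.
    """
    if weights is None:
        weights = [1.0] * len(partition_probs)
    totals = {c: 0.0 for c in classes}
    for probs, w in zip(partition_probs, weights):
        for c in classes:
            totals[c] += w * probs.get(c, 0.0)
    norm = sum(totals.values())
    # Normalize so the combined scores sum to 1 (or stay 0 if no votes).
    return {c: (v / norm if norm > 0 else 0.0) for c, v in totals.items()}


def rank_regions(salient_score: Dict[str, float]) -> List[str]:
    """Order region ids from most to least likely salient."""
    return sorted(salient_score, key=lambda r: salient_score[r], reverse=True)


def precision_at_k(ranking: List[str], truly_salient: Set[str], k: int) -> float:
    """Fraction of the top-k ranked regions that are actually salient."""
    top = ranking[:k]
    return sum(1 for r in top if r in truly_salient) / len(top) if top else 0.0


if __name__ == "__main__":
    # The second classifier was trained on a partition with no salient examples.
    votes = [
        {"salient": 0.9, "unknown": 0.1},
        {"unknown": 1.0},
        {"salient": 0.6, "unknown": 0.4},
    ]
    combined = combine_votes(votes, ["salient", "unknown"], weights=[1.0, 0.5, 1.0])
    print(combined)
    order = rank_regions({"r1": 0.2, "r2": 0.9, "r3": 0.5})
    print(order, precision_at_k(order, {"r2", "r3"}, k=2))
```

Ranking by the combined salient score, rather than by simulation time step, is what lets a measure such as precision in the top k (one point on a lift curve) reward putting correct predictions first.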
Additional information
Responsible editor: Chih-Jen Lin.
About this article
Cite this article
Shoemaker, L., Banfield, R.E., Hall, L.O. et al. Detecting and ordering salient regions. Data Min Knowl Disc 22, 259–290 (2011). https://doi.org/10.1007/s10618-010-0194-6
DOI: https://doi.org/10.1007/s10618-010-0194-6