Detecting and ordering salient regions

Data Mining and Knowledge Discovery

Abstract

We describe an ensemble approach to learning salient regions from arbitrarily partitioned data. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and of a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that data-driven ordering methods provide a statistically significant improvement over the default, natural time-step ordering. This saves the end user significant time by focusing attention on areas of interest without the need to search all of the data conventionally.
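The pipeline outlined in the abstract (per-partition classifiers, a scaled probabilistic vote, score-based ordering, and lift analysis) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-classifier weights, and the toy numbers are all hypothetical, and the abstract does not say how the vote is scaled, so a simple weighted average is assumed here.

```python
from typing import List, Sequence


def scaled_vote(probs: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Combine per-partition salient-class probabilities into one score per region.

    probs[i][j] is the probability classifier i (trained only on partition i)
    assigns to region j being salient. weights[i] is a hypothetical scale
    factor for classifier i (assumption: the source does not define it).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one classifier needs a positive weight")
    n_regions = len(probs[0])
    return [
        sum(w * p[j] for w, p in zip(weights, probs)) / total
        for j in range(n_regions)
    ]


def order_by_score(scores: Sequence[float]) -> List[int]:
    """Return region indices sorted from most to least likely salient."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def lift_at_k(order: Sequence[int], labels: Sequence[int], k: int) -> float:
    """Precision within the first k regions of an ordering, divided by the
    overall fraction of salient regions (the base rate)."""
    base_rate = sum(labels) / len(labels)
    top_precision = sum(labels[j] for j in order[:k]) / k
    return top_precision / base_rate


# Toy example: three partition-local classifiers scoring four regions.
# The second classifier never saw the salient class, so it is given weight 0.
probs = [
    [0.9, 0.2, 0.1, 0.8],
    [0.0, 0.0, 0.0, 0.0],
    [0.7, 0.4, 0.2, 0.6],
]
weights = [2.0, 0.0, 1.0]
labels = [1, 0, 0, 1]  # ground truth: regions 0 and 3 are salient

scores = scaled_vote(probs, weights)
data_driven_order = order_by_score(scores)
time_step_order = list(range(len(labels)))  # default ordering by time step

print(data_driven_order)
print(lift_at_k(data_driven_order, labels, k=2))
print(lift_at_k(time_step_order, labels, k=2))
```

In this toy case the score-based ordering places both salient regions first, so its lift over the top two positions is higher than that of the time-step ordering. A weight of zero is one simple way to keep a classifier that never saw the salient class from diluting the vote; other scalings are equally plausible.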

Author information

Corresponding author

Correspondence to Larry Shoemaker.

Additional information

Responsible editor: Chih-Jen Lin.

About this article

Cite this article

Shoemaker, L., Banfield, R.E., Hall, L.O. et al. Detecting and ordering salient regions. Data Min Knowl Disc 22, 259–290 (2011). https://doi.org/10.1007/s10618-010-0194-6

  • DOI: https://doi.org/10.1007/s10618-010-0194-6
