ABSTRACT
Labeling data is a seemingly simple task required for training many machine learning systems, but it is actually fraught with problems. This paper introduces the notion of concept evolution: the changing nature of a person's underlying concept (the abstract notion of the target class a person is labeling for, e.g., spam email or travel-related web pages), which can result in inconsistent labels and thus be detrimental to machine learning. We introduce two solutions based on structured labeling, a novel technique we propose for helping people define and refine their concept in a consistent manner as they label. Through a series of five experiments, including a controlled lab study, we illustrate the impact and dynamics of concept evolution in practice and show that structured labeling helps people label more consistently in the presence of concept evolution than traditional labeling does.