Abstract
We present a novel cooperative strategy based on active learning and crowdsourcing that addresses the cold start stage, i.e. initializing the classification of a large set of data with no attached labels. The strategy is also designed to handle an imbalanced context in which random selection is highly inefficient. To this end, our method is guided by an unsupervised clustering and by cluster quality and impurity indexes, which are updated at each active learning step. The strategy is illustrated with a case study on annotating Twitter content related to a real flood event. We also show that our technique can cope with multiple heterogeneous data representations.
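The idea of letting cluster impurity steer the choice of the next item to annotate can be illustrated with a minimal sketch. It is not the method of the chapter: the abstract does not define its quality and impurity indexes, so Gini impurity is used here as a stand-in, and the names `gini_impurity` and `next_query` as well as the rule "unexplored clusters first, then the most impure cluster" are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity of a list of labels; 0.0 for an empty or pure list."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def next_query(cluster_of, known_labels):
    """Return the index of the next item to send to annotators, or None.

    cluster_of   -- list mapping item index to an integer cluster id,
                    produced by any unsupervised clustering (assumed given)
    known_labels -- dict mapping item index to the label collected so far
    """
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_of):
        members[cluster].append(idx)

    best_cluster, best_score = None, None
    for cluster, idxs in sorted(members.items()):
        if all(i in known_labels for i in idxs):
            continue  # nothing left to annotate in this cluster
        labeled = [known_labels[i] for i in idxs if i in known_labels]
        # Assumed priority rule: a cluster with no label yet scores 1.0,
        # which is above any Gini value; otherwise use the impurity of
        # the labels gathered so far.
        score = 1.0 if not labeled else gini_impurity(labeled)
        if best_score is None or score > best_score:
            best_cluster, best_score = cluster, score

    if best_cluster is None:
        return None
    return min(i for i in members[best_cluster] if i not in known_labels)
```

In this sketch the impurity of each cluster is recomputed from `known_labels` on every call, which mirrors the abstract's statement that the indexes are updated at each active learning step. Annotation effort is spent first on clusters that have no label at all, then on clusters whose labels disagree, rather than on a uniform random sample.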
Acknowledgements
This work was performed in the context of the Publimape project, funded by the CORE programme of the Luxembourgish National Research Fund (FNR).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Brangbour, E., Bruneau, P., Tamisier, T., Marchand-Maillet, S. (2020). Active Learning with Crowdsourcing for the Cold Start of Imbalanced Classifiers. In: Luo, Y. (ed.) Cooperative Design, Visualization, and Engineering. CDVE 2020. Lecture Notes in Computer Science, vol 12341. Springer, Cham. https://doi.org/10.1007/978-3-030-60816-3_22
DOI: https://doi.org/10.1007/978-3-030-60816-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60815-6
Online ISBN: 978-3-030-60816-3
eBook Packages: Computer Science, Computer Science (R0)