Abstract
Researchers in the humanities and in companies increasingly need large classified document datasets. These users are not familiar with information retrieval or data science concepts. Data scientists also often need such classified document datasets as ground truth. Multiple tools allow users to carry out this classification task on large datasets, but they always require a fairly expert level of computer and data science knowledge. Moreover, these tools are usually not oriented to the domain of microblogs, or do not always take into account metadata and attached images as additional dimensions to improve the classification. In this work, we present a platform that enables end users to classify large collections of several hundred thousand documents in an assisted way, within a humanly acceptable number of clicks, with no coding and without expert knowledge of data science or information retrieval. The system includes a graphical user interface with several classification assistants: text- and image-based event detection, geographical filtering, image clustering, search services with rich visual metaphors to visualize their results, and Active Learning (AL) with different sampling strategies. We also present a comparative study of the impact of different, interchangeable AL components on the number of clicks needed to reach a stable level of accuracy.
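To make the notion of an AL sampling strategy concrete, the following minimal sketch shows uncertainty sampling, one common strategy in the AL literature. The abstract does not say which strategies CATI ships with, so this is an illustration under assumptions, not the platform's implementation: the function name `uncertainty_sample`, the document ids and the scores are all hypothetical.

```python
from typing import Dict, List


def uncertainty_sample(probabilities: Dict[str, float], batch_size: int) -> List[str]:
    """Return the ids of the documents the classifier is least sure about.

    `probabilities` maps a document id to the estimated probability that
    the document belongs to the positive class (binary classification).
    """
    # Rank by distance from the 0.5 decision boundary; the id breaks ties
    # so that the result is deterministic.
    ranked = sorted(
        probabilities,
        key=lambda doc_id: (abs(probabilities[doc_id] - 0.5), doc_id),
    )
    return ranked[:batch_size]


# Hypothetical classifier scores for five unlabeled documents.
scores = {"t1": 0.97, "t2": 0.52, "t3": 0.10, "t4": 0.45, "t5": 0.80}

# The two documents closest to the boundary are shown to the user first.
to_label = uncertainty_sample(scores, batch_size=2)
print(to_label)  # ['t2', 't4']
```

Each user click labels one of the selected documents; the classifier is then retrained and the selection repeated. Swapping the ranking key for another criterion changes the strategy without touching the rest of the loop, which is the kind of interchangeability the comparative study evaluates.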
Supported by LABEX IMU under the project IDENUM: Identités numériques urbaines. http://imu.universite-lyon.fr/projet/idenum-identites-numeriques-urbaines.
Notes
- 3. CATI’s documentation, videos and source code: https://bitbucket.org/idenum/cati/wiki/Home.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Bosetti, G., Egyed-Zsigmond, E. (2020). CATI: An Extensible Platform Supporting Assisted Classification of Large Datasets. In: Bozzon, A., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2019. Lecture Notes in Business Information Processing, vol 399. Springer, Cham. https://doi.org/10.1007/978-3-030-61750-9_6
DOI: https://doi.org/10.1007/978-3-030-61750-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61749-3
Online ISBN: 978-3-030-61750-9
eBook Packages: Computer Science, Computer Science (R0)