CATI: An Extensible Platform Supporting Assisted Classification of Large Datasets

Bosetti, Gabriela; Egyed-Zsigmond, Előd

doi:10.1007/978-3-030-61750-9_6

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 399)

Included in the following conference series:

International Conference on Web Information Systems and Technologies

Abstract

Researchers in the humanities and companies increasingly need large classified document datasets. These users are not familiar with information retrieval or data science concepts. Data scientists also frequently need such classified document datasets as ground truth. Several tools allow users to carry out this classification task on large datasets, but they always require a fairly expert level of computer and data science knowledge. Moreover, these tools are usually not oriented to the domain of microblogs, or do not always take into account metadata and attached images as additional dimensions to improve the classification. In this work, we present a platform that enables end users to classify large document collections of several hundred thousand documents in an assisted way, within a humanly acceptable number of clicks, with no coding and without expert knowledge of data science or information retrieval. The system includes a graphical user interface with several classification assistants that perform text- and image-based event detection, geographical filtering, image clustering, search services with rich visual metaphors to visualize their results, and finally Active Learning (AL) with different sampling strategies. We also present a comparative study of the impact of using different, interchangeable AL components on the number of clicks needed to reach a stable level of accuracy.
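
To make the idea of an AL sampling strategy concrete, here is a minimal sketch of one common strategy, uncertainty (least-confidence) sampling, for a binary classifier. It is not CATI's implementation: the function name `select_most_uncertain`, the document ids and the probability values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def select_most_uncertain(probabilities: Dict[str, float], batch_size: int) -> List[str]:
    """Return the ids of the unlabeled documents the classifier is least sure about.

    `probabilities` maps a document id to the model's estimated probability
    that the document belongs to the positive class (hypothetical input).
    For a binary task, the closer a probability is to 0.5, the less
    confident the prediction.
    """
    # Sort by distance to 0.5; ties are broken by id so the result is deterministic.
    ranked = sorted(
        probabilities,
        key=lambda doc_id: (abs(probabilities[doc_id] - 0.5), doc_id),
    )
    return ranked[:batch_size]


# Hypothetical scores produced by some classifier on unlabeled tweets.
scores = {"tweet_a": 0.97, "tweet_b": 0.52, "tweet_c": 0.10, "tweet_d": 0.41}

# The two documents a human annotator would be asked to label next.
to_label = select_most_uncertain(scores, batch_size=2)
print(to_label)
```

In an assisted loop of this kind, the annotator's answers for the selected documents would be added to the training set and the model retrained, so that each click is spent where it is expected to be most informative. Other strategies differ only in how the ranking key is computed.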

Supported by LABEX IMU under the project IDENUM: Identités numériques urbaines. http://imu.universite-lyon.fr/projet/idenum-identites-numeriques-urbaines

Notes

1. https://opennlp.apache.org
2. http://imu.universite-lyon.fr/projet/idenum-identites-numeriques-urbaines
3. CATI's documentation, videos and source code: https://bitbucket.org/idenum/cati/wiki/Home
4. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075
5. https://monkeylearn.com/

Author information

Authors and Affiliations

Université de Lyon, LIRIS UMR 5205 CNRS, Bâtiment Blaise Pascal, 20 Avenue Albert Einstein, 69621, Villeurbanne, France
Gabriela Bosetti & Előd Egyed-Zsigmond

Corresponding author

Correspondence to Előd Egyed-Zsigmond.

Editor information

Editors and Affiliations

Delft University of Technology, Delft, The Netherlands
Alessandro Bozzon
University of Seville, Sevilla, Spain
Francisco José Domínguez Mayo
Polytechnic Institute of Setúbal/INSTICC, Setúbal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Bosetti, G., Egyed-Zsigmond, E. (2020). CATI: An Extensible Platform Supporting Assisted Classification of Large Datasets. In: Bozzon, A., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2019. Lecture Notes in Business Information Processing, vol 399. Springer, Cham. https://doi.org/10.1007/978-3-030-61750-9_6

DOI: https://doi.org/10.1007/978-3-030-61750-9_6
Published: 03 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61749-3
Online ISBN: 978-3-030-61750-9
eBook Packages: Computer Science, Computer Science (R0)
