CATI: An Extensible Platform Supporting Assisted Classification of Large Datasets

  • Conference paper
Web Information Systems and Technologies (WEBIST 2019)

Abstract

Researchers in the humanities and companies increasingly need large classified document datasets, yet these users are often unfamiliar with information retrieval or data science. Data scientists also often need such classified datasets as ground truth. Several tools let users carry out this classification task on large datasets, but they invariably require a fairly expert level of computer and data science knowledge. Moreover, these tools are usually not oriented toward micro-blogs, or do not always take metadata and attached images into account as additional dimensions to improve the classification. In this work, we present a platform that enables end users to classify large collections of several hundred thousand documents in an assisted way, within a humanly acceptable number of clicks, with no coding and without expert knowledge of data science or information retrieval. The system includes a graphical user interface with several classification assistants: text- and image-based event detection, geographical filtering, image clustering, search services with rich visual metaphors for visualizing their results, and Active Learning (AL) with different sampling strategies. We also present a comparative study of the impact of different, interchangeable AL components on the number of clicks needed to reach a stable level of accuracy.
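
To make the idea of interchangeable AL sampling strategies concrete, the following is a minimal sketch, not the platform's actual implementation. It assumes a binary classifier that outputs a probability of the positive class for each unlabeled document. The function names, the seed parameter and the scores are hypothetical. Uncertainty sampling asks the user to label the documents the classifier is least sure about, while random sampling serves as a simple baseline with the same interface.

```python
import random


def uncertainty_sampling(probabilities, k):
    """Pick the k documents whose positive-class probability is closest to 0.5.

    These are the documents the (assumed) binary classifier is least
    confident about, so a user label on them is expected to be most useful.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


def random_sampling(probabilities, k, seed=0):
    """Baseline strategy: pick k documents uniformly at random."""
    rng = random.Random(seed)
    indices = range(len(probabilities))
    return rng.sample(indices, min(k, len(probabilities)))


# Hypothetical classifier scores for six unlabeled documents.
scores = [0.95, 0.52, 0.10, 0.47, 0.80, 0.33]

# Documents 1 and 3 have scores nearest to 0.5, so they are proposed first.
print(uncertainty_sampling(scores, k=2))  # [1, 3]
```

Because both functions take the same arguments and return a list of document indices, either can be plugged into the same labeling loop, which is what allows strategies to be compared by the number of user clicks they require.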

Supported by LABEX IMU under the project IDENUM: Identités numériques urbaines. http://imu.universite-lyon.fr/projet/idenum-identites-numeriques-urbaines

Notes

  1. https://opennlp.apache.org

  2. http://imu.universite-lyon.fr/projet/idenum-identites-numeriques-urbaines

  3. CATI’s documentation, videos and source code: https://bitbucket.org/idenum/cati/wiki/Home

  4. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075

  5. https://monkeylearn.com/

Author information

Corresponding author

Correspondence to Előd Egyed-Zsigmond.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Bosetti, G., Egyed-Zsigmond, E. (2020). CATI: An Extensible Platform Supporting Assisted Classification of Large Datasets. In: Bozzon, A., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2019. Lecture Notes in Business Information Processing, vol 399. Springer, Cham. https://doi.org/10.1007/978-3-030-61750-9_6

  • DOI: https://doi.org/10.1007/978-3-030-61750-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61749-3

  • Online ISBN: 978-3-030-61750-9

  • eBook Packages: Computer Science, Computer Science (R0)
