


Authors: Juris Rats and Inguna Pede

Affiliation: RIX Technologies, Blaumana 5a-3, Riga, LV-1011, Latvia

Keyword(s): Machine Learning, Trainset Annotation, Text Clustering, Text Classification, More Like This Query, Elasticsearch.

Abstract: The volume of documents that organisations receive daily grows constantly, forcing them to hire more people to index and route the documents properly. This article proposes a machine learning based model for automating the indexing of incoming documents. The overall automation process is described, and two methods for supporting trainset annotation are analysed and compared. Both methods support experts during annotation by grouping the document stream into groups of similar documents, which is expected to improve both topic selection and document annotation. The first method clusters the documents and selects the next document from the same cluster; the second searches for the next document with an Elasticsearch More Like This (MLT) query. Experimental results show that the MLT query outperforms clustering.
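To make the MLT-based selection of the next document more concrete, the following is a minimal sketch of an Elasticsearch search body that asks for documents similar to the one just annotated while skipping documents that already carry a label. It is an illustration only, not the authors' implementation: the function name, the index name `incoming-docs`, the field name `content`, the document IDs, and the parameter values are all assumptions that do not come from the paper.

```python
import json


def build_next_document_query(index, seed_id, fields, annotated_ids, size=1):
    """Build a search body that finds documents similar to a seed document.

    All argument values used below are hypothetical; the paper's abstract
    does not state the index layout or the MLT parameters that were used.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # More Like This: match documents sharing salient terms
                # with the seed document, referenced by index and id.
                "must": [
                    {
                        "more_like_this": {
                            "fields": fields,
                            "like": [{"_index": index, "_id": seed_id}],
                            "min_term_freq": 1,      # assumed value
                            "max_query_terms": 25,   # assumed value
                        }
                    }
                ],
                # Exclude documents the expert has already annotated.
                "must_not": [{"ids": {"values": annotated_ids}}],
            }
        },
    }


body = build_next_document_query(
    index="incoming-docs",
    seed_id="doc-42",
    fields=["content"],
    annotated_ids=["doc-42", "doc-7"],
)
print(json.dumps(body, indent=2))
```

The resulting dictionary could be passed as the request body of a search call against the index; the top hit would then be offered to the expert as the next document to annotate.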



Paper citation in several formats:
Rats, J. and Pede, I. (2022). Supporting Trainset Annotation for Text Classification of Incoming Enterprise Documents. In Proceedings of the 11th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-583-8; ISSN 2184-285X, SciTePress, pages 211-218. DOI: 10.5220/0011113000003269
