QUALM: Holistic Measurement and Improvement of Data Quality in Text Analytics (original title: QUALM: Ganzheitliche Messung und Verbesserung der Datenqualität in der Textanalyse)

Kiefer, Cornelia; Reimann, Peter; Mitschang, Bernhard

doi:10.1007/s13222-019-00318-7

Focus article (Schwerpunktbeitrag)
Published: 06 June 2019

Datenbank-Spektrum, Volume 19, pages 137–148 (2019)

Abstract

Existing approaches to measuring and improving the quality of text data in text analytics have three major drawbacks. Evaluation metrics such as accuracy measure quality reliably, but they (1) depend on gold annotations that are laborious to create by hand and (2) offer no starting points for improving quality. Initial domain-specific data quality methods for unstructured text data work without gold annotations and do offer starting points for improving data quality. However, these methods were developed only for limited application areas and (3) therefore do not take into account the specifics of many analysis tools in text analysis processes. In this article, we present the QUALM concept for high-quality mining of text data (QUALity Mining), which addresses these three drawbacks. The goal of QUALM is to increase the quality of the analysis results, e.g. the accuracy of a text classification, by measuring and improving data quality. To this end, QUALM offers a set of QUALM data quality methods. QUALM indicators capture data quality holistically, based on the fit between the input data and the specifics of the analysis tools, such as the features, training data, and semantic resources used (for example dictionaries or taxonomies). Each indicator has a matching modifier that can change both the data and the specifics of the analysis tools in order to increase data quality. In a first evaluation of QUALM, we show for concrete analysis tools and data sets that applying the QUALM data quality methods also goes hand in hand with an increase in the quality of the analysis results in terms of the evaluation metric accuracy. For this purpose, the fit between the input data and the specifics of the analysis tools is increased with concrete QUALM modifiers, which, for example, resolve abbreviations or automatically suggest suitable training data on the basis of text similarity metrics.
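The interplay of indicator and modifier described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their code is linked in the notes): the function names `vocabulary_fit`, `expand_abbreviations`, and `suggest_training_corpus`, the toy vocabulary, the abbreviation table, and the candidate corpora are all hypothetical, and plain bag-of-words cosine similarity stands in for whichever text similarity metrics QUALM actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def vocabulary_fit(text, vocabulary):
    """Hypothetical indicator: share of input tokens the tool's vocabulary covers."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def expand_abbreviations(text, abbreviations):
    """Hypothetical modifier: replace known abbreviations with their long forms.

    Note that tokenization also lowercases the text and drops punctuation.
    """
    return " ".join(abbreviations.get(token, token) for token in tokenize(text))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_training_corpus(input_text, corpora):
    """Hypothetical modifier: name of the candidate corpus most similar to the input."""
    return max(corpora, key=lambda name: cosine_similarity(input_text, corpora[name]))


# Toy data, invented for illustration only.
vocabulary = {"the", "engine", "temperature", "is", "too", "high", "please", "check"}
abbreviations = {"eng": "engine", "temp": "temperature", "pls": "please"}
corpora = {
    "news": "the government announced a new budget today",
    "tweets": "omg temp so high pls check the eng lol",
}

raw = "eng temp too high pls check"
cleaned = expand_abbreviations(raw, abbreviations)

print(vocabulary_fit(raw, vocabulary))        # 0.5: three of six tokens are known
print(vocabulary_fit(cleaned, vocabulary))    # 1.0: all six tokens are known
print(suggest_training_corpus(raw, corpora))  # tweets
```

The two modifiers act on different sides of the fit: abbreviation expansion changes the input data so that it matches the tool's vocabulary, whereas the corpus suggestion changes the tool by picking training data that resembles the input. In both cases no gold annotations are needed to compute the indicator.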

Notes

https://tika.apache.org/.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html.
https://www.kaggle.com/datasets.
https://wordnet.princeton.edu/.
http://www.sfs.uni-tuebingen.de/GermaNet/.
http://www.nltk.org/.
https://stanfordnlp.github.io/CoreNLP/, https://nlp.stanford.edu/software/CRF-NER.shtml.
https://www.cs.waikato.ac.nz/ml/weka/.
https://rapidminer.com/.
https://github.com/kieferca/qualm.
http://www.nltk.org/nltk_data/.
https://www.nltk.org/api/nltk.sentiment.html.
http://www.cs.cornell.edu/people/pabo/movie-review-data/.
For this example, the tweets from the data collection in NLTK were split into two disjoint training and test data sets.
https://github.com/felipebravom/StaticTwitterSent/tree/master/extra/Sentiment140-Lexicon-v0.1.
https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis.
https://www.mongodb.com/.
http://www.nltk.org/nltk_data/.
https://tika.apache.org/.
https://github.com/optimaize/language-detector.
From the DKPro Core library: https://dkpro.github.io/dkpro-core/.
http://www.nltk.org/_modules/nltk/tag/crf.html.
http://www.nltk.org/_modules/nltk/tag/perceptron.html.
https://www.nltk.org/_modules/nltk/tag/tnt.html.
https://github.com/kieferca/qualm.

Acknowledgments

The authors thank the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) for financial support of this project within the graduate school GSaME (Graduate School of Excellence advanced Manufacturing Engineering) at the Universität Stuttgart.

Author information

Authors and Affiliations

Graduate School of Excellence advanced Manufacturing Engineering (GSaME), Universität Stuttgart, Stuttgart, Germany
Cornelia Kiefer & Peter Reimann
Institut für Parallele und Verteilte Systeme (IPVS), Universität Stuttgart, Stuttgart, Germany
Bernhard Mitschang

Corresponding author

Correspondence to Cornelia Kiefer.

About this article

Cite this article

Kiefer, C., Reimann, P. & Mitschang, B. QUALM: Ganzheitliche Messung und Verbesserung der Datenqualität in der Textanalyse. Datenbank Spektrum 19, 137–148 (2019). https://doi.org/10.1007/s13222-019-00318-7

Received: 13 February 2019
Accepted: 27 May 2019
Published: 06 June 2019
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s13222-019-00318-7
