CorpSum: Towards an Enabling Tool-Design for Language Researchers to Explore, Analyze and Visualize Corpora

Authors:
Asil Çetin, Austrian Academy of Sciences, Austria
Torsten Moeller, Faculty of Computer Science, University of Vienna, Austria
Thomas Torsney-Weir, Computer Science, Swansea University, United Kingdom
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, May 2021
Article No.: 637, Pages 1–12
https://doi.org/10.1145/3411764.3445145

Published: 07 May 2021

ABSTRACT

Linguists use annotated text collections to validate, refute, and refine hypotheses about written language. This research requires creating and analyzing complex queries, which is often beyond the technical expertise of the domain users. In this paper, we present a tool-design that enables language researchers to easily query annotated text corpora and conduct a comparative, multi-faceted analysis on a single screen. We document in detail the results of the iterative design process, including requirement analysis, multiple prototyping and user evaluation sessions, and expert reviews. In a randomized within-subjects test against a conventional solution, our tool, called CorpSum, shows a 43.12-point increase in the mean SUS score and a 3.18-fold improvement in mean task completion time. Two detailed case studies with linguists demonstrate a significant improvement in solving the real-world problems of the domain users.
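For readers unfamiliar with the two usability figures quoted in the abstract, the following is a minimal sketch of how such numbers are commonly derived. It uses the standard scoring rule for the ten-item System Usability Scale and assumes that the task-time improvement is a ratio of mean durations; the abstract does not state which calculation the authors used. The function names are hypothetical and the code is not taken from the paper.

```python
from statistics import mean


def sus_score(responses):
    """Score one 10-item SUS questionnaire (each answer on a 1-5 scale)."""
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs ten answers, each between 1 and 5")
    # Odd-numbered items (indices 0, 2, ...) are positively worded: answer - 1.
    # Even-numbered items (indices 1, 3, ...) are negatively worded: 5 - answer.
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)
        for i, r in enumerate(responses)
    ]
    # The raw sum lies in 0..40 and is scaled to the 0..100 SUS range.
    return sum(contributions) * 2.5


def mean_sus_difference(new_tool, baseline):
    """Difference in mean SUS score between two sets of questionnaires."""
    return mean(sus_score(r) for r in new_tool) - mean(sus_score(r) for r in baseline)


def speedup(baseline_seconds, new_tool_seconds):
    """Assumed definition: ratio of mean task durations (baseline / new tool)."""
    return mean(baseline_seconds) / mean(new_tool_seconds)
```

Under these assumptions, a result such as "a 43.12-point increase" would correspond to the output of `mean_sus_difference`, and "3.18-fold" to the output of `speedup`, each computed over the per-participant data of both conditions.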

Published in
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021, 10862 pages
ISBN: 9781450380966
DOI: 10.1145/3411764
General Chairs: Yoshifumi Kitamura (Tohoku University, Japan), Aaron Quigley (University of New South Wales, Australia)
Program Chairs: Katherine Isbister (University of California Santa Cruz, USA), Takeo Igarashi (The University of Tokyo, Japan)
Publications Chairs: Pernille Bjørn (University of Copenhagen, Denmark), Steven Drucker (Microsoft Research, USA)
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Prototyping/Implementation
Text/Speech/Language
Visual Design
Acceptance Rates
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%