DOI: 10.1145/3411764.3445145
research-article

CorpSum: Towards an Enabling Tool-Design for Language Researchers to Explore, Analyze and Visualize Corpora

Published: 07 May 2021

ABSTRACT

Linguists use annotated text collections to validate, refute, and refine hypotheses about written language. This research requires creating and analyzing complex queries, which is often beyond the technical expertise of the domain users. In this paper, we present a tool-design that enables language researchers to easily query annotated text corpora and conduct a comparative, multi-faceted analysis on a single screen. We document in detail the results of the iterative design process, including requirement analysis, multiple prototyping and user evaluation sessions, and expert reviews. Our tool, called CorpSum, shows a 43.12-point increase in the mean SUS score in a randomized within-subjects test and a 3.18-fold improvement in mean task completion time compared to a conventional solution. Two detailed case studies with linguists demonstrate a significant improvement in solving the real-world problems of the domain users.
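
For readers unfamiliar with the System Usability Scale (SUS) mentioned above, the following is a minimal sketch of the standard SUS scoring rule and of a mean-score difference between two tools. The function names and the sample responses in the usage note are hypothetical and are not taken from the study; the sketch only illustrates what a "point" on the 0 to 100 SUS scale means.

```python
from statistics import mean


def sus_score(responses):
    """Return the 0-100 SUS score for one participant.

    `responses` holds ten answers on a 1-5 Likert scale, in questionnaire order.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for index, answer in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribution is answer - 1.
            total += answer - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribution is 5 - answer.
            total += 5 - answer
    # The summed contributions (0-40) are scaled to the 0-100 range.
    return total * 2.5


def mean_sus_difference(tool_a, tool_b):
    """Difference in mean SUS score between two tools (hypothetical helper).

    Each argument is a list of per-participant response lists.
    """
    return mean(sus_score(r) for r in tool_a) - mean(sus_score(r) for r in tool_b)
```

As a usage example with made-up data, `mean_sus_difference([[5, 1] * 5], [[3] * 10])` compares one participant who gave the best possible answers for one tool against one who answered neutrally for another.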


Published in

CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021
10862 pages
ISBN: 9781450380966
DOI: 10.1145/3411764

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%
