Nonuniform language in technical writing: Detection and correction

Weibo Wang; Aminul Islam; Abidalrahman Moh’d; Axel J. Soto; Evangelos E. Milios

doi:10.1017/S1351324920000133

Published online by Cambridge University Press: 06 March 2020

Affiliations

Weibo Wang: Faculty of Computer Science, Dalhousie University, Canada; Dash Hudson, Canada
Aminul Islam: School of Computing and Informatics, University of Louisiana at Lafayette, USA
Abidalrahman Moh’d: Department of Mathematics and Computer Science, Eastern Illinois University, USA
Axel J. Soto*: Institute for Computer Science and Engineering, CONICET–UNS, Argentina; Department of Computer Science and Engineering, Universidad Nacional del Sur, Argentina
Evangelos E. Milios: Faculty of Computer Science, Dalhousie University, Canada

*Corresponding author. E-mail: axel.soto@cs.uns.edu.ar

Abstract

Technical writing in professional environments, such as user manual authoring, requires the use of uniform language. Nonuniform language refers to sentences in a technical document that are intended to have the same meaning within a similar context, but use different words or writing style. Addressing this nonuniformity problem requires the performance of two tasks. The first task, which we named nonuniform language detection (NLD), is detecting such sentences. We propose an NLD method that utilizes different similarity algorithms at lexical, syntactic, semantic and pragmatic levels. Different features are extracted and integrated by applying a machine learning classification method. The second task, which we named nonuniform language correction (NLC), is deciding which sentence among the detected ones is more appropriate for that context. To address this problem, we propose an NLC method that combines contraction removal, near-synonym choice, and text readability comparison. We tested our methods using smartphone user manuals. We finally compared our methods against state-of-the-art methods in paraphrase detection (for NLD) and against expert annotators (for both NLD and NLC). The experiments demonstrate that the proposed methods achieve performance that matches expert annotators.
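As a rough illustration of the two tasks described above, the following Python sketch scores a sentence pair with a single lexical similarity feature and then chooses between two sentences by expanding contractions and comparing readability. It is not the authors' implementation. The contraction table, the function names, the use of token-level Jaccard overlap, and the rule "the lower Coleman–Liau index wins" are all assumptions made for this example; the near-synonym choice step and the machine learning classifier over lexical, syntactic, semantic and pragmatic features are omitted.

```python
import re

# Hypothetical, deliberately small contraction table (an assumption for this
# sketch). Ambiguous forms such as "it's" (it is / it has) are left out.
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "doesn't": "does not",
    "won't": "will not",
    "you're": "you are",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(sentence: str) -> str:
    """Replace each known contraction with its expanded form."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Keep a sentence-initial capital letter.
        return expanded.capitalize() if word[0].isupper() else expanded

    return _CONTRACTION_RE.sub(repl, sentence)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: one simple lexical feature a detector could use."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def coleman_liau(text: str) -> float:
    """Coleman-Liau index: 0.0588*L - 0.296*S - 15.8, where L is letters
    per 100 words and S is sentences per 100 words."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    letters = sum(len(w) for w in words)
    # Count runs of terminal punctuation; treat unpunctuated text as one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def pick_preferred(a: str, b: str) -> str:
    """Expand contractions in both candidates and return the one with the
    lower Coleman-Liau index (assumed here to be the more readable one).
    Ties go to the first candidate."""
    a_exp, b_exp = expand_contractions(a), expand_contractions(b)
    return a_exp if coleman_liau(a_exp) <= coleman_liau(b_exp) else b_exp
```

In a fuller pipeline, a feature such as `jaccard` would be one of many inputs to a trained classifier rather than being thresholded on its own, and the readability comparison would be applied only to sentence pairs that the classifier flags as nonuniform.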

Keywords

Technical language; Text similarity; Text simplification; Semantic similarity; Sentence similarity; Paraphrase detection; Text error detection; Text error correction

Type: Article
Information: Natural Language Engineering, Volume 27, Issue 3, May 2021, pp. 293–314

DOI: https://doi.org/10.1017/S1351324920000133
Copyright: © Cambridge University Press 2020
