Nonuniform language in technical writing: Detection and correction

Published online by Cambridge University Press: 06 March 2020

Weibo Wang
Affiliation:
Faculty of Computer Science, Dalhousie University, Canada; Dash Hudson, Canada
Aminul Islam
Affiliation:
School of Computing and Informatics, University of Louisiana at Lafayette, USA
Abidalrahman Moh’d
Affiliation:
Department of Mathematics and Computer Science, Eastern Illinois University, USA
Axel J. Soto*
Affiliation:
Institute for Computer Science and Engineering, CONICET–UNS, Argentina; Department of Computer Science and Engineering, Universidad Nacional del Sur, Argentina
Evangelos E. Milios
Affiliation:
Faculty of Computer Science, Dalhousie University, Canada
*Corresponding author. E-mail: axel.soto@cs.uns.edu.ar

Abstract

Technical writing in professional environments, such as user manual authoring, requires uniform language. Nonuniform language refers to sentences in a technical document that are intended to have the same meaning within a similar context but use different words or a different writing style. Addressing this nonuniformity problem involves two tasks. The first task, which we name nonuniform language detection (NLD), is to detect such sentences. We propose an NLD method that uses similarity algorithms at the lexical, syntactic, semantic, and pragmatic levels; the resulting features are extracted and integrated by a machine learning classification method. The second task, which we name nonuniform language correction (NLC), is to decide which of the detected sentences is more appropriate for the context. To address this task, we propose an NLC method that combines contraction removal, near-synonym choice, and text readability comparison. We tested our methods on smartphone user manuals and compared them against state-of-the-art paraphrase detection methods (for NLD) and against expert annotators (for both NLD and NLC). The experiments demonstrate that the proposed methods achieve performance that matches that of expert annotators.
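As a rough illustration of the two tasks described in the abstract, the sketch below computes two lexical similarity features for a sentence pair (the kind of signal an NLD classifier could consume) and then picks the more readable of two sentences after contraction removal (one ingredient of NLC). It is not the authors' implementation. The contraction table, the function names, the choice of token-set Jaccard overlap and `difflib` sequence ratio as features, and the use of the Coleman-Liau index with a "lower grade wins" rule are all assumptions made for this example; the syntactic, semantic, and pragmatic features, the classifier, and near-synonym choice are omitted.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny contraction table (not taken from the article).
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "won't": "will not",
    "it's": "it is",
    "you're": "you are",
}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    def _expand(match):
        expanded = CONTRACTIONS[match.group(0).lower()]
        # Preserve a leading capital, e.g. "Don't" -> "Do not".
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _CONTRACTION_RE.sub(_expand, text)


def tokenize(text):
    """Lowercase word tokens; apostrophes are kept inside tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(a, b):
    """Token-set overlap: |A & B| / |A | B| (assumed lexical feature)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sequence_ratio(a, b):
    """Order-sensitive similarity of the two token sequences."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def coleman_liau(text):
    """Coleman-Liau index: 0.0588*L - 0.296*S - 15.8, where L and S are
    letters and sentences per 100 words. Lower means easier to read."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    letters = sum(ch.isalpha() for w in words for ch in w)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = 100.0 * letters / len(words)
    S = 100.0 * sentences / len(words)
    return 0.0588 * L - 0.296 * S - 15.8


def prefer_more_readable(a, b):
    """Assumed decision rule: expand contractions, then keep the sentence
    with the lower Coleman-Liau score (ties go to the first sentence)."""
    a, b = expand_contractions(a), expand_contractions(b)
    return a if coleman_liau(a) <= coleman_liau(b) else b
```

For a pair such as "Tap the icon to open the camera." and "Tap the icon to launch the camera.", `jaccard` and `sequence_ratio` both return high values, which is the pattern a detector would flag as a candidate nonuniform pair; `prefer_more_readable` then offers one simple way to rank the two alternatives.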

Type
Article
Copyright
© Cambridge University Press 2020

