1 Introduction

A Programming Language (PL) is a formal constructed language used to create a program, a list of instructions, to perform a task.

Although a PL specifies a notation (Aaby 1996) to write programs, these are often written with a combination of mathematical and everyday language characters, words and phrases.

According to World Language Statistics (SIL International 2015), English is the 3rd most spoken language in the world, with 5.43 % of speakers, behind Chinese (Footnote 1) and Spanish, with 14.4 % and 6.15 %, respectively. Nonetheless, a survey of the Syntax, Semantics, Standard Library and Runtime System of the most used PLs (TIOBE Software BV 2015) indicates that the most popular ones are all English-based.

Although Non-English-based PLs exist (Wikipedia 2015), currently the most used have syntax, learning resources, Runtime, and Development Environments that are developed with an English-speaking audience in mind.

Hypothetically, in a universe of more than 7 billion people, to make use of the speed and computational capacity of machines to solve problems, approximately 94 % of the people (100 % minus the 5.43 % of English speakers) would have to express their instructions to the computer in English, even though they do not speak it as a native language.

Software Engineering is a fast-changing and evolving field. Thus, it is a challenge to translate and distribute learning material in languages other than English while keeping pace with technological development. As a result, a Software Engineering student who is not a native English speaker is often effectively an English Language Learner (ELL), since the learning process makes use of material and tools that are in English, regardless of whether the medium of instruction is English or not.

Three factors together create a Knowledge barrier, or Knowledge Divide, for ELLs in Software Engineering: the discrepancy between English not being the most spoken Natural Language yet being the basis of the most popular PLs; the inability of ELLs to use their native languages; and the constraint of being taught in one language while practicing the concepts (programming) in a different language altogether.

To keep pace with innovations and generate ideas, people need to be able to produce and manage knowledge. However, the increase in the 21st century of access to information has resulted in an uneven overall ability to assimilate it.

Knowledge Divide is a term that denotes the differences between those who have access to knowledge and can assimilate it, participating in knowledge-sharing and using it as a tool for development, and others who are impaired in this process (Bindé and Matsuura 2005).

A Knowledge Divide in Software Engineering is constituted by the differences in Software Engineering-related knowledge assimilation capabilities between native English-speakers and ELLs, due to the English-language barrier.

ELLs need to develop English language and literacy skills in the context of the subjects being taught to keep up with English-speaking students (Lee 2005). However, the linguistic knowledge that students already possess is often not taken into consideration (Janzen 2008).

By allowing students to employ their existing language skills, the Knowledge Divide can be decreased.

Thus, this paper proposes a methodology to bridge the aforementioned Knowledge Divide.

2 Data Collection

During April 2015, a survey was conducted among 78 students of the University College of Engineering, Osmania University. The students were from different streams of Engineering but had common introductory Programming courses in C and C++.

The sample was split into two groups, of 34 and 44 students, to have a representative sample. The questions presented to the students were intended to study the following factors:

  • Perceived importance of comments in source code

  • Perceived importance and difficulty in understanding source code written in a native language.

3 Results

Students found a program without comments easier to understand, but when presented with a choice they favored the version with comments. When asked about the importance they attributed to comments, the majority (54 %) of the students were neutral. This inconsistency might suggest that comments are under-used, despite being of considerable importance in reading and understanding source code (Fig. 1).

Fig. 1. Perceived difficulty in understanding a program’s source code

Regarding the difficulty and importance of using native languages in source code, a similar scenario was observed.

Of the students, 58 % found it difficult to understand a program written in their mother tongue. Although a small portion would prefer reading a program written in their native language, 45 % were neutral when asked about its importance (Fig. 2).

Fig. 2. Perceived importance of a native-language in understanding source code

It is this considerably undecided portion of the sample that led to the following questions being raised regarding students’ perceptions:

  • Are students aware of the resources available to them?

  • Are the resources being presented and contextualized to suit the students’ learning process?

  • What determines the outcome of the learning process in Software Development: students’ usage of their existing resources or their ability to adapt to the already established required resources?

With these questions in mind, we set out to construct a learning and practice model that highlights the importance of using students’ existing resources.

4 Multilingual vs. NLN Programming Languages

On the one hand, Multilingual PLs, also called International PLs, allow more than one Natural Language to be used for writing programs. Examples are ALGOL 68 (van Wijngaarden et al. 1969) and BabylScript (Iu 2011).

ALGOL 68 is the 1968 version of the Algorithmic Language. It is an imperative PL, which succeeds ALGOL 60, and provides translations of its Standard in Russian, German, French, Bulgarian, Japanese and Chinese. The translations allow the internationalization of the PL.

BabylScript is an open-source, multilingual scripting language that compiles to JavaScript. It is implemented in the Java PL, by modifying the open-source Mozilla Rhino JavaScript engine. BabylScript has different language modes in which keywords, objects, and function names are translated into non-English languages. This feature allows programmers to write programs in languages other than English. BabylScript also allows a mixed language model, in which the same source code can contain code written in more than one language.

At the time of writing, BabylScript has 17 language translations including Chinese, Hindi, Swahili, Spanish and Russian.

Although Multilingual PLs reduce the initial language barrier, being Natural-Language isolated threatens their development and adoption. A larger audience can be engaged, but ultimately only speakers of the same language can collaborate.

So far, the approaches used to create Multilingual PLs have not been standardized, and no single approach for enabling the feature in different PLs, existing or newly created, has been identified.

On the other hand, Natural-Language Neutrality (NLN) is an approach that intends to provide tools and methodologies to allow speakers of different Natural Languages to learn, practice and collaborate in an environment that is Natural-Language-agnostic.

By allowing learners with different Native languages to interact in a unified platform, the Single Natural Language (English) knowledge requirement can be reduced.

The English-language is still required in Software Engineering. However, re-establishing a balance between its usage as a Lingua franca and native languages is desirable, recognizing the existing linguistic diversity (Bindé and Matsuura 2005).

NLN can be integrated into an already existing or newly created PL, taking advantage of the most used English-based ones (Fig. 3).

Fig. 3. Natural-Language Neutrality model

At the core of the NLN approach is a Natural-Language Translation mechanism. The required translation is only of the PL’s keywords, not of the complete source code.

Hence, we devised a Translation mechanism that can be further exploited.

5 Tools

The proposed tools address source code keywords, comments and collaboration between programmers.

Although each tool is designed to iterate over the elements of Bloom’s Taxonomy Cognitive Dimension (Bloom et al. 1956), they mainly intend to stimulate the Affective Dimension elements in students, through the inclusion of their existing linguistic knowledge in the problem-solving process.

5.1 Glotter: A Compiler-Level Natural-Language Neutrality Enabler

A Glotter is a Lexical Analysis tool that converts the source code Lexical Units (tokens) from a Source to a Target Natural-Language.

A Source Natural Language can be any existing Natural Language while the Target Natural-Language is a predefined Bridge Natural Language, a Lingua Franca, which will enable all other Source Natural Languages to be translated to and from it.

The name is derived from the root glot (from the Greek glōtta, meaning tongue or language) and the English word enabler. Therefore, a Glotter is a Language Enabler.

The Glotter receives a list of tokens, a list of Language Dictionaries and a selected Natural-Language, which serves as the context for the translation.

Integrating it into a compiler allows different (translated) versions of the same keywords to be compiled into a single version. This process ultimately enables a single PL to be used with various Natural Languages, while maintaining all of its syntactic and semantic structure and rules.

Upon processing, if the keyword is present in the selected Language Dictionary, its value is substituted by the matching value. Otherwise, it is left intact. Although it is possible to implement error reporting upon detection of a keyword that has no version in the selected Language, this feature might exceed the responsibilities of a Lexical Analyzer. Furthermore, this error can be reported by the Syntax Analyzer.
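The substitution step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Token` class, the `"word"` token type, the `lookup` helper and the dictionary entries are all hypothetical, and English is assumed as the Bridge Natural Language.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Token:
    """A lexical unit with the two properties the text assumes: type and value."""
    type: str
    value: str


# Hypothetical Language Dictionaries: localized keyword -> bridge-language
# (here English) keyword. The entries are illustrative, not from the paper.
DICTIONARIES = {
    "pt": {"se": "if", "senao": "else", "enquanto": "while", "retorna": "return"},
    "es": {"si": "if", "sino": "else", "mientras": "while", "devolver": "return"},
}


def lookup(dictionaries: dict, language: str, word: str) -> Optional[str]:
    """Return the matching keyword, or None when the entry does not exist."""
    return dictionaries.get(language, {}).get(word)


def glotter(tokens: list, dictionaries: dict, language: str) -> list:
    """Translate word tokens found in the selected dictionary; keep the rest."""
    result = []
    for token in tokens:
        found = None
        if token.type == "word":  # assumed type name for identifier-like tokens
            found = lookup(dictionaries, language, token.value)
        result.append(token if found is None else replace(token, value=found))
    return result


tokens = [Token("word", "se"), Token("word", "x"), Token("op", ">"), Token("number", "0")]
print([t.value for t in glotter(tokens, DICTIONARIES, "pt")])  # ['if', 'x', '>', '0']
```

Tokens without a dictionary entry pass through unchanged, so any unknown keyword is left for the Syntax Analyzer to report, in line with the division of responsibilities described above.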

Implementation Methods

Embedded.

By integrating the Glotter into the compiler, tokens can be translated one at a time. This approach is more flexible and has no performance impact on the normal working of the compiler. However, an Embedded Glotter requires modifying the compiler source code of an existing PL, which is a disadvantage when seamless integration is expected (Fig. 4).

Fig. 4. Embedded Glotter Implementation

Standalone.

In this alternative method, the Glotter is separated from the Compiler. The complete source code (input) is parsed by the Glotter, in a process that involves a Lexical Analysis (Tokenization) of the given code. Therefore, the source code is Tokenized twice: once by the Glotter and once by the Compiler. This process requires no modification of an existing compiler’s source code, which is an advantage when enabling NLN in already existing PLs (Fig. 5).

Fig. 5. Standalone Glotter implementation
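A Standalone Glotter can be approximated as a source-to-source filter that runs before an unmodified compiler. The sketch below is only an illustration under assumptions: it targets a C-like syntax, the `PT_TO_EN` dictionary and the `standalone_glotter` name are hypothetical, and the tokenizer is deliberately much smaller than a real lexer.

```python
import re

# Illustrative dictionary (Portuguese -> English keywords); entries are assumptions.
PT_TO_EN = {"se": "if", "senao": "else", "enquanto": "while", "retorna": "return"}

# A small tokenizer for a C-like language. String literals and comments are
# matched first so that words inside them are never translated.
TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*")      # double-quoted string literal
  | (?P<comment>//[^\n]*|/\*.*?\*/)    # line or block comment
  | (?P<number>\d[\w.]*)               # numeric literal
  | (?P<word>[^\W\d]\w*)               # identifier or keyword (Unicode-aware)
  | (?P<other>\s+|.)                   # whitespace and punctuation
    """,
    re.VERBOSE | re.DOTALL,
)


def standalone_glotter(source: str, dictionary: dict) -> str:
    """Return the source with dictionary keywords replaced, all else unchanged."""
    pieces = []
    for match in TOKEN_RE.finditer(source):
        text = match.group()
        if match.lastgroup == "word":
            # Words without a dictionary entry are left intact.
            text = dictionary.get(text, text)
        pieces.append(text)
    return "".join(pieces)


program = 'se (x > 0) { retorna "se"; } // se positivo'
print(standalone_glotter(program, PT_TO_EN))
# if (x > 0) { return "se"; } // se positivo
```

The output is ordinary source text in the Bridge Natural Language, which the existing compiler then tokenizes a second time, as noted above.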

Note:

It is assumed that the List of Lexical Units consists of objects with at least type and value properties, and that null is returned when an entry or entry-value is not found in a dictionary.

5.2 Glotation: Natural-Language Annotated Comments

A Glotation is a special kind of comment that includes a source Natural Language attribute and the comment message.

The name is derived from the root Glot (from the Greek glōtta, meaning tongue or language) and the English word Annotation, that is, metadata attached to text (in this case, attached to source code).

The source language attribute can later be used to translate the comment message to a different Natural Language.

Syntax.

  • @xx message

Where @ is a Symbol that denotes a Glotation, xx is a two-letter lowercase ISO 639-1 Language code (Footnote 2) and message is the Comment message or text.

Example:

  • @en This is a Glotation

  • @pt Esta é uma Glotação

  • @fr Ceci est un Glotation

The example above creates Glotations in English, Portuguese and French with the equivalents of “This is a Glotation”. Each time a user accesses the source code, an option to translate the Glotations (Glotate) can perform the translations, provided the user specifies the Environment (target) Language. Therefore, although the comments can be written in different languages, a user can choose to view all comments in his/her context Natural Language.

The @ symbol is desirable since its usage is not common among the most used PLs. Therefore, it is possible to avoid confusion between a general comment and a Glotation.

A Glotation translation can be achieved using a Third Party translation service, which might require an internet connection.

To implement Glotations, the rules of the Syntax Analyzer (Parser) should be modified. The rules should detect a Glotation by the symbol @ and build an Abstract Syntax (Footnote 3) node with the following properties:

  • type: “Glotation”

  • language: two-letter language code (content immediately following the @ symbol)

  • value: the message text (separated from the language code by a whitespace).

Therefore, the rules for a well-formed Glotation can be deduced as:

  1. Starts with the @ symbol

  2. Has no space between the symbol and the following text

  3. The text immediately following the @ symbol consists of a two-character string

  4. Immediately following the two-character string, there is a whitespace

  5. After the whitespace follows the comment message with alphanumeric and special characters, including whitespace.
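The five rules above can be expressed as a single regular expression. The following Python sketch is an assumption-laden illustration rather than the authors' parser: `parse_glotation` is a hypothetical name, the input is assumed to be the comment body with the host language's comment delimiters already stripped, and only the shape of the language code is checked, not its membership in ISO 639-1.

```python
import re
from typing import Optional

# Mirrors the rules: '@', immediately followed by two lowercase letters,
# exactly one whitespace character, then a non-empty message.
GLOTATION_RE = re.compile(r"@([a-z]{2})\s(.+)", re.DOTALL)


def parse_glotation(comment_text: str) -> Optional[dict]:
    """Build a Glotation node, or return None for a general comment."""
    match = GLOTATION_RE.fullmatch(comment_text)
    if match is None:
        return None
    language, message = match.groups()
    return {"type": "Glotation", "language": language, "value": message}


print(parse_glotation("@pt Esta é uma Glotação"))
print(parse_glotation("@ pt space after the symbol"))  # None: violates rule 2
```

Returning None for text that does not match keeps general comments distinguishable from Glotations, which is the reason given above for choosing the @ symbol.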

A message should only be translated if the Glotation language is different from the Language currently being used in the Development Environment by the user. Therefore, there should be a mechanism to obtain the Development Environment language.
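The translation condition can be sketched as below. Everything named here is hypothetical: `glotate` is an illustrative function, the Development Environment language is simply passed in as a parameter (how an editor exposes it is left open), and `fake_translate` is a stand-in for a third-party translation service, not a real API.

```python
from typing import Callable


def glotate(node: dict, environment_language: str,
            translate: Callable[[str, str, str], str]) -> dict:
    """Translate a Glotation node only when its language differs."""
    if node["language"] == environment_language:
        return node  # already in the user's language: nothing to do
    message = translate(node["value"], node["language"], environment_language)
    # A new node is returned; the stored Glotation is left untouched.
    return {**node, "language": environment_language, "value": message}


def fake_translate(text: str, source: str, target: str) -> str:
    """Stand-in for a translation service; a real one may need a network call."""
    table = {("pt", "en", "Esta é uma Glotação"): "This is a Glotation"}
    return table.get((source, target, text), text)


node = {"type": "Glotation", "language": "pt", "value": "Esta é uma Glotação"}
print(glotate(node, "en", fake_translate)["value"])  # This is a Glotation
print(glotate(node, "pt", fake_translate) is node)   # True
```

Passing the translation function as an argument keeps the check independent of any particular service, which may be useful when no internet connection is available.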

5.3 Natural-Language Neutrality Collaborative Model for Programming Languages

Using the Glotter and Glotations, a collaborative model can be implemented to allow dissimilar Natural Languages to be used in a programming environment. Such a model should employ a mechanism that allows a user to write a program with keywords and comments in his/her Natural Language, provided that this same program can be understood by a user with a different Natural Language.

Translation of keywords and comments can be achieved by the Glotter and Glotations, respectively, but the key factor lies in the data format being used when storing and exchanging the program among the users (Fig. 6).

Fig. 6. NLN collaborative environment workflow

Upon creation, the source code to be exchanged should preferably contain only Glotations, rather than only general comments or a mix of both. This can be automated in the source code editor by replacing general comments with Glotations, provided that the user has already permitted the functionality and chosen the environment Natural Language. Similarly, the source code should always be stored with the keywords in the target Natural Language.

Such a source code file, with Glotations and keywords in the target Natural Language, serves as the intermediate file format, the essence of the collaborative model.

When a different user receives this same source code, the process of contextualization can be performed by applying the Glotter and Glotation functionalities, jointly or separately.

The inverse process can then take place: the second author edits the source code file, stores it in the intermediate file format and sends it back to the first author.
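The exchange described in this section can be illustrated with a small round trip. The sketch is a simplification under stated assumptions: the function names and dictionaries are hypothetical, English is assumed as the target (bridge) language, only `//` line comments are recognized, string literals are not protected, and translating the Glotation messages themselves is omitted.

```python
import re

# Illustrative keyword dictionaries (localized -> bridge language, English).
TO_BRIDGE = {
    "pt": {"se": "if", "retorna": "return"},
    "es": {"si": "if", "devolver": "return"},
}

LINE_RE = re.compile(r"^(?P<code>.*?)(?://\s*(?P<comment>.*))?$")
WORD_RE = re.compile(r"[^\W\d]\w*")


def _rewrite(source: str, mapping: dict, tag_language: str = "") -> str:
    """Swap keywords in the code part; optionally tag general comments."""
    lines = []
    for line in source.splitlines():
        match = LINE_RE.match(line)
        code = WORD_RE.sub(lambda m: mapping.get(m.group(), m.group()), match["code"])
        comment = match["comment"]
        if tag_language and comment and not comment.startswith("@"):
            comment = f"@{tag_language} {comment}"  # general comment -> Glotation
        lines.append(code if comment is None else f"{code}// {comment}")
    return "\n".join(lines)


def to_intermediate(source: str, language: str) -> str:
    """Author's language -> intermediate format (bridge keywords, Glotations)."""
    return _rewrite(source, TO_BRIDGE[language], tag_language=language)


def contextualize(intermediate: str, language: str) -> str:
    """Intermediate format -> keywords in the reader's language."""
    from_bridge = {bridge: local for local, bridge in TO_BRIDGE[language].items()}
    return _rewrite(intermediate, from_bridge)


written_in_pt = "se (x > 0) { // verifica o sinal\n    retorna x;\n}"
stored = to_intermediate(written_in_pt, "pt")
print(stored)
print(contextualize(stored, "es"))
```

Because the second author's copy maps back to the same intermediate text, both users always exchange one canonical file, whatever language each of them edits in.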

6 Conclusion

Language plays a critical role in a student’s effective education. This process also depends on the teaching institutions taking into consideration the sociocultural aspects of the learners, such as their identity and experiences (Janzen 2008; Lee 2005).

Making the current trends and developments in the Software Engineering field available should be accompanied by processes, tools, and resources that enable, or at least ease, the assimilation of this knowledge by the underprivileged. This increase in literacy would benefit not only the disadvantaged but society as a whole, since more people would be brought to an acceptable level of literacy and employability, becoming active contributors in combating poverty.

Although Multilingual PLs exist, a standardized and methodological approach is required to explore the context of Bridging the Knowledge Divide in Software Engineering thoroughly. This can be accomplished through the proposed NLN approach.

Further research should be undertaken to understand its underlying factors, to provide quantitative as well as qualitative indicators of its effectiveness, and to incorporate new tools and methodologies to support it.