Towards semantically linked multilingual corpus

https://doi.org/10.1016/j.ijinfomgt.2015.01.004

Highlights

  • We propose a framework for a semantically linked multilingual corpus.

  • We provide a solution for constructing a semantically linked multilingual corpus.

  • We establish metadata-based and content-based semantic links at different granularities.

  • We analyze multilingual applications of the semantically linked multilingual corpus.

  • We study the cognitive theoretical basis of semantic associations and semantic links.

Abstract

Multilingual information processing has gained increasing attention in recent years with the development of information globalization. The multilingual corpus poses a key challenge for multilingual information extraction, analysis, management and service in a wide range of systems. This work addresses the study and analysis of semantic associations among elements in a multilingual corpus. A solution is proposed to optimize the semantic organization of a multilingual corpus by linking the corpus elements into a semantic link network. This enhances text-based applications of multilingual corpora such as corpus linguistics study, dictionary search, machine translation and cross-lingual information retrieval.

Introduction

The rapid development of information and communication technologies enables information globalization. Multilingual information has entered our daily life and business. Users are often puzzled when they face multilingual information resources on the Web, because most of them are familiar with only one or two natural languages. Effective ways are needed to bridge the gap caused by different languages, and multilingual information processing has therefore been gaining attention in recent years.

Multilingual information processing includes organization, search, translation, management and analysis, where organization is the primary basis for the other services. An intuitive idea for organizing multilingual information is to weave the resources into a network with semantic associations. Semantic association is a concept from cognitive science: when a concept is mentioned, the other concepts that come to a person's mind are considered as “having semantic association with” the mentioned concept. Semantic associations among resources strongly influence information search, and finding relevant multilingual information resources is the basis of using multilingual information. Multilingual information resources and the semantic associations among them form a complex network, which is the research object of intelligence analysis. It is therefore important to establish semantic associations among multilingual information resources for management, search and analysis.

Semantic associations are ubiquitous in information systems, where links are used to reflect the semantic associations among multilingual information resources. For example, hyperlinks among web pages imply semantic associations among the information resources identified by URIs on the Web; in specific domains, semantic associations may be relationships such as refer and cite in the scientific literature. Some associations, such as similar and relevant, are hard to represent by specific relationships. Similar is a specific instance of relevant: if two things are similar, they must also be relevant; however, two things that are relevant may not be similar.
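The one-way implication between similar and relevant can be illustrated with a small sketch. This is not the paper's implementation; the `LinkNetwork` class, the `IMPLIES` rule table and the document identifiers are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical rule table: each link type maps to the more general types it implies.
# Only "similar implies relevant" comes from the text; the structure is an assumption.
IMPLIES = {"similar": {"relevant"}}


class LinkNetwork:
    """Toy store of typed links between resources (illustrative only)."""

    def __init__(self):
        # (source, target) -> set of link types holding between the two resources
        self.links = defaultdict(set)

    def add(self, source, target, link_type):
        types = self.links[(source, target)]
        types.add(link_type)
        # Adding a specific link also records the more general links it implies.
        types.update(IMPLIES.get(link_type, ()))

    def has(self, source, target, link_type):
        return link_type in self.links.get((source, target), set())


net = LinkNetwork()
net.add("doc_en_1", "doc_zh_7", "similar")   # also becomes "relevant"
net.add("doc_en_1", "doc_fr_3", "relevant")  # does not become "similar"
```

Here `net.has("doc_en_1", "doc_zh_7", "relevant")` holds, while `net.has("doc_en_1", "doc_fr_3", "similar")` does not, mirroring the asymmetry described above.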

Multilingual information services are based on the results of multilingual information processing. Machine translation (MT) and cross-language information retrieval (CLIR) are two typical multilingual information services. They share the same ultimate aim: semantically linking multilingual information so that users can access it easily without language barriers. MT aims to translate text into other natural languages, while CLIR aims to find the relevance between information resources in different natural languages. Users can therefore use CLIR to find relevant information and then use MT to translate the text for further reading and analysis.

Multilingual corpora are necessary for multilingual information processing systems such as MT and CLIR. In addition, multilingual corpora are important for contrastive linguistics (McEnery & Hardie, 2011). In linguistics, a corpus is a large and structured set of texts. Currently, corpora are stored and processed electronically, and they are widely used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. Corpora can be categorized from different perspectives:

  • Monolingual corpus and multilingual corpus. A corpus containing texts in a single language is called a monolingual corpus, while a corpus containing text data in multiple languages is called a multilingual corpus. Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora, on which statistical MT systems depend heavily.

  • Raw corpus and labeled corpus. A raw corpus contains plain text without any manual annotation, while a labeled corpus requires manual annotation of the collected texts. The manual annotation tasks mainly include word segmentation, part-of-speech tagging, syntactic annotation, semantic annotation and pragmatic/discourse annotation.

The Web has been regarded as a large-scale multilingual corpus (Liu & Curran, 2006). Massive multilingual information is emerging rapidly on the Web, in social networks (He, Zha, & Li, 2013) and in cloud computing (Yee, Chia, Tsai, Tiong, & Kanagasabai, 2011), and it is expected to increase with the extension of the Web to the Web of Everything (Jara et al., 2014). However, there are no explicit semantic links, or even explicit hyperlinks, among the multilingual information resources. Information resources are massive, but connections between them are sparse, so the resources form many information islands. This hinders efficient access to and use of multilingual information. To solve the problem of information islands, information resources urgently need to be connected by automatically establishing semantic links among them. Therefore, linking multilingual documents becomes one of the key challenges for the effective sharing and use of multilingual information.

There are two feasible ways to establish semantic links among information resources. One is to build semantic links according to the attributes of the resources, which are often contained in their metadata; the other is to define semantic links according to the meaning implied in their textual descriptions. Metadata can be used to link multilingual documents based on the values of attributes (Zhang et al., 2010; Zhuge & Zhang, 2011). An extensible semantic model has been proposed for information management, which organizes semantic data in an event-linked network where events and links can be extracted from raw data in the Internet of Things (Sun et al., 2014; Sun & Jara, 2014). However, many multilingual documents on the Web have no complete metadata. Consequently, text analysis based on natural language processing becomes the alternative way to establish semantic links from the full texts of documents. The processing objects of natural language processing include the word, phrase, sentence and text (paragraph, section or document).
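The first, metadata-based way can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the relation names sameAuthor and before and the attribute names author, creation_time and language are the examples the paper lists for attribute-based associations, while the records, the `attribute_links` function and the triple layout are hypothetical.

```python
from itertools import combinations

# Hypothetical metadata records; the values are invented for illustration.
docs = [
    {"id": "d1", "author": "Zhang", "language": "en", "creation_time": 2010},
    {"id": "d2", "author": "Zhang", "language": "zh", "creation_time": 2013},
    {"id": "d3", "author": "Sun", "language": "es", "creation_time": 2014},
]


def attribute_links(records):
    """Derive (source, relation, target) triples by comparing attribute values."""
    links = []
    for a, b in combinations(records, 2):
        # Equal author values yield a sameAuthor link, regardless of language.
        if a.get("author") and a.get("author") == b.get("author"):
            links.append((a["id"], "sameAuthor", b["id"]))
        # Ordered creation times yield a directed "before" link.
        ta, tb = a.get("creation_time"), b.get("creation_time")
        if ta is not None and tb is not None and ta != tb:
            first, second = (a, b) if ta < tb else (b, a)
            links.append((first["id"], "before", second["id"]))
    return links


links = attribute_links(docs)
```

Records with missing attributes simply produce no link for that attribute, which reflects the limitation noted above: incomplete metadata leaves documents unlinked unless their content is analyzed.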

Establishing semantic links between multilingual information resources depends on the analysis of words, phrases and sentences in the descriptive text. All information resources, such as documents, images, audio and video, can be described by text in natural languages. The establishment of attribute-based semantic associations has been discussed in previous work (Zhang et al., 2013), so this paper focuses on the content-based semantic associations among documents. Natural language processing is a necessary technology for processing the content of documents, especially for establishing semantic links based on text analysis.

In this paper, a semantically linked multilingual corpus in science and technology is designed with the following aims: (1) providing a multilingual corpus for natural language processing research, such as machine translation in science and technology; (2) providing support for applications based on the semantically linked corpus, such as multilingual information analysis and cross-lingual information retrieval. The main contributions include: (1) a framework for constructing a semantically linked multilingual corpus based on the metadata and content of texts; (2) a solution for constructing a semantically linked multilingual corpus; and (3) the feasible multilingual applications based on the semantically linked multilingual corpus.

Section snippets

Corpora for multilingual words and phrases

During the construction of a multilingual corpus, alignment is necessary for words, sentences and texts in different languages. Word alignment is the basis of phrase alignment, sentence alignment and text alignment. The alignment results of words or phrases form multilingual dictionaries.
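How word alignment over parallel sentences can yield a dictionary may be sketched with a simple co-occurrence score. The snippet below is an assumption-laden illustration, not the method used in the paper: it scores word pairs with the Dice coefficient, and the toy English-Italian sentence pairs (pre-tokenized and lemmatized) and the `dice_dictionary` function are hypothetical.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned pairs (hypothetical data, already tokenized and lemmatized).
parallel = [
    (["semantic", "link"], ["collegamento", "semantico"]),
    (["semantic", "network"], ["rete", "semantico"]),
    (["link", "network"], ["rete", "collegamento"]),
]


def dice_dictionary(pairs):
    """Map each source word to the target word with the highest Dice score."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)                # sentences containing each source word
        tgt_count.update(tgt_set)                # sentences containing each target word
        joint.update(product(src_set, tgt_set))  # sentence pairs containing both words
    best = {}
    for (s, t), together in joint.items():
        score = 2 * together / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


dictionary = dice_dictionary(parallel)
# dictionary == {"semantic": "semantico", "link": "collegamento", "network": "rete"}
```

On realistic data a single best match per word is too crude, since words are ambiguous and phrases span several tokens; the sketch only shows how aligned sentences can give rise to dictionary entries.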

Manually constructed bilingual dictionaries have been used in dictionary search and machine translation. Although statistics-based methods can align bilingual words based on parallel sentences

Semantic association analysis

Semantic associations among multilingual information resources are distributed in two layers: the organization layer and the application layer.

The organization layer includes the management of attribute-based semantic associations and content-based semantic associations.

  • Attribute-based semantic associations. Each information resource has its own attributes such as author, creation_time, language, file size and so on. According to the values of attributes, semantic associations such as sameAuthor, before,

Establishing semantic links

Alignment of multilingual words/phrases is the first step to link multilingual documents according to the content of documents. The meaning of a text is represented by the sentences. However, it is hard to find the same sentence in two different texts, so the alignment of texts is directly based on the alignments of words/phrases. After the alignment of multilingual words/phrases, semantic links among multilingual documents can be built, and some specific relations could be further established
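The step from aligned words/phrases to document links can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: terms are mapped to shared concept identifiers through a hypothetical aligned dictionary, overlap is measured with the Jaccard coefficient, and the two thresholds are arbitrary values chosen for the example.

```python
# Hypothetical aligned dictionary: (language, term) -> shared concept identifier.
CONCEPTS = {
    ("en", "corpus"): "C1", ("it", "corpus"): "C1",
    ("en", "translation"): "C2", ("it", "traduzione"): "C2",
    ("en", "retrieval"): "C3", ("it", "recupero"): "C3",
    ("en", "network"): "C4", ("it", "rete"): "C4",
}


def concepts(lang, terms):
    """Project a document's terms onto language-independent concept identifiers."""
    return {CONCEPTS[(lang, t)] for t in terms if (lang, t) in CONCEPTS}


def link_type(doc_a, doc_b, relevant_at=0.2, similar_at=0.6):
    """Return 'similar', 'relevant' or None from the Jaccard overlap of concepts."""
    a, b = concepts(*doc_a), concepts(*doc_b)
    union = a | b
    overlap = len(a & b) / len(union) if union else 0.0
    if overlap >= similar_at:
        return "similar"
    if overlap >= relevant_at:
        return "relevant"
    return None


doc_en = ("en", ["corpus", "translation", "retrieval"])
doc_it_1 = ("it", ["corpus", "traduzione", "recupero"])  # same three concepts
doc_it_2 = ("it", ["corpus", "rete"])                    # shares one concept of four
doc_it_3 = ("it", ["rete"])                              # shares none
```

With these inputs, `link_type(doc_en, doc_it_1)` gives "similar", `link_type(doc_en, doc_it_2)` gives "relevant", and `link_type(doc_en, doc_it_3)` gives `None`.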

Semantic associations among words and phrases

As mentioned above, a word is defined as the minimum element that represents a meaning in natural languages, and a phrase contains one or more words that represent a concept. A concept is reflected by a set of words and phrases. A phrase goes by different aliases, such as term, keyword, and controlled vocabulary (for example, a thesaurus in library and information science). Phrases are often used to index documents for information retrieval. Keywords and thesauri have been used to label the information resources

Conclusions and future work

Multilingual information processing has gained increasing attention in recent years. A multilingual corpus is important for multilingual information analysis, research and service. How a multilingual corpus is organized strongly influences its usage and the corpus-based applications. In this paper, we study the elements of a multilingual corpus and the semantic associations among these elements. We propose an approach to semantically link the elements into a semantic link network

Acknowledgments

This research work is partially supported by the International Science & Technology Cooperation Program of China under Grant No. 2014DFA11350; the National Natural Science Foundation of China (Grant Nos. 61371185 and 61171014); and ISTIC Research Foundation Projects (Grant Nos. XK2014-6 and ZD2014-3-4). The authors would like to thank the HES-SO and the Institute of Information Systems for funding and support. Finally, we would like to thank the European Project “In-Network Programmability for

Junsheng Zhang is an associate professor and the director of Language and Knowledge Technology Lab, Institute of Scientific and Technical Information of China. He received his PhD in Computer Science in 2009 from Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include information and knowledge management, natural language processing, mobile computing and cloud computing.

References (35)

  • W. He et al.

    Social media competitive analysis and text mining: A case study in the pizza industry

    International Journal of Information Management

    (2013)
  • H. Zhuge et al.

    The schema theory for semantic link network

    Future Generation Computer Systems

    (2010)
  • J. Aitchison et al.

    Thesaurus construction and use: A practical manual

    (2000)
  • P. Barnaghi et al.

    Semantics for the Internet of Things: Early progress and back to the future

    International Journal on Semantic Web and Information Systems (IJSWIS)

    (2012)
  • T. Berners-Lee et al.

    A framework for web science

    Foundations and Trends in Web Science

    (2006)
  • F. Chang et al.

    Bigtable: A distributed storage system for structured data

    ACM Transactions on Computer Systems (TOCS)

    (2008)
  • D. Chiang

    A hierarchical phrase-based model for statistical machine translation

  • D. Chiang

    Hierarchical phrase-based translation

    Computational linguistics

    (2007)
  • C.J. Crouch et al.

    Experiments in automatic statistical thesaurus construction

  • Z. Dong et al.

    HowNet and the computation of meaning

    (2006)
  • A. Eisele et al.

    MultiUN: A multilingual corpus from United Nation documents

  • T. Erjavec et al.

    The MULTEXT-East corpus

  • K. Gorman et al.

    The object-oriented entity-relationship model (OOERM)

    Journal of Management Information Systems

    (1990)
  • J. Han et al.

    Survey on NoSQL database

  • R. Hecht et al.

    NoSQL evaluation

  • A.J. Jara et al.

    Semantic Web of things: An analysis of the application semantics for the IoT moving towards the IoT convergence

    International Journal of Web and Grid Services

    (2014)
  • K.G. Jeffery

    The Internet of Things: The death of a traditional database?

    IETE Technical Review

    (2009)
    Yunchuan Sun is an associate professor and the director of the Lab for Economics and Business at Beijing Normal University, Beijing, China. He has served as Secretary of the IEEE Communication Society Emerging Technical Subcommittee of Internet of Things since January 2013. He is also an associate editor of the Springer journal Personal and Ubiquitous Computing. He received his PhD in Computer Science in 2009 from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include Internet of Things, Data Science, Event-Linked Network, Semantic Technology, and Business Model.

    Antonio J. Jara is an Assistant Professor and postdoctoral researcher at the University of Applied Sciences Western Switzerland (HES-SO), Switzerland, vice-chair of the IEEE Communications Society Internet of Things Technical Committee, and founder of the Wearable Computing and Personal Area Networks company HOP Ubiquitous S.L. He received his PhD (Cum Laude) from the Intelligent Systems and Telematics Research Group of the University of Murcia (UMU), Spain. He received two M.S. (Hons., valedictorian) degrees and also holds a Master in Business Administration (MBA). Since 2007, he has worked on several projects related to IPv6, WSNs and RFID applications in building automation and healthcare. He focuses especially on the design and development of new security and mobility protocols for the Future Internet of Things, which were the topic of his PhD. He continues to work on IPv6 technologies for the Internet of Things in projects such as IoT6, and on Big Data and Knowledge Engineering for Smart Cities and eHealth. He has published over 100 international papers and holds one patent. He participates in several projects on IPv6, the Internet of Things, Smart Cities, and mobile healthcare.
