Elsevier

Information Sciences

Volume 380, 20 February 2017, Pages 12-30

Generating automatic linguistic descriptions with big data

https://doi.org/10.1016/j.ins.2016.11.002

Highlights

  • Dealing with Big Data, we have identified and analyzed seven issues: (1) scalability, (2) efficient processing, (3) incomplete and inaccurate data, (4) specific domains, (5) relevance of information, (6) levels of detail, and (7) intuitive and effective knowledge representation.

  • We have developed a novel linguistic approach to Big Data that addresses these seven issues.

  • We have provided an implementation of the paradigm “Linguistic Descriptions of Complex Phenomena” by applying MapReduce.

  • We have taken advantage of Fuzzy Logic in order to manage incomplete and inaccurate Big Data.

Abstract

In a highly connected world, the volume and variety of data keep growing. The Big Data era opens new challenges to address. Dealing with Big Data, we have identified and analyzed seven issues: (1) scalability, (2) efficient processing, (3) incomplete and inaccurate data, (4) specific domains, (5) relevance of information, (6) levels of detail, and (7) intuitive and effective knowledge representation. The analysis reveals that five of these issues are related to knowledge representation and human perception. Linguistic Descriptions of Complex Phenomena is a technology aimed at computing and generating linguistic reports customized to the user's needs. In this paper, we present and describe an approach to Big Data based on this technology that addresses the seven issues under study. Namely, we generate linguistic reports from Big Data that fulfill the user requirements. To evaluate the generated linguistic reports, we propose specific evaluation criteria based on the maxims of Grice. We illustrate the usefulness of the proposed solution with a practical experiment based on the census data of the United States of America.

Introduction

Big Data poses new challenges [38] to Data Science [18]. Big Data is usually defined by three dimensions, or 3Vs: Volume, Velocity and Variety [31]. The Volume dimension involves processing large amounts of data. The Velocity dimension indicates that data are often generated at high speed and need to be processed in real or near-real time, in batch or as streams. The Variety dimension indicates that data come from different sources and have different formats; thus, they can be structured, unstructured or semi-structured. In [17], Demchenko introduced two more dimensions, namely Value and Veracity. The Value dimension refers to the added value (or new knowledge) that the collected data can bring to the intended process or activity. The Veracity dimension refers to the consistency, certainty, authenticity or reputation of the data, including the infrastructures and methods that operate with these data.

One of the main challenges in Data Science is to process Big Data in a meaningful manner [29], [38] by means of the so-called Knowledge Discovery Process (KDP) [13]. A KDP consists of a set of sequential steps to seek new knowledge in particular application domains, including the analysis, interpretation and representation of the extracted knowledge. Big Data scientists commonly analyze these data using adapted machine learning and data mining techniques [52]. To visualize the extracted knowledge, they use two complementary options. The most common option is so-called Advanced Data Visualization, which combines data analysis methods with interactive visualization of tables, charts or graphs [47]. The other option is Automatic Text Generation [42], which involves computer programs that automatically produce texts from input data.

The creation of knowledge-oriented systems to extract meaningful insights from Big Data is not a straightforward task. We have identified seven issues in dealing with Big Data, as follows [29], [30], [38]:

  • 1. Scalability: Considering the Volume dimension, Big Data techniques must rely on scalable algorithms, i.e., algorithms whose performance is independent of the volume of data. The definition of scalable algorithms is a mandatory requirement in parallel processing.

  • 2. Efficient processing: Considering the Volume and Velocity dimensions, Big Data techniques must process data efficiently. Time is a priority for many applications (e.g., real-time applications). Data scientists need to explore new technologies aimed at handling large amounts of data more efficiently.

  • 3. Incomplete and inaccurate data: Considering the Veracity dimension, Big Data techniques must manage the imprecision and incompleteness that are usual in input data. These techniques must also take care of data transformations in order to avoid loss of information.

  • 4. Specific domains: Considering the Variety and Value dimensions, Big Data techniques must model specific domains. For data to become information, they need to be put in context, i.e., data scientists have to process data in specific contexts in order to obtain valuable knowledge.

  • 5. Relevance of information: Considering the Value dimension, Big Data techniques must highlight relevant information and hide irrelevant information. There is a growing interest in developing interactive tools that facilitate the interpretation of knowledge for users. To achieve this goal, data scientists demand new techniques to select the most relevant facts for each specific user.

  • 6. Levels of detail: Considering the Value dimension, Big Data techniques must represent information at different levels of granularity. The highest level corresponds to the greatest abstraction and the lowest level to the most specific information. Data scientists look for new approaches to identify the appropriate level of abstraction to summarize and select information in accordance with the user's preferences.

  • 7. Intuitive and effective knowledge representation: Considering the Value dimension, Big Data techniques must offer data representation capabilities that assist users in interpreting the derived knowledge intuitively and effectively, for example, by allowing the user to configure the language or the type of graphics employed.

To sum up, the first two Big Data issues, “Scalability” and “Efficient processing”, are more related to computers, i.e., to data processing. In contrast, the other five issues are related to knowledge representation and human perception. Quoting Zadeh: “In its traditional sense, computing involves for the most part manipulation of numbers and symbols. By contrast, humans employ mostly words in computing and reasoning, arriving at conclusions expressed as words from premises expressed in a natural language or having the form of mental perceptions” [58]. Consequently, in this paper we focus on exploring a new type of KDP that analyzes, interprets and represents Big Data in a linguistic way.

To the best of our knowledge, there are no scientific publications on this topic. Nevertheless, we have found four companies that offer text generation applied to Big Data: Yseop, Arria Data2Text, Automated Insights and Narrative Science.

We have carefully evaluated their solutions, paying attention to the information they provide on their websites and in related patents (see Section 2). Table 1 summarizes the evaluation of these solutions according to how they fulfill the seven issues mentioned above. The symbol “√” means that the company successfully addresses the related issue, and the symbol “-” means that no information is available. To assess the last issue, 7 “Intuitive and effective knowledge representation”, the companies were evaluated according to two features: a) text generation based on advanced templates and b) user language configuration. The table shows that the patents keep hidden the techniques used to process Big Data. The available information (see Table 1) indicates that none of them thoroughly describes how they deal with the following issues: “Scalability”, “Efficient processing” and “Incomplete and inaccurate data”.

The main contribution of this paper is to describe a linguistic approach to Big Data that addresses all seven issues under study in Table 1. For this purpose, we have adapted our previously developed paradigm, called Linguistic Descriptions of Complex Phenomena (LDCP) [50], to deal with Big Data by applying the KDP model and the MapReduce paradigm [15].

In this paper, we apply fuzzy techniques to deal with the incompleteness and inaccuracy of data. Fuzzy Logic [55] allows us to express the uncertainty caused by the lack of precision and completeness of data. Other authors have already taken advantage of applying fuzzy techniques to Big Data. For example, Sara del Rio et al. [46] presented a new algorithm to deal with Big Data classification problems. The proposed algorithm is a linguistic fuzzy rule-based classification system that uses the MapReduce framework to learn and fuse fuzzy rule bases. Although we also use the MapReduce framework to deal with Big Data, our proposal is completely different. Here, we propose an architecture that uses the MapReduce framework to interpret Big Data and generate linguistic descriptions of complex phenomena.
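As a brief illustration of how Fuzzy Logic can absorb imprecision in raw values, the sketch below defines a trapezoidal membership function and an example linguistic label. The function name, label and breakpoints are our own illustrative choices, not taken from the paper.

```python
def trapezoidal(a, b, c, d):
    """Build a trapezoidal membership function with support [a, d] and core [b, c]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0          # outside the support: no membership
        if b <= x <= c:
            return 1.0          # inside the core: full membership
        if x < b:
            return (x - a) / (b - a)   # rising edge
        return (d - x) / (d - c)       # falling edge
    return mu

# Hypothetical linguistic label "middle-aged" over an age attribute
middle_aged = trapezoidal(30, 40, 55, 65)
print(middle_aged(25))  # 0.0: clearly outside the label
print(middle_aged(45))  # 1.0: fully middle-aged
print(middle_aged(35))  # 0.5: partial membership, expressing imprecision
```

A crisp threshold would force every age into one of two classes; the gradual edges let borderline values contribute proportionally to later aggregations.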

It is important to consider that this paper does not deal with Natural Language understanding, but only with text generation. Note that our input data are not linguistic expressions; they are numerical values obtained from sensors or numbers retrieved from databases.

We illustrate the usefulness of the proposed approach with a practical experiment, the goal of which is the automatic generation of linguistic reports from the USA census for the years 2000 and 2010.

The rest of this paper is structured as follows. Section 2 provides the analysis of solutions and products on textual reports generation. Section 3 introduces the LDCP technology. Section 4 provides an overview of how to generate LDCP from Big Data. In Sections 5 and 6 we explain the details of our approach. Section 7 presents the experimentation. Section 8 summarizes conclusions and sketches future work.


Analysis of solutions and products on textual reports generation

In the literature, we have found two main ways to generate reports in natural language from datasets, namely Natural Language Generation and Linguistic Descriptions of Data. A thorough review of these two fields is presented in [40] by Ramos-Soto et al. Here, we focus on the most significant solutions and products for text generation in the context of Big Data.

Linguistic descriptions of complex phenomena

The architecture to generate LDCP (see Fig. 1) has three sequential steps: Data acquisition, Interpretation and Report generation. In a preliminary stage designers collect a corpus of natural language expressions that are typically used in the application domain to describe the relevant features of the analyzed phenomena. Then, they analyze the particular meaning of each linguistic expression in each specific situation and the user profiles to define the GLMP and the Report Template. At
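The three sequential steps above can be sketched as a minimal pipeline. All function names and the toy GLMP below are hypothetical placeholders of our own, since the paper does not publish an implementation.

```python
def acquire(source):
    # Data acquisition: read raw records from the input source
    return list(source)

def interpret(records, glmp):
    # Interpretation: apply the GLMP (here a plain callable) to the acquired data
    return glmp(records)

def generate_report(perceptions, template):
    # Report generation: instantiate the report template with computed perceptions
    return template.format(**perceptions)

# Hypothetical GLMP that maps a numeric series to one linguistic perception
glmp = lambda xs: {"trend": "increasing" if xs[-1] > xs[0] else "not increasing"}

report = generate_report(interpret(acquire([3, 5, 9]), glmp),
                         "The analyzed value is {trend}.")
print(report)  # The analyzed value is increasing.
```

In the actual architecture, the GLMP is a network of perception mappings and the template is designed per user profile; the sketch only shows how the three steps compose.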

Linguistic approach to big data

Fig. 3 shows the typical KDP model [38]. It comprises five stages:

  • Data Recording: This stage deals with storing and accessing the input data. It is in charge of obtaining the relevant data to analyze.

  • Data Cleaning/Integration/Representation: This stage uses pre-processing techniques to remove noise and inconsistent data, combines multiple sources and transforms data into an appropriate format for analysis.

  • Data Analysis: This stage uses efficient and scalable algorithms to search patterns or

GLMP Applied to big data

The technique for applying the GLMP to Big Data consists of executing the PMs of the GLMP in parallel using the MapReduce paradigm. Indeed, the PM aggregation function g is closely related to the concept of a reduce function. In order to implement a GLMP using the MapReduce paradigm, the map function must prepare the PM input U, and the reduce function must implement the PM aggregation function g, which processes the input U and generates the output y (see Section 3).
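Under this correspondence (map prepares the PM input U; reduce implements the aggregation function g), a single PM can be sketched with Python's built-in map/reduce. The membership function, data and quantifier thresholds below are hypothetical illustrations, not the paper's actual PMs.

```python
from functools import reduce

# map step: prepare the PM input U — the degree to which each record
# matches a fuzzy label (hypothetical membership function for "high income")
def mu_high(income):
    return min(1.0, max(0.0, (income - 30000) / 40000))

incomes = [20000, 50000, 80000, 100000]
U = list(map(mu_high, incomes))          # [0.0, 0.5, 1.0, 1.0]

# reduce step: aggregation function g — here the relative sigma-count,
# i.e., the proportion of the population that is "high income"
g = lambda acc, u: acc + u
sigma_count = reduce(g, U, 0.0) / len(U)

# output y: a simple linguistic quantification of the aggregated value
y = "most" if sigma_count > 0.7 else "about half" if sigma_count > 0.4 else "few"
print(sigma_count, y)
```

On a cluster, the map and reduce steps would run as MapReduce jobs over partitions of the data; associativity of g is what makes the parallel reduction valid.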

With the aim of defining

Report generation applied to big data

In Section 3.4 we remarked that a good report should include only the most relevant information, with the appropriate level of detail for each specific user. Here we use the MapReduce paradigm to select and generate the final report that best fulfills these features, among the large number of available candidates.
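The selection step can likewise be phrased as a reduce over candidate reports, keeping the one with the highest relevance score. The candidate texts and scores below are hypothetical stand-ins for the Grice-based evaluation criteria.

```python
from functools import reduce

# Hypothetical candidate reports, each paired with a relevance score that a
# Grice-inspired evaluation would assign (quantity, quality, relation, manner)
candidates = [
    ("The population changed.", 0.4),
    ("The population of the selected region increased notably.", 0.9),
    ("Data exist.", 0.1),
]

# reduce keeps the best-scoring report seen so far, so the selection
# parallelizes over partitions of the candidate set
best = reduce(lambda a, b: a if a[1] >= b[1] else b, candidates)
print(best[0])
```

Because the pairwise "keep the better one" operation is associative, partial winners from different partitions can themselves be reduced, matching the MapReduce setting described above.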

Experimental analysis

We have validated our proposal with an experiment on generating relevant linguistic reports that cover specific aspects of the US census, requested through a user query. This is just one illustrative example; many other interesting ones can be found.

The US Census Bureau publishes handmade summaries as well as their data sources obtained from surveys. Census data include items about population such as sex, age, race, household relationship, household type

Conclusions and future research

This paper is devoted to an important area of Big Data analysis: generating human-readable reports. In particular, it allows meaningful communication between data scientists and subject-domain specialists.

We have proposed an architecture able to generate linguistic descriptions of Big Data. It fulfills a list of seven relevant issues. The architecture uses scalable techniques to solve the Data Analysis and Data Visualization/Interpretation steps. It allows building efficient computational models

Acknowledgments

This work was supported in part by the Spanish Ministry of Science and Innovation under Grant FPI-MICINN BES-2012-057427, and the Spanish Ministry of Economy and Competitiveness under Grants TIN2014-56633-C3-3-R and TIN2014-56633-C3-1-R and TIN-2014-56967-R.

References (60)

  • D. Sanchez-Valdes et al.

    Dynamic linguistic descriptions of time series applied to self-track the physical activity

    Fuzzy Sets Syst.

    (2016)
  • G. Trivino et al.

    Towards linguistic descriptions of phenomena

    Int. J. Approx. Reason.

    (2013)
  • R. Turner et al.

    Generating spatio-temporal descriptions in pollen forecasts

    Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters and Demonstrations

    (2006)
  • R.R. Yager

    Fuzzy summaries in database mining

    Proc. 11th Conf. on Artificial Intelligent for Applications

    (1995)
  • L.A. Zadeh

    The concept of a linguistic variable and its application to approximate reasoning-i

    Inf. Sci.

    (1975)
  • L.A. Zadeh

    A computational approach to fuzzy quantifiers in natural languages

    Comput. Math. Appl.

    (1983)
  • L.A. Zadeh

    From computing with numbers to computing with words. From manipulation of measurements to manipulation of perceptions

    IEEE Trans. Circ. Syst. I

    (1999)
  • L.A. Zadeh

    Toward human level machine intelligence - is it achievable? The need for a paradigm shift

    IEEE Comput. Intell. Mag.

    (2008)
  • N.D. Allen et al.

    StatsMonkey: a data-driven sports narrative writer

    AAAI Fall Symposium: Computational Models of Narrative

    (2010)
  • R.C. Allen

    Systems for Dynamically Generating and Presenting Narrative Content

    (2013)
  • A. Bargiela et al.

    Granular Computing: An Introduction

    (2003)
  • T. Bethem et al.

    Generation of real-time narrative summaries for real-time water levels and meteorological observations in PORTS®

    Fourth Conference on Artificial Intelligence Applications to Environmental Science

    (2005)
  • L.A. Birnbaum et al.

    System and Method for Using Data to Automatically Generate a Narrative Story

    (2014)
  • R. Brachman et al.

    Knowledge Representation and Reasoning

    (2004)
  • D.E. Caldwell et al.

    Bilingual generation of job descriptions from quasi-conceptual forms

    Proc. 4th Conf. on Applied Natural Language Processing

    (1994)
  • T. Calvo et al.

    Aggregation Operators: New Trends and Applications

    (2012)
  • R. Castillo-Ortega et al.

    Time series comparison using linguistic fuzzy techniques

    Proc. 13th Int. Conf. on Information Processing and Management Uncertainty

    (2010)
  • R. Castillo-Ortega et al.

    A fuzzy approach to the linguistic summarization of time series

    J. Multiple-Valued Logic Soft Comput.

    (2011)
  • R. Castillo-Ortega et al.

    Linguistic query answering on data cubes with time dimension

    Int. J. Intell. Syst.

    (2011)
  • K.J. Cios et al.

    Data Mining: A Knowledge Discovery Approach

    (2007)