Generating automatic linguistic descriptions with big data
Introduction
Big Data poses new challenges [38] to Data Science [18]. Big Data is usually defined by three dimensions, the 3Vs: Volume, Velocity and Variety [31]. The Volume dimension involves processing large amounts of data. The Velocity dimension indicates that data are often generated at high speed and need to be processed in real or near-real time, in batch or as streams. The Variety dimension indicates that data come from different sources in different formats, and can thus be structured, unstructured or semi-structured. In [17], Demchenko introduced two more dimensions, namely Value and Veracity. The Value dimension refers to the added value (or new knowledge) that the collected data can bring to the intended process or activity. The Veracity dimension refers to the consistency, certainty, authenticity or reputation of the data, including the infrastructures and methods that operate with these data.
One of the main challenges in Data Science is to process Big Data in a meaningful manner [29], [38] by means of the so-called Knowledge Discovery Process (KDP) [13]. A KDP consists of a set of sequential steps to seek new knowledge in particular application domains, including the analysis, interpretation and representation of the extracted knowledge. Big Data scientists commonly analyze these data using adapted machine learning and data mining techniques [52]. To visualize the extracted knowledge, they use two complementary options. The most common option is so-called Advanced Data Visualization, which combines data analysis methods with interactive visualization of tables, charts or graphs [47]. Another option is Automatic Text Generation [42], which involves computer programs that automatically produce texts from input data.
The creation of knowledge-oriented systems to extract meaningful insights from Big Data is not a straightforward task. We have identified seven issues that must be addressed when dealing with Big Data [29], [30], [38]:
- 1.
Scalability: Considering the Volume dimension, Big Data techniques must rely on scalable algorithms, i.e., algorithms whose performance is not degraded by the volume of data. Defining scalable algorithms is a mandatory requirement for parallel processing.
- 2.
Efficient processing: Considering the Volume and Velocity dimensions, Big Data techniques must process data efficiently. Time is a priority for many applications (e.g., real-time applications). Data scientists need to explore new technologies aimed at handling large amounts of data more efficiently.
- 3.
Incomplete and inaccurate data: Considering the Veracity dimension, Big Data techniques must manage the imprecision and incompleteness that are usual in input data. These techniques must also take care with data transformations in order to avoid loss of information.
- 4.
Specific domains: Considering the Variety and Value dimensions, Big Data techniques must model specific domains. For data to become information, data must be put in context, i.e., data scientists have to process data in specific contexts in order to obtain valuable knowledge.
- 5.
Relevance of information: Considering the Value dimension, Big Data techniques must highlight relevant information and hide irrelevant information. There is growing interest in developing interactive tools that help users interpret knowledge. To achieve this goal, data scientists demand new techniques to select the most relevant facts for each specific user.
- 6.
Levels of detail: Considering the Value dimension, Big Data techniques must represent information at different levels of granularity. The highest level corresponds to the most abstract and the lowest level to the most specific information. Data scientists seek new approaches to identify the appropriate level of abstraction to summarize and select information in accordance with the user's preferences.
- 7.
Intuitive and effective knowledge representation: Considering the Value dimension, Big Data techniques must offer data representation capabilities that assist users in interpreting the derived knowledge intuitively and effectively, for example, by allowing the user to configure the language or the type of graphics presented.
To sum up, the first two Big Data issues, “Scalability” and “Efficient processing”, are more related to computers, i.e., data processing. In contrast, the other five issues are related to knowledge representation and human perception. Quoting Zadeh: “In its traditional sense, computing involves for the most part manipulation of numbers and symbols. By contrast, humans employ mostly words in computing and reasoning, arriving at conclusions expressed as words from premises expressed in a natural language or having the form of mental perceptions” [58]. Consequently, in this paper we focus on exploring a new type of KDP which analyzes, interprets and represents Big Data in a linguistic way.
To the best of our knowledge, there are no scientific publications on this topic. Nevertheless, we have found four companies that offer text generation applied to Big Data: Yseop, Arria Data2Text, Automated Insights and Narrative Science.
We have carefully evaluated their solutions, paying attention to the information they provide on their websites and in related patents (see Section 2). Table 1 summarizes the evaluation of these solutions in accordance with how they fulfill the seven issues mentioned above. The symbol “√” means that the company successfully addresses the related issue, and the symbol “-” means that no information is available. To assess the last issue (7, “Intuitive and effective knowledge representation”), the companies were evaluated according to two features: (a) text generation based on advanced templates and (b) user language configuration. The patents keep the techniques used to process Big Data hidden. The available information (see Table 1) indicates that none of the companies thoroughly describe how they deal with the following issues: “Scalability”, “Efficient processing” and “Incomplete and inaccurate data”.
The main contribution of this paper is to describe a linguistic approach to Big Data that fulfills all seven issues summarized in Table 1. For this purpose, we have adapted our previously developed paradigm, so-called Linguistic Descriptions of Complex Phenomena (LDCP) [50], to deal with Big Data by applying the KDP model and the MapReduce paradigm [15].
In this paper, we apply fuzzy techniques to deal with the incompleteness and inaccuracy of data. Fuzzy Logic [55] allows us to express the uncertainty caused by the lack of precision and completeness of data. Other authors have already taken advantage of applying fuzzy techniques to Big Data. For example, Sara del Rio et al. [46] presented a new algorithm to deal with Big Data classification problems. The proposed algorithm is a linguistic fuzzy rule-based classification system that uses the MapReduce framework to learn and fuse fuzzy rule bases. Although we also use the MapReduce framework to deal with Big Data, our proposal is completely different: here, we propose an architecture that uses the MapReduce framework to interpret Big Data and generate linguistic descriptions of complex phenomena.
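As a minimal illustration of how fuzzy sets can absorb the imprecision of numerical input data, the following sketch maps a numeric value onto degrees of membership in linguistic labels using trapezoidal membership functions. The labels and breakpoints are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function: 0 outside (a, d), 1 on [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical linguistic partition for a population-growth rate (%):
LABELS = {
    "low":    (-1.0, 0.0, 2.0, 4.0),
    "medium": (2.0, 4.0, 6.0, 8.0),
    "high":   (6.0, 8.0, 12.0, 13.0),
}

def fuzzify(x):
    """Return the membership degree of x in each linguistic label."""
    return {name: trapezoid(x, *params) for name, params in LABELS.items()}

print(fuzzify(3.0))  # 3.0 belongs partly to "low" and partly to "medium"
```

A crisp value near a label boundary thus receives partial membership in both neighboring labels, which is how the approach tolerates imprecise measurements instead of forcing a hard classification.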
It is important to note that this paper deals only with text generation, not with Natural Language understanding. Our input data are not linguistic expressions but numerical values obtained from sensors or from databases.
We illustrate the usefulness of the proposed approach with a practical experiment. The goal of this experiment is the automatic generation of linguistic reports from US census data for the years 2000 and 2010.
The rest of this paper is structured as follows. Section 2 provides the analysis of solutions and products on textual reports generation. Section 3 introduces the LDCP technology. Section 4 provides an overview of how to generate LDCP from Big Data. In Sections 5 and 6 we explain the details of our approach. Section 7 presents the experimentation. Section 8 summarizes conclusions and sketches future work.
Section snippets
Analysis of solutions and products on textual reports generation
In the literature, we have found two main approaches to generating natural language reports from datasets, namely Natural Language Generation and Linguistic Descriptions of Data. A thorough review of these two fields is presented in [40] by Ramos-Soto et al. Here, we focus on the most significant solutions and products for text generation in the context of Big Data.
Linguistic descriptions of complex phenomena
The architecture to generate LDCP (see Fig. 1) has three sequential steps: Data acquisition, Interpretation and Report generation. In a preliminary stage designers collect a corpus of natural language expressions that are typically used in the application domain to describe the relevant features of the analyzed phenomena. Then, they analyze the particular meaning of each linguistic expression in each specific situation and the user profiles to define the GLMP and the Report Template. At
Linguistic approach to big data
Fig. 3 shows the typical KDP model [38]. It comprises five stages:
Data Recording: This stage deals with storing and accessing the input data. It is in charge of obtaining the relevant data to analyze.
Data Cleaning/Integration/Representation: This stage uses pre-processing techniques to remove noise and inconsistent data, combines multiple sources and transforms data into an appropriate format for analysis.
Data Analysis: This stage uses efficient and scalable algorithms to search patterns or
GLMP Applied to big data
The technique for applying the GLMP to Big Data consists of executing the PMs of the GLMP in parallel using the MapReduce paradigm. Indeed, the PM aggregation function g is closely related to the concept of a reduce function. In order to implement a GLMP using the MapReduce paradigm, the map function must prepare the PM input U and the reduce function must implement the PM aggregation function g that processes the input U and generates the output y (see Section 3).
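The map/reduce split described above can be sketched in miniature with Python's built-in `map` and `functools.reduce`. The PM below (counting how many records satisfy a predicate and describing the proportion with a linguistic quantifier) is a hypothetical example; the predicate, thresholds and output sentences are illustrative assumptions, not the paper's actual PMs.

```python
from functools import reduce

def map_fn(record):
    """Map step: prepare the PM input U, here a pair (is_elderly, 1)."""
    return (1 if record["age"] >= 65 else 0, 1)

def g(acc, u):
    """Reduce step: the PM aggregation function g (pairwise sums)."""
    return (acc[0] + u[0], acc[1] + u[1])

def linguistic_output(matched, total):
    """Map the aggregated value onto a linguistic expression (output y)."""
    p = matched / total
    if p < 0.15:
        return "few inhabitants are elderly"
    if p < 0.35:
        return "some inhabitants are elderly"
    return "most inhabitants are elderly"

# Toy input: 8 records, 4 of which have age >= 65.
records = [{"age": a} for a in (12, 70, 45, 81, 30, 67, 22, 90)]
matched, total = reduce(g, map(map_fn, records), (0, 0))
print(linguistic_output(matched, total))
```

Because g is associative, the same reduce step can be executed hierarchically across partitions of the data, which is what makes the PM evaluation scalable under a real MapReduce framework.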
With the aim of defining
Report generation applied to big data
In Section 3.4 we remarked that a good report should include only the most relevant information, with the appropriate level of detail for each specific user. Here we use the MapReduce paradigm to select, among the large number of available candidates, the final report that best fulfills these features.
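Selecting the best report among many candidates also fits the reduce pattern: pairwise comparison by a relevance score is associative, so it can be distributed over partitions of candidates. The sketch below assumes candidates have already been scored against a user query; the texts and scores are purely illustrative.

```python
from functools import reduce

# Hypothetical candidate reports with precomputed relevance scores.
candidates = [
    {"text": "Population grew slightly between 2000 and 2010.", "score": 0.41},
    {"text": "Most of the growth is concentrated in urban areas.", "score": 0.78},
    {"text": "The share of elderly inhabitants increased.", "score": 0.66},
]

def keep_best(best, cand):
    """Reduce step: keep the candidate with the highest relevance score."""
    return cand if cand["score"] > best["score"] else best

best = reduce(keep_best, candidates)
print(best["text"])
```

Each worker can reduce its own partition of candidates to a local best, after which a final reduce over the local bests yields the report delivered to the user.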
Experimental analysis
We have validated our proposal with an experiment on generating relevant linguistic reports covering specific aspects of the US census, requested through a user query. This is just one illustrative example; many other interesting ones can be found.
The US Census Bureau publishes handmade summaries as well as their data sources obtained from surveys. Census data include items about population such as sex, age, race, household relationship, household type
Conclusions and future research
This paper is devoted to an important area of Big Data analysis: generating human-readable reports. In particular, it enables meaningful communication between data scientists and subject domain specialists.
We have proposed an architecture able to generate linguistic descriptions of Big Data. It fulfills a list of seven relevant issues. The architecture uses scalable techniques to solve the Data Analysis and Data Visualization/Interpretation steps. It allows building efficient computational models
Acknowledgments
This work was supported in part by the Spanish Ministry of Science and Innovation under Grant FPI-MICINN BES-2012-057427, and the Spanish Ministry of Economy and Competitiveness under Grants TIN2014-56633-C3-3-R and TIN2014-56633-C3-1-R and TIN-2014-56967-R.
References (60)
- et al., Linguistic description of the human gait quality, Eng. Appl. Artif. Intell. (2013)
- et al., Computational interpretations of the Gricean maxims in the generation of referring expressions, Cogn. Sci. (1995)
- et al., Fuzzy cardinality based evaluation of quantified sentences, Int. J. Approx. Reason. (2000)
- et al., Automatic linguistic reporting in driving simulation environments, Appl. Soft Comput. (2013)
- et al., Linguistic summarization of time series using a fuzzy quantifier driven aggregation, Fuzzy Sets Syst. (2008)
- et al., Combining semantic web technologies and computational theory of perceptions for text generation in financial analysis, Proc. 19th IEEE Int. Conf. on Fuzzy Systems (2010)
- et al., Automatic generation of textual summaries from neonatal intensive care data, Artif. Intell. (2009)
- et al., On the role of linguistic descriptions of data in the building of natural language generation systems, Fuzzy Sets Syst. (2016)
- et al., Linguistic descriptions for automatic generation of textual short-term weather forecasts on real prediction data, IEEE Trans. Fuzzy Syst. (2015)
- Method and Apparatus for Situational Analysis Text Generation (2014)