1 Introduction

Big data, as the name implies, refers to amounts of information so large that database systems cannot store, compute, process, and analyze them into interpretable results within a reasonable time. Buried in these huge volumes of data is a wealth of information, such as correlations, unexplained patterns, and market trends, holding unprecedented knowledge and applications waiting to be discovered. However, traditional methods fail because the data volume is too large and it arrives too fast. This drives the continued development of a new generation of data storage devices and technologies, in the hope of extracting that valuable information from big data. Deep learning is the key to unlocking the big data era [1].

Deep learning is a sub-area of machine learning that simulates the human brain to analyze, interpret, and manipulate text, images [2], and voice [3]. It builds learning models with multiple hidden layers and trains them on massive data through supervised or unsupervised learning. After training, a model automatically learns features and improves classification or prediction accuracy. Deep learning models have two main advantages. The first is the ability to extract features: the resulting feature representation is more faithful to the original data, which greatly facilitates classification and visualization. The second is the ability to combine features: because deep learning uses complex network structures with many non-linear components, its feature-combination ability is very strong.

At the same time, the evolution of big data has spurred upgrades to hardware and software systems. Distributed architectures remove performance bottlenecks from algorithms, while parallel frameworks and training methods make deep learning faster and more efficient. Deep learning has transformed the mindset of problem solving and is the best partner for big data [4]. This study describes three typical deep learning models and their applications.

2 The Main Model of Deep Learning

Although the concept of deep learning [5] was introduced only recently, its algorithmic models have produced rich research achievements and have performed well in an era of rapid data growth. Deep learning algorithms often involve large-scale hidden layers and millions of parameters, allowing them to process vast amounts of data and fit complex models. This section presents three deep models: the Multilayer Perceptron, Convolutional Neural Networks, and Recurrent Neural Networks, along with algorithmic and model improvements for working with large-scale data.

2.1 Multilayer Perceptron

The multilayer perceptron [6, 7] is a feedforward artificial neural network model containing multiple neurons arranged in multiple layers. Nodes in adjacent layers are connected by edges whose weights are initialized randomly, usually with random numbers in [−0.5, 0.5]. The principle is to map a set of input vectors to a set of output vectors. The multilayer perceptron is a powerful tool in statistical analysis, pattern recognition, and optical character recognition (Fig. 1).

Fig. 1. Multilayer perceptron model

At every layer of the network, the activation of each neuron is computed as the sum, over all neurons in the previous layer connected to it, of the product of each neuron's output and the corresponding weight. An activation function is then used to normalize each neuron's output. For a K-layer multilayer perceptron, the mapping is expressed in matrix form as follows:

$$ y(x) = f_{K} ( \ldots f_{2} (w_{2}^{T} f_{1} (w_{1}^{T} x + b_{1} ) + b_{2} ) \ldots + b_{K - 1} ) $$
(1)
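For concreteness, the forward mapping of Eq. (1) can be sketched in a few lines of NumPy; the layer sizes and the tanh/identity activations below are illustrative assumptions, while the uniform initialization in [−0.5, 0.5] follows the text.

```python
import numpy as np

def mlp_forward(x, weights, biases, activations):
    """Forward pass of Eq. (1): y(x) = f_K(... f_2(w_2^T f_1(w_1^T x + b_1) + b_2) ...)."""
    a = x
    for w, b, f in zip(weights, biases, activations):
        a = f(w.T @ a + b)  # affine transform, then activation
    return a

# Toy 2-layer example; weights drawn uniformly from [-0.5, 0.5] as in the text
rng = np.random.default_rng(0)
weights = [rng.uniform(-0.5, 0.5, (4, 3)), rng.uniform(-0.5, 0.5, (3, 2))]
biases = [np.zeros(3), np.zeros(2)]
y = mlp_forward(rng.standard_normal(4), weights, biases, [np.tanh, lambda z: z])
```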

The backpropagation algorithm uses the delta rule to compute a local gradient for each neuron, working backwards from the output layer to the input layer. First, the error of each output neuron is obtained:

$$ e_{j} (n) = d_{j} (n) - o_{j} (n) $$
(2)

The local gradient of the \( j \)-th neuron in the output layer \( L \) is then calculated:

$$ \delta_{j}^{(L)} (n) = e_{j}^{(L)} (n)\,f^{\prime} (u_{j}^{(L)} (n)) $$
(3)

Finally, the local gradient of the \( j \)-th neuron in each hidden layer \( l \) is calculated:

$$ \delta_{j}^{(l)} (n) = f^{\prime} (u_{j}^{(l)} (n))\sum\nolimits_{k} {\delta_{k}^{(l + 1)} (n)\,w_{kj}^{(l + 1)} (n)} $$
(4)

After obtaining the deltas for all neurons, the weights are adjusted according to the following formula:

$$ w_{ij}^{(l)} (n + 1) = w_{ij}^{(l)} (n) + \alpha \,[w_{ij}^{(l)} (n) - w_{ij}^{(l)} (n - 1)] + \eta \,\delta_{j}^{(l)} (n)\,y_{i}^{(l - 1)} (n) $$
(5)

For layer \( l \), the new weight is the current weight plus a momentum term with coefficient \( \alpha \), and the learning rate \( \eta \) multiplied by layer \( l \)'s delta and the output of the neurons of the previous layer \( l - 1 \). The delta rule implements an approximation of gradient descent on the error sum. With a sufficiently small learning rate, the delta rule will find a set of weights that minimizes the error.
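A minimal NumPy sketch of one delta-rule update for an output layer, following Eqs. (2)-(5); the sigmoid activation (whose derivative at the output \( o \) is \( o(1 - o) \)) and the default values of α and η are assumptions for illustration.

```python
import numpy as np

def delta_rule_update(o, d, w, w_prev, y_prev, alpha=0.9, eta=0.1):
    """One weight update for an output layer, Eqs. (2)-(5).

    o: layer outputs o_j(n); d: targets d_j(n); w, w_prev: weights at
    steps n and n-1; y_prev: outputs y_i^(l-1)(n) of the previous layer.
    """
    e = d - o                          # Eq. (2): output error
    delta = e * o * (1.0 - o)          # Eq. (3), assuming a sigmoid activation
    momentum = alpha * (w - w_prev)    # momentum term of Eq. (5)
    w_new = w + momentum + eta * np.outer(y_prev, delta)
    return w_new, delta
```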

Research [8] shows that the multilayer perceptron neural network has the ability of parallel processing and self-learning. When the network has more than two hidden nodes, it can approximate nonlinear functions with arbitrary precision. Zen et al. [9] proposed a speech synthesis model based on multilayer perceptrons using the algorithm in [10]. The model represents the input text as an input feature sequence; each frame of the sequence is mapped through multiple perceptron layers to the corresponding output features to generate speech parameters, from which speech is finally synthesized with a vocoder. The training data consists of 33,000 segments of US English speech recorded by a female professional speaker. In tests on this large dataset, the model outperforms the Hidden Markov Model method.

Large-scale data inevitably leads to slow training. To alleviate this problem, Cheng et al. [11] proposed Learning-NEAT (LNEAT), a network training method for large-scale data classification problems that simplifies network evolution by splitting a problem into several subtasks. The subtasks are learned by applying Back Propagation rules within the NEAT algorithm. LNEAT combines the advantages of the NEAT algorithm and the BP algorithm in topology and weight search, and overcomes the problems caused by using the NEAT algorithm alone. The LNEAT algorithm has achieved satisfactory results in speech recognition and greatly improves the speed of network training.

2.2 Convolutional Neural Networks

The convolutional neural network [12] is a deep learning method derived from artificial neural networks. In recent years it has achieved great success in the field of image recognition. By using local connections and weight sharing, a convolutional neural network retains a deep network structure while greatly reducing the number of network parameters, so the model generalizes well and is easier to train. A convolutional neural network is mainly composed of four parts: convolutional layers, pooling layers, fully connected layers, and a loss function. Through the cascading of these layers, feature extraction and feature combination of image features are realized (Fig. 2).

Fig. 2. Convolutional neural network model

The convolutional layer is used for feature extraction; usually, multiple convolutional layers are stacked to obtain deeper feature maps. Low-level convolutional layers perform edge detection on the image, and higher-level convolutional layers take those feature maps as input for further feature extraction. Each convolutional layer contains a number of convolution kernels, with different kernels extracting different features. The general expression is:

$$ x_{j}^{l} = f\left(\sum\limits_{i \in M_{j} } {x_{i}^{l - 1} *k_{ij}^{l} } + b_{j}^{l} \right) $$
(6)
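A naive NumPy sketch of Eq. (6): the output map is the activated sum of valid 2-D cross-correlations of the selected input maps \( M_j \) with their kernels, plus a bias. The tanh activation is an assumption, and real implementations use vectorized or GPU kernels rather than these explicit loops.

```python
import numpy as np

def conv_map(x_maps, kernels, b, f=np.tanh):
    """Eq. (6): x_j^l = f(sum_{i in M_j} x_i^(l-1) * k_ij^l + b_j^l)."""
    kh, kw = kernels[0].shape
    oh = x_maps[0].shape[0] - kh + 1   # "valid" output height
    ow = x_maps[0].shape[1] - kw + 1   # "valid" output width
    out = np.full((oh, ow), float(b))
    for x, k in zip(x_maps, kernels):  # sum over the input maps in M_j
        for r in range(oh):
            for c in range(ow):
                out[r, c] += np.sum(x[r:r + kh, c:c + kw] * k)
    return f(out)
```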

The activation function \( f \) introduces non-linear factors to enhance the network's expressive ability. Commonly used activation functions are as follows:

$$ Sigmoid(x) = \frac{1}{1 + e^{ - x} } $$
(7)
$$ TanH(x) = \frac{e^{x} - e^{ - x} }{e^{x} + e^{ - x} } $$
(8)
$$ \mathrm{ReLU}(x) = \max (0,x) $$
(9)
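These three activation functions translate directly into NumPy (Eqs. (7)-(9), with the conventional negative exponent in the sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # Eq. (7)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # Eq. (8)

def relu(x):
    return np.maximum(0.0, x)                                    # Eq. (9)
```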

After features are obtained by the convolutional layer, they are aggregated by the pooling layer [13], which combines the features at different locations. Common pooling methods are max pooling and average pooling:

$$ O_{i,j,k} = \mathop {MAX}\limits_{0 \le x \le m,\,0 \le y \le m} (I_{im + x,jm + y,k} ) $$
(10)
$$ O_{i,j,k} = \mathop {AVE}\limits_{0 \le x \le m,\,0 \le y \le m} (I_{im + x,jm + y,k} ) $$
(11)
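Both pooling operators of Eqs. (10)-(11) can be sketched in NumPy by reshaping the input into non-overlapping \( m \times m \) blocks; the single-channel input and the non-overlapping stride are simplifying assumptions.

```python
import numpy as np

def pool2d(I, m, mode="max"):
    """Eqs. (10)-(11): aggregate each m-by-m window of I by max or average."""
    h, w = I.shape[0] // m, I.shape[1] // m
    blocks = I[:h * m, :w * m].reshape(h, m, w, m)   # non-overlapping windows
    agg = np.max if mode == "max" else np.mean
    return agg(agg(blocks, axis=3), axis=1)          # reduce within each window
```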

The fully connected layer acts as a "classifier" in the convolutional neural network [14]. If the convolutional, pooling, and activation layers map the original data to a hidden feature space, then the fully connected layer maps the learned "distributed feature representation" to the sample label space. In practice, a fully connected layer can be implemented by a convolution operation: a fully connected layer whose preceding layer is also fully connected can be transformed into a convolution with a \( 1 \times 1 \) kernel, while a fully connected layer whose preceding layer is convolutional can be transformed into a global convolution with an \( h \times w \) kernel, where \( h \) and \( w \) are respectively the height and width of the previous layer's convolution output. The fully connected layer is usually located at the top of the network.

The loss function continually compares the network output with the target and uses the backpropagation algorithm to adjust the parameters of the whole network, continuously optimizing the network structure so that the network develops in the expected direction. The Euclidean distance loss function [15] and the softmax loss function are commonly used. The Euclidean distance loss function is a basic loss function that aims to reduce the Euclidean distance between the system output and a given label. The softmax objective function can be written as:

$$ J(\theta ) = - \frac{1}{m}\left[\sum\limits_{i = 1}^{m} {(1 - y^{(i)} )\log (1 - h_{\theta } (x^{(i)} )) + y^{(i)} \log h_{\theta } (x^{(i)} )} \right] $$
(12)
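A direct NumPy transcription of the objective in Eq. (12) for binary labels (the clipping of predictions is an implementation safeguard, not part of the formula):

```python
import numpy as np

def cross_entropy(h, y, eps=1e-12):
    """Eq. (12): J = -(1/m) * sum[(1-y)log(1-h) + y log h] over m examples."""
    h = np.clip(h, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean((1.0 - y) * np.log(1.0 - h) + y * np.log(h))
```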

Using convolutional neural networks on large-scale data requires multi-core GPUs, and the number of threads required for training depends on the size of the chosen filters. Researchers at Microsoft Research Asia [16] used a network with a depth of up to 100 layers in the ImageNet challenge, winning with an error rate as low as 3.57%. This network has more than five times as many layers as any neural network successfully used before, and it achieved good results on large-scale images.

When the space required on the GPU exceeds the available memory, data must be copied to CPU memory, but the transfer rate between GPU and CPU is relatively slow. Satish et al. [17] modeled data transfer as integer linear programs and used improved simulated annealing / mixed integer linear programming algorithms to significantly reduce data transfer between the GPU and CPU. Compared with the non-optimized method, a 30-fold reduction in data traffic was obtained, providing important support for convolutional neural networks processing large-scale data and for parallel computing.

2.3 Recurrent Neural Networks

Recurrent neural networks (RNN) are neural networks with fixed weights [18], external inputs, and internal states; they can be viewed as dynamical systems over the internal state, with the weights and external inputs as parameters. Depending on whether the basic variable is the neuron state or the local field state, they can be divided into static neural network models and local field neural network models; depending on how signals are processed, they can be divided into continuous-time and discrete-time systems.

Figure 3 shows an example of a fully unrolled RNN [19]. \( x_{t} \) is the input at step \( t \); \( s_{t} \) is the state of the hidden layer at step \( t \), which serves as the memory of the network. \( s_{t} \) is calculated from the current input and the hidden state of the previous step:

Fig. 3. Unrolled recurrent neural network model

$$ s_{t} = f(Ux_{t} + Ws_{t - 1} ) $$
(13)

\( f \) is generally a non-linear activation function such as \( \tanh \) or \( \mathrm{ReLU} \). \( s_{ - 1} \), the hidden state before the first input, is needed to calculate \( s_{0} \); since it does not exist, it is usually set to \( 0 \) in implementations. \( o_{t} \) is the output of step \( t \), expressed in vector form as

$$ o_{t} = \text{softmax} (Vs_{t} ) $$
(14)

The recurrent neural network introduces a ring structure, so the output at a given time depends not only on the current input but also on the state at the previous moment. Through weight sharing it can handle variable-length sequences, which enhances its robustness.
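The recurrence of Eqs. (13)-(14) can be sketched as follows; tanh is used for \( f \) and \( s_{-1} \) is set to zero, as noted above, while the matrix shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Eqs. (13)-(14): s_t = tanh(U x_t + W s_{t-1}); o_t = softmax(V s_t)."""
    s = np.zeros(W.shape[0])            # s_{-1} initialized to zero
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)      # Eq. (13): new hidden state
        outputs.append(softmax(V @ s))  # Eq. (14): output at step t
    return outputs, s
```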

However, the recurrent neural network is more complicated than the feedforward neural network, with greater computational cost and slower decoding, which limits its application in tasks with high real-time requirements and large amounts of data. To accelerate computation, Zhang et al. [20] proposed a frame-skipping method that reduces computational overhead by regularly dropping overlapping frames, directly reducing the number of frames the neural network must evaluate. It can be applied directly to recurrent neural network models by adding the necessary cross-state transitions to the Hidden Markov Model, while the network structure itself does not change. The method obtains a 2-4x speedup with little loss of accuracy.
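The core idea of frame skipping can be conveyed by a short sketch: the network is evaluated only on every \( k \)-th frame and its output is reused for the skipped frames. This illustrates the idea only; the HMM cross-state transitions used in [20] are omitted.

```python
def frame_skipping(frames, net_step, k=2):
    """Evaluate net_step on every k-th frame and reuse its output in between
    (illustrative sketch of the frame-skipping idea in [20])."""
    outputs, cached = [], None
    for t, frame in enumerate(frames):
        if t % k == 0 or cached is None:
            cached = net_step(frame)   # full network evaluation
        outputs.append(cached)         # skipped frames reuse the cached output
    return outputs
```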

Yosuke et al. [21] proposed a performance model for a distributed Deep Neural Network (DNN) training system called SPRINT (Fig. 4), which takes the DNN architecture and machine specifications as input parameters and accounts for the probability distributions of mini-batch size and gradient staleness that are core parameters of asynchronous SGD (ASGD) training. Using asynchronous GPU processing based on mini-batch SGD, the model estimated the time to process an entire dataset on supercomputers with thousands of GPUs with average errors of 5%, 9%, and 19%. This represents progress in handling large-scale data at speed.

Fig. 4. Timeline of ASGD training [21]

3 Deep Learning Applications in Big Data

This section introduces some applications of deep learning on large-scale data. The achievements of integrating big data and deep learning in engineering applications are mainly reflected in intelligent voice systems and machine vision. As research progresses and technology matures, it is also beginning to show advantages in the following areas.

3.1 Multi-function Network

Although the combination of deep learning and big data already offers a lot of flexibility, a trained network can still only solve one problem at a time. For example, a network is trained either to recognize images or to recognize speech; it cannot do both at once. There is not yet a network that recognizes objects both visually and aurally. Even with multi-task learning techniques, where a network can identify contours, gestures, shadows, text, and more while recognizing image categories, today's deep neural networks remain very limited compared with our versatile human brain.

At present, if an application needs different capabilities, multiple networks must be combined, which not only consumes enormous computing resources but also makes it difficult to form effective interactions between the networks. As for how to enable networks to achieve multiple goals at once, the current inspiration from the human brain is that it may be possible to connect networks responsible for different functions into a larger network. Noam et al. [22] introduced a sparsely gated mixture-of-experts layer (MoE) consisting of up to thousands of feedforward subnetworks, with a trainable gating network that determines a sparse combination of these experts; they then presented model architectures in which an MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than the state of the art at lower computational cost.
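A much-simplified sketch of the sparsely gated MoE idea: a gating network scores the experts, only the top-\( k \) are evaluated, and their outputs are mixed by the renormalized gate weights. The noisy gating and load-balancing terms of [22] are omitted here.

```python
import numpy as np

def moe_layer(x, experts, Wg, k=2):
    """Sparsely gated mixture of experts (simplified from [22]).

    x: input vector; experts: list of callables (the feedforward subnets);
    Wg: gating weight matrix. Only the top-k experts are evaluated.
    """
    logits = Wg @ x
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                             # softmax over selected experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))
```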

3.2 Medical System

Medical institutions are data-intensive industries, whether in pathology reports, treatment plans, or drug reports. Since many viruses and tumor cells are constantly evolving, diagnosing a disease and determining a treatment plan can be very difficult. Deep learning models can be built around these pain points in the medical field, using big data analysis to solve medical problems: for example, behavior analysis, visual query and analysis, multidimensional analysis, retention analysis, funnel analysis, and return-visit analysis can be combined to build deep learning models that analyze a patient's various problems in depth.

Google's DeepMind and IBM's Watson are active in this area; Watson in particular has exceeded human experts in diagnostic accuracy in some specific areas. Because the majority of medical cases are unstructured textual data, deep belief networks stacked from multiple restricted Boltzmann machines can automatically extract features from textual cases, effectively learn the knowledge in medical records, and support efficient diagnosis.

3.3 Translation

Real-time machine translation [23, 24] is one of the most promising directions for applying deep learning to big data. From the original rule-based machine translation built on hand-compiled rules, to later statistical machine translation (SMT), to today's neural machine translation (NMT), translation technology has been continuously updated over the past six decades. In particular, since deep learning technology entered the spotlight around 2012, machine translation accuracy records have been constantly renewed. Translation technology based on deep learning adopts an end-to-end structure: it requires no hand-crafted features, its network structure design is simple, and it needs no word segmentation, alignment, syntax tree design, or other complex engineering, making it very suitable for working with large amounts of data.

Dzmitry and Yoshua [25] observed that mapping a source-language sentence into a fixed-length hidden vector through an encoder is the bottleneck limiting NMT translation quality: when the decoder generates the target-language sentence, each output word actually relates to only part of the input sentence. Based on this, they proposed an "attention mechanism" that lets the model automatically find the key word-to-word correspondences between the source sentence and the target sentence, without restricting the length of the hidden vector, and gave a content-based method for computing attention. The essence is to use a bidirectional Long Short-Term Memory (LSTM) network to learn the importance of each word in the source sentence. This method is very effective and can also be applied to deep learning in many other large-scale-data fields, such as question answering, sentence-level inference over large corpora, entity extraction, and long document generation (Fig. 5).

Fig. 5. Graphical illustration of the proposed model generating the \( t \)-th target word \( y_{t} \) given a source sentence \( (x_{1} ,x_{2} , \ldots ,x_{T} ) \) [25]
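The content-based attention computation can be sketched as follows: each encoder state is scored against the previous decoder state, the scores are normalized by a softmax, and the context vector is the resulting weighted sum. The additive scoring form follows [25], but the parameter shapes are assumptions and the surrounding encoder/decoder is omitted.

```python
import numpy as np

def attention_context(s_prev, h_enc, Wa, Ua, va):
    """Attention of [25]: e_j = va . tanh(Wa s_prev + Ua h_j);
    alpha = softmax(e); context = sum_j alpha_j h_j."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in h_enc])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # attention weights over source words
    return alpha @ np.stack(h_enc), alpha     # context vector and weights
```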

3.4 Playing Strategic Games

While people were still shocked by the impact of AlphaGo, the DeepMind team brought a new surprise without pause. On October 18 (London time), DeepMind [26] announced the most powerful version of AlphaGo yet, code-named AlphaGo Zero, which trains by reinforcement learning through self-play. After several days of training and almost 5 million games of self-play, AlphaGo Zero was able to surpass humans and defeat all previous versions of AlphaGo. Its neural network takes the board position \( s \) as input and outputs a vector of move probabilities \( p \) with components \( p_{a} = \Pr (a|s) \) for each action \( a \), together with a scalar value \( v \approx E[z|s] \) estimating the expected outcome \( z \) from position \( s \). AlphaGo Zero learns these winning probabilities entirely from self-play, and the results are then used to guide the program's search.
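Schematically, the dual-head output described above can be written as below; `forward` stands for the (unspecified) network body, and the softmax/tanh heads are a plausible reading of the text rather than DeepMind's exact implementation.

```python
import numpy as np

def policy_value(s, forward):
    """AlphaGo Zero-style heads (schematic): position s -> (p, v), where
    p_a = Pr(a|s) over actions and v estimates the expected outcome E[z|s]."""
    logits, v_raw = forward(s)        # assumed network body returning two heads
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax over moves
    return p, float(np.tanh(v_raw))   # value squashed into [-1, 1]
```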

The DeepMind team said in its official blog that Zero combines an updated neural network with a reorganized search algorithm. As training deepened, the performance of the system improved little by little; through the powerful combination of neural network and search algorithm, the results obtained from self-play kept getting better while the neural network became more and more accurate.

4 Opportunities and Challenges

Current deep learning algorithms and big data technologies perform far below their theoretically achievable performance. Unsupervised learning can address real-world recognition of very large object sets: by training directly on unlabeled data without the intervention of hand-built models, it can learn the rules, patterns, and features of large-scale data and uncover knowledge that the human brain cannot directly extract, which humans can then use to solve practical problems. Fully tapping the hidden value in big data can serve human life.

Although existing data volumes are already large, they are still not enough: the complexity, dimensionality, and diversity of common datasets do not cover all possible boundary conditions in the real world. In existing distributed systems, large amounts of data and parameters must be transmitted between nodes, and the communication cost is high; once the number of nodes exceeds a certain point, further speedup cannot be obtained. Designing such distributed systems requires DNN algorithm experts and systems experts to work together: the solution may require modifying algorithms to match the underlying hardware architecture, and it also requires systems experts to design powerful single-node machines as well as high-density, efficiently communicating servers. Second, deep learning models on big data involve very large data and computation volumes, often requiring weeks or even months of training; parallel training is necessary to improve training speed, but how to coordinate and synchronize different data across multiple nodes may need to be redesigned from an algorithmic perspective.

5 Conclusion

This study introduced three models of deep learning along with the challenges and applications of each model in big data environments. In this era of massive data growth, the first issue in handling vast amounts of data is how to effectively analyze and process it and mine its value. Deep learning methods play a key role in processing big data by adaptively extracting internal representations from the data, minimizing human involvement, and providing greater generalization. If artificial intelligence is likened to a rocket, then deep learning is the rocket engine and big data is the rocket fuel; only together can they launch the rocket into space. In the context of big data, as deep learning research continues to deepen, the efficient combination of the two will surely make computers more intelligent, assisting human decision-making and bringing benefits to mankind.