1 Introduction

To meet the challenge of exascale computing, breakthroughs are expected in four aspects [1]: the execution model, the memory access mode, parallel algorithms, and the programming interface. As far as the execution model is concerned, the dataflow execution model dramatically increases both scalability and efficiency, driving future computing across the exascale performance region and possibly toward zettaflops. Therefore, as the number of cores in host systems increases (by one to two orders of magnitude over current systems), exascale computing is expected to gradually migrate to the dataflow execution model.

Dataflow is a parallel execution model originally developed in the 1970s and since explored and enhanced as the basis for non-von Neumann computing architectures and techniques. The dataflow model of computation is a natural choice for achieving concurrency, synchronization, and speculation. In its basic form, an activity in a dataflow model is enabled as soon as it has received all the necessary inputs; no other trigger is needed. Thus, all enabled activities can execute concurrently if functional units are available, and the only synchronization among computations is the flow of data.

Recent research on dataflow systems can be classified into three categories: 1) dataflow systems on dedicated customized hardware, 2) macro-dataflow (hybrid dataflow/control-flow) systems on off-the-shelf platforms or customized hardware, and 3) application-level systems built on an underlying dataflow execution software engine.

Dedicated hardware can execute fine-grained dataflow tasks and achieve excellent performance. The Maxeler system couples x86 CPUs with FPGA dataflow engines (DFEs) and speeds up computation by 80/120 times for small/large problems [2]. The Tera-op Reliable Intelligently adaptive Processing System (TRIPS) [3, 4] employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE). On simple benchmarks, TRIPS outperforms the Intel Core 2 by 10%, and with hand-optimized TRIPS code it outperforms the Core 2 by a factor of 3 [3].

Most macro-dataflow (hybrid dataflow/control-flow) systems use an instruction-clustering paradigm: instructions are clustered into a thread and executed sequentially under a conventional program counter. Teraflux [5] is a four-year project started in 2010 within the Future and Emerging Technologies Large-Scale Projects funded by the European Union. The execution model of Teraflux [6] relies on exchanging dataflow threads in a producer-consumer fashion.

Application-level dataflow software includes a variety of programs. LabVIEW (Laboratory Virtual Instrument Engineering Workbench) [7] is a system-design platform and development environment for visual programming; it uses a dataflow model to execute code. Orcc (Open RVC-CAL Compiler) [8] converts dataflow-style code descriptions into C code that runs in a multi-core environment. The deep learning framework TensorFlow adopts a dataflow engine to perform forward and backward propagation, and the Sunway TaihuLight supercomputer is equipped with a dataflow execution engine adapted from TensorFlow [9].

Besides those dataflow systems, some research focuses on lightweight dataflow solutions. DSPatch is a C++ dataflow framework library. StreamIt is a language for streaming applications based on dataflow execution; a StreamIt programmer constructs the dataflow graph explicitly in a natural textual syntax.

DFC is a tiny extension of the C language that adds the ability to describe dataflow tasks without requiring any special underlying hardware [10]. Because it is based on C, it is much easier for programmers to adopt. This also makes it tightly coupled with Unix-like operating systems, which means other upper system-software layers can be built in the DFC environment.

This paper is organized as follows. In Sect. 2, we give a brief overview of DFC. In Sect. 3, the DFC runtime is introduced. We present a simple evaluation of DFC in Sect. 4. The conclusion is drawn in Sect. 5.

2 Overview of DFC

By extending the C language, DataFlow C (DFC for short) provides an efficient way to develop highly parallel dataflow programs easily without sacrificing performance, thereby improving programmers' productivity. To set up a dataflow graph, all that is required is to implement DF functions, a new type of function defined in DFC. The body of a DF function explicitly defines how a node of the dataflow graph works, and the edges are derived from the output arguments of the preceding nodes and the input arguments of the following nodes. After all DF functions are implemented, the compiler automatically builds the directed acyclic graph (DAG) of the dataflow graph during compilation. Before the DAG is executed, the DFC runtime library creates the number of threads specified by the programmer. When all input data for a DF function are ready, the DF function is triggered (fired) and a thread is assigned to carry out the computation it defines. Moreover, to pursue higher performance, if more than one group of input data, corresponding to different passes, is ready, several threads may execute the same DF function on these passes' input data in parallel.

2.1 DF Function

The DF function is a special kind of function in DFC and one of its most important features: it is the construct used to describe and build the dataflow graph. A DF function is defined by a statement of the form shown in Code 1.

[Code 1]

Note that there is a semicolon inside the argument list of a DF function. This differs from a normal C function and is in fact the only syntactic difference between C and DFC. The semicolon divides the DF function's argument list into two lists: the input argument list and the output argument list.
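As an illustrative sketch (Code 1 itself is not reproduced here, so the names below are assumptions rather than the actual listing), a DF function that consumes two inputs and produces one output could be declared as follows:

void DF_Add(int *a, int *b; int *sum)   /* the semicolon splits inputs from outputs */
{
    /* The body defines the node's computation; the runtime delivers the
       inputs and forwards the output implicitly. */
    *sum = *a + *b;
}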

As mentioned before, a dataflow graph is easy to build by implementing DF functions. A DF function describes the finest-grained execution step of the dataflow graph, so every node in the graph is declared as a DF function. By matching the output arguments and input arguments of neighboring DF functions (nodes), the compiler connects adjacent nodes. Moreover, a DF function implicitly fetches data from its input channels and pushes output data to its output channels without any explicit coding.

Consider the dataflow graph shown in Fig. 1; it requires four DF functions. The corresponding DFC source code framework is shown in Fig. 2.

Fig. 1. A demo of dataflow graph

Fig. 2. A demo of DFC source code

In the demo of Fig. 2, four DF functions are defined, corresponding to the four nodes (A, B, C and D) in Fig. 1 respectively. During compilation, by matching the input and output parameters of the DF functions, the compiler establishes the connections between nodes: if an output argument of a preceding node has the same symbol name as an input argument of a succeeding node, there is a directed edge from the former to the latter. Compared with the C++-based dataflow framework DSPatch, which requires not only defining the computing nodes but also connecting them explicitly, it is clearly more convenient to build an irregularly structured dataflow graph in DFC.

Additionally, as Fig. 2 shows, a DF function does not need to have both input and output arguments. In general, DF functions can be divided into Source DF functions, Sink DF functions and normal DF functions. A Source DF function has no input arguments; it is the start of the dataflow graph and is usually responsible for getting data from memory or from the network. A Sink DF function consumes input data and produces no output data.
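A minimal sketch of such a framework is given below. Since the topology of Fig. 1 is not reproduced here, the sketch assumes a diamond-shaped graph in which the Source node A feeds B and C and the Sink node D consumes their results; all function and variable names are illustrative.

#include <stdio.h>
#include <stdlib.h>

void DF_A(; int *x, int *y)        /* Source DF function: no inputs */
{
    *x = rand();                   /* e.g. fetch data from memory or network */
    *y = rand();
}

void DF_B(int *x; int *u)          /* normal DF function */
{
    *u = *x + 1;
}

void DF_C(int *y; int *v)          /* normal DF function */
{
    *v = *y * 2;
}

void DF_D(int *u, int *v;)         /* Sink DF function: no outputs */
{
    printf("%d\n", *u + *v);
}

The edges A-B (via x), A-C (via y), B-D (via u) and C-D (via v) are derived automatically from the matching argument names.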

2.2 Active Data

Active Data, abbreviated as AD, is the carrier of the data that flows through the dataflow graph. Intuitively, it is AD that flows from one node to another. In the DFC runtime environment, flowing data must be wrapped into an AD so that it can be transmitted between nodes, regardless of its size and type. Programmers do not need to worry about extracting data from an AD or wrapping it up; they simply manipulate the data as usual. The DFC runtime handles memory allocation/deallocation, data tagging for the various passes, and synchronization for concurrent accesses. DFC also guarantees that the size and type of the real data are correct when a DF function fetches data from its input AD channel.

DFC allows a node to generate new output before its previous output has been consumed. The data channel, i.e. the directed edge, is implemented as a FIFO queue: new data is added at the tail and removed from the head. When a node outputs data, it gives its succeeding nodes a chance to be triggered if all of their input data are ready, and data in the FIFO queue is removed once it has been consumed by the succeeding nodes.

2.3 Program Framework in DFC

Code 2 shows a demo of the program framework in DFC, which corresponds to the dataflow graph shown in Fig. 1. After implementing all DF functions, the programmer simply calls the DF_Run function to execute the dataflow graph. DF_Run takes one int argument, Round. If the value of this argument is a positive integer, the dataflow graph is executed for Round times (passes). Otherwise, the execution of the dataflow graph is stopped by calling the DF_Source_Stop function inside a Source DF function when the preset conditions are met. All Source DF functions should call DF_Source_Stop at the same pass, so that the dataflow graph stops executing without generating new data. For each DF function, an int variable named DF_count is declared implicitly, which records the current pass of this DF function. The programmer can therefore use DF_count directly without declaring it, but cannot redeclare another variable with the same name inside a DF function. Furthermore, in order to adapt to different hardware environments, an appropriate number of threads can be specified with the macro command "#define THREADNUM" to achieve the desired parallelism.

[Code 2]
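Since Code 2 itself is not reproduced here, the following sketch shows what such a framework might look like for the assumed graph above; THREADNUM, DF_Run, DF_Source_Stop and DF_count come from the description in the text, while the pass limit and argument values are illustrative.

#include <stdlib.h>

#define THREADNUM 8              /* size of the worker thread pool */

void DF_A(; int *x, int *y)      /* Source DF function */
{
    if (DF_count >= 1000)        /* DF_count: implicit per-function pass counter */
        DF_Source_Stop();        /* stop the graph when the preset condition is met */
    *x = rand();
    *y = rand();
}

/* DF_B, DF_C and DF_D as sketched in Sect. 2.1 */

int main(void)
{
    DF_Run(1000);                /* positive argument: run 1000 passes; a
                                    non-positive argument would run until every
                                    Source DF function calls DF_Source_Stop */
    return 0;
}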

3 DFC Runtime

Hiding the complex implementation details, the DFC runtime library provides the environment that supports dataflow programs. The DFC runtime is implemented carefully, because it has a great impact on performance. It is introduced below from three aspects: 1) how DFC manages the DAG of the dataflow graph; 2) how DFC triggers the dataflow computation; and 3) how DFC distinguishes data of different passes.

3.1 DAG Management

In DFC, for a directed edge, the output channel of the preceding node and the input channel of the succeeding node share the same AD channel. An AD is a global variable and is, in effect, the edge of the DAG: even if more than one node takes the same AD as input, only one copy of this AD exists in memory, which saves the memory needed to store data.

There are several important data structures in DFC: DF_AD, DF_FN and DF_TFL. An AD is implemented by a structure named DF_AD, which records the address of the real data. DF_FN describes a node of the DAG and records the information of the corresponding DF function, including the entry address of the DF function, its input AD list and its output AD list. Thus, to connect two neighboring nodes, it suffices to add a pointer to the shared DF_AD to the output list of the preceding DF_FN and to the input list of the succeeding DF_FN. DF_TFL is a table that records the information and configuration of the dataflow graph. It stores the addresses of all nodes, so that the DAG of the dataflow graph can be derived from it.
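The field names below are assumptions based on the description above rather than the actual DFC definitions; the sketch only illustrates how the three structures relate to each other.

typedef struct DF_AD {
    void *data;            /* address of the real (queued) data */
} DF_AD;

typedef struct DF_FN {
    void  (*entry)(void);  /* entry address of the DF function */
    DF_AD **in_list;       /* input AD list */
    int     in_num;
    DF_AD **out_list;      /* output AD list */
    int     out_num;
} DF_FN;

typedef struct DF_TFL {
    DF_FN **nodes;         /* addresses of all nodes; the DAG is derived from
                              the ADs shared between their in/out lists */
    int     node_num;      /* plus configuration such as the thread number */
} DF_TFL;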

3.2 Tasks Trigger and Execution

Unlike some other dataflow programming frameworks, in DFC the threads and the nodes of the dataflow graph are independent, or decoupled. A thread does not work for a specific node only; it can execute a task defined by any node. Tasks are pushed into a queue, and a free thread takes a task from the queue. Therefore, when DF_Run is called to execute the computation defined by the dataflow graph, a thread pool containing the specified number of threads is created first. If the task queue is not empty, a free thread picks the task at the head of the queue and removes it; if no task is ready, the thread blocks until a new task arrives.
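A minimal sketch of such a worker loop using POSIX threads; the queue layout and names are assumptions, not the actual DFC internals.

#include <pthread.h>

typedef struct df_task {
    void (*fn)(void *);          /* computation defined by some node */
    void *args;                  /* its ready input data */
} df_task_t;

typedef struct task_queue {
    df_task_t       items[256];  /* FIFO task queue */
    int             head, len;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} task_queue_t;

/* Each of the THREADNUM pool threads runs this loop: block while the
   queue is empty, otherwise take the task at the head and execute it. */
static void *worker(void *arg)
{
    task_queue_t *q = (task_queue_t *)arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->len == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        df_task_t t = q->items[q->head];
        q->head = (q->head + 1) % 256;
        q->len--;
        pthread_mutex_unlock(&q->lock);
        t.fn(t.args);            /* any free thread can run any node's task */
    }
    return NULL;
}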

After the thread pool is created, the main thread calls the DF_Loop function to keep adding new tasks to the task queue until the execution stops. More precisely, the main thread is only in charge of adding the tasks defined by the Source DF functions, the start of the dataflow graph, so that new passes can be triggered. Because a dataflow program is data driven, tasks defined by the other DF functions, i.e. Sink DF functions and normal DF functions, are triggered when their preceding nodes output data. To improve performance and save memory, an appropriate strategy is needed to decide when to push a task defined by a Source DF function into the task queue. In DFC, such a task is added only when the length of the task queue is less than the number of Source DF functions and the Source DF function has not been stopped. Because of the particularity of Source DF functions compared with the other DF functions, a dedicated structure, DF_SI, is used to describe a Source DF function. The members of DF_SI are a DF_FN, a status flag 'stop' and the remaining running times 'count'. As mentioned before, when DF_Run is called with a positive integer argument, for every DF_SI its 'stop' is set to 0 and its 'count' is set to this positive argument. Every time a Source DF function is scheduled, its 'count' decreases by 1. When 'count' reaches 0, its 'stop' is set to 1 and this Source is not scheduled any more. Of course, calling DF_Source_Stop inside a Source DF function also stops that Source DF function, even if its 'count' is still greater than 0.
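A sketch of DF_SI and the enqueue condition described above; the member and function names are assumptions.

typedef struct DF_SI {
    struct DF_FN *fn;   /* node descriptor of the Source DF function */
    int stop;           /* set to 1 once this Source must not run again */
    int count;          /* remaining passes when DF_Run was given a positive Round */
} DF_SI;

/* The main loop (DF_Loop) enqueues a Source task only while the task
   queue is shorter than the number of Source DF functions and the
   Source has not been stopped. */
static int source_can_fire(const DF_SI *si, int queue_len, int source_num)
{
    return queue_len < source_num && !si->stop;
}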

When a node finishes computing, it needs to update its output AD. As mentioned before, the same DF function may be executing with the input data of several different passes, so the output data of the current pass must be written after the data of the previous passes. Inside an AD, the queue is implemented as a resizable array, shown in Fig. 3. Each element of the queue consists of the data and a Fanout. Fanout is a counter indicating how many times this data can be accessed in one pass. When the AD is updated, the output data is written into the data buffer of a queue slot and Fanout is initialized to the number of nodes that take this data as input. Every time the data is accessed, Fanout is decreased by one. When Fanout reaches 0, no node will access this data anymore, so the data is discarded and the head of the queue moves to the next element.

Fig. 3. AD queue
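An illustrative consume step for one element of the AD queue; the names are assumptions and the resizing of the array is omitted.

typedef struct ad_slot {
    void *data;      /* output data of one pass */
    int   fanout;    /* how many consumers may still read this element */
} ad_slot_t;

/* Called each time a succeeding node reads the element at the head:
   when the last consumer has read it, discard it and advance the head. */
static void ad_consume(ad_slot_t *queue, int *head)
{
    ad_slot_t *slot = &queue[*head];
    if (--slot->fanout == 0) {
        slot->data = NULL;       /* data no longer needed */
        (*head)++;               /* head points to the next element */
    }
}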

After updating its output AD, the preceding node tries to trigger its succeeding nodes. As shown in Fig. 4, each DF_FN holds a flag list that records the ready status of the node's input data. A flag is an integer, and each of its bits corresponds to one AD that the node needs. If a node needs n AD inputs, the n lowest bits of its flag are valid. When an input datum is ready, the corresponding bit is set to 1; otherwise the bit stays 0 until the datum becomes ready. Only when the flag equals \(2^n - 1\) are all of the node's input data ready. For example, if a node needs 8 AD inputs, the 8 lowest bits record the status of its inputs, and the binary number '0001 1011' shown in Fig. 4 means that the first to the fifth ADs are ready except for the fourth one. When the flag equals '1111 1111', all the input data are ready and the task defined by this node can be fired. Because there may be multiple groups of input data corresponding to different passes, a flag list is needed to store the ready status of the input data of all these passes.

Fig. 4. The ready status list

Before the preceding node finishes, it must do one last thing: modify the flags of its succeeding nodes. It goes through the ready status list and sets the corresponding bit of the appropriate flag to 1; if there is no flag whose corresponding bit is still 0, a new flag is created and its corresponding bit is set to 1. The first flag is then checked, and if it shows that all input data are ready, the task defined by the succeeding node is pushed into the task queue.
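A sketch of the flag handling described above, with one bitmap per pending pass; the structure and names are assumptions.

#include <stdint.h>

#define MAX_PENDING 16

typedef struct ready_flags {
    uint32_t flags[MAX_PENDING]; /* one ready bitmap per pending pass */
    int      used;               /* number of bitmaps currently in use */
    int      n_inputs;           /* number of AD inputs of the node */
} ready_flags_t;

/* Mark input `idx` ready in the first flag whose bit is still 0 (creating
   a new flag if none exists), then report whether the first flag is now
   complete, i.e. equals 2^n - 1, so the task can be pushed to the queue. */
static int mark_ready(ready_flags_t *rf, int idx)
{
    uint32_t bit = 1u << idx;
    int i;
    for (i = 0; i < rf->used; i++)
        if ((rf->flags[i] & bit) == 0)
            break;
    if (i == rf->used)
        rf->flags[rf->used++] = 0;
    rf->flags[i] |= bit;
    return rf->flags[0] == (1u << rf->n_inputs) - 1;
}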

3.3 Distinguish Data of Different Passes

As mentioned before, to boost performance, DFC allows multiple passes of the graph to be executed at the same time. This means that multiple tasks defined by the same node may be fired, which can lead to data hazards, so measures must be taken to ensure the order of data accesses. In DFC, each AD records, in an integer variable named Order, the pass to which the first valid element of its data queue corresponds. At the K-th pass, a node gets its data from the (K - Order)-th element relative to the first element of the data queue. Likewise, when outputting data, the data of the current pass must be inserted into the data queue after the data of the previous passes.
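A one-line illustration of this indexing rule (assuming, as above, that the queue head holds the data of pass Order; names are illustrative):

/* At pass k, the data for this pass sits (k - order) slots after the
   first valid element, whose pass number is `order`. */
static int ad_index_for_pass(int head, int order, int k)
{
    return head + (k - order);
}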

4 Evaluation

4.1 Experimental Environment

We have compared the performance of DFC and DSPatch on machines with different numbers of cores. The first is based on an Intel® Xeon® Gold 6130 CPU with 32 cores and 64 hardware threads; its operating system is Ubuntu with kernel version 4.15.0, and the GCC version is 7.4.0. The second is based on an Intel® Xeon® E5-2620 v4 with 16 cores and 32 hardware threads; its operating system is Ubuntu with kernel version 5.4.0, and the GCC version is 7.5.0.

4.2 Brief of DSPatch

DSPatch is a powerful cross-platform dataflow framework based on C++ with an easy-to-use object-oriented API. DSPatch is designed around the concepts of a "circuit" and "components": the circuit manages the dataflow graph, and a component defines a node. A customized component must extend the DSPatch::Component base class and implement the inherited virtual Process_() method. A demo of a customized component is shown in Code 3. In this demo, a new "And" component implementing the logic "and" operation is defined. First, SetInputCount_() and SetOutputCount_() are called in its constructor to set the number of inputs and outputs respectively. The component's Process_() method defines the computation of the node: input data is obtained by calling SignalBus::GetValue() explicitly, and SignalBus::SetValue() is called to set the output data. After all components are defined, they are added to the circuit by calling Circuit::AddComponent(), and Circuit::ConnectOutToIn() is called to connect them. Code 4 shows a demo of a dataflow program in DSPatch.

[Code 3]
[Code 4]

There is a small difference in task triggering between DFC and DSPatch. When Circuit::Tick() is called to run the dataflow graph for one pass, the circuit thread goes through all components and ticks them. A succeeding component may be ticked before its preceding component, in which case the succeeding component ticks the preceding one; in DFC, by contrast, a succeeding node is always triggered by its preceding nodes. Moreover, in DSPatch a thread is bound to a component: a thread works for only one component, and to enable the concurrency of multiple passes, multiple independent groups of threads are needed.

4.3 Experiment Results

We have compared DFC with DSPatch on the same simple dataflow graph, shown in Fig. 5. Table 1 presents the number of code lines needed to build the graph of Fig. 5 in DFC and DSPatch respectively. The DFC code has 40 lines, which is much shorter than the DSPatch code; for such a simple dataflow graph, the DFC program is more concise than the DSPatch one.

Fig. 5. A simple dataflow graph

Table 1. Comparison of DFC with DSPatch

We have also compared the code needed to build a graph of two binary trees connected back to back, as shown in Fig. 6. The first node randomly generates an integer array; the nodes from the second layer to the (n-1)-th layer simply split their input array evenly into two arrays; the nodes of the n-th layer bubble-sort their input array; and the remaining nodes merge their two sorted input arrays.

Fig. 6. Binary trees connected back-to-back

Table 3, Fig. 7(a) and Fig. 7(b) present the total number of code lines and the number of assistant code lines for various depths of the tree, where the assistant code is the code that describes the graph. The number n is the depth of the binary tree on the left, as shown in Fig. 6, and the abscissa is the base-two logarithm of n. The ordinate is also logarithmic. In these cases, the DFC programs are coded in a static style, which means all nodes are declared individually. The DSPatch (dynamic style) programs are coded in a dynamic style, in which each level of the tree is constructed in a for-loop iteration. The DSPatch (static-multi-declaration) programs declare all node classes separately and build the graph without a loop, which corresponds to the case of an irregular graph whose nodes all differ. The DSPatch (static style) programs do not declare node classes repeatedly and only build the graph without any loop. Therefore the DSPatch (static-multi-declaration) and DSPatch (static style) programs share the same number of assistant code lines.

Fig. 7. (a) The total number of code lines in DFC and DSPatch; (b) the number of assistant code lines in DFC and DSPatch

For a well-structured graph, such as the back-to-back-connected binary trees shown in Fig. 6, DSPatch can code the graph in a dynamic style and construct it with iteration code. If the DSPatch program is coded with nested loops, the number of code lines stays constant for any depth. In DFC, however, that graph must be coded statically, which leads to many more code lines. Because an irregularly structured graph cannot be programmed with loop/iteration code, DFC is preferable for irregularly structured graphs, or for small-scale graphs, thanks to its easier coding and fewer code lines.

However, the DFC runtime needs far fewer threads than DSPatch to reach the same parallelism, as Table 4 shows.

Figures 8(a) and 8(b) present the time consumed by DFC and DSPatch on the problem of Fig. 5, on the 64-core and 32-core machines respectively, for 4096 runs. The thread number of the DFC program and the buffer size of DSPatch range from \(2^0\) up to \(2^9\), which corresponds to their parallelism. The abscissa, representing the parallelism, is a logarithmic coordinate with base two. The ordinate on the left shows the time consumed by the programs, and the ordinate on the right shows their speedup relative to the serial version.

Fig. 8. (a) Time consumption on the 64-core machine; (b) time consumption on the 32-core machine

Figures 8(a) and 8(b) show that DFC has much better performance when the parallelism is below 16 (2^4) on both the 64-core and the 32-core platforms. As the parallelism increases, DFC and DSPatch show similar performance, but when the parallelism exceeds the number of physical cores (32 or 64), DFC performs slightly worse than DSPatch.

5 Conclusion and Future Works

In this paper, we introduce the DFC dataflow language and its runtime environment. Like other dataflow languages, DFC makes it convenient to build a dataflow program without worrying about concurrency, synchronization and deadlock, whereas traditional parallel programming techniques (such as MPI or OpenMP) must handle these error-prone problems explicitly and carefully.

A DSPatch program needs less code than a DFC program to describe a given regular, well-structured dataflow graph, whereas for irregularly structured graphs, or for small-scale graphs, DFC is preferable thanks to its easier coding and fewer code lines.

The DFC runtime library is in charge of constructing the DAG of the dataflow graph, firing the DFC tasks and synchronizing the tasks of successive passes. Based on a carefully implemented thread pool and queued Active Data, the DFC runtime shows good performance compared with DSPatch. For the problem shown in Fig. 5, DSPatch needs 5n+1 threads to achieve a parallelism of n, while DFC needs only n+1 threads. DFC therefore consumes fewer system resources and is preferable for large-scale dataflow graphs. Moreover, for parallelism below 16, DFC outperforms DSPatch in terms of execution time.

Although DFC performs better when the parallelism stays below the number of cores, DSPatch shows better scalability than DFC for such a simple graph, as it achieves a higher speedup when the number of threads exceeds the number of physical cores. We are enhancing DFC with a tracing facility that records the timeline of each DF function's duration in order to spot performance bottlenecks.

For now, DFC is still a prototype language and cannot construct the graph dynamically, which makes it inefficient for well-structured dataflow graphs. We will extend the syntax to support dynamic graph construction in the next version of the implementation.