
1 Introduction

The aim of this paper, and of the corresponding tutorial presented at ICWE 2019 in Daejeon, Korea, is to present the most useful tools, e.g. cat, grep, tr, sed, awk, comm, uniq, join, split, etc., and to give an introduction on how they can be used together. For example, a wide range of queries that would typically be formulated in SQL can also be answered using the aforementioned tools, as will be shown in the tutorial. Likewise, selective data extraction from different webpages and the recombination of this information (mashups) can easily be performed.

2 Filters and Pipes

The underlying architectural pattern of stream-based data processing is called filters and pipes [1], originally suggested by McIlroy et al. [2]. The general idea is to utilize a set of useful programs (filters) that can be glued together using pipes (loose coupling). The programs themselves are called filters, but besides filtering, all sorts of operations (e.g. sorting, aggregation, substitution, merging, etc.) can be implemented. A typical filter takes its input from standard input (STDIN) and/or one or more files, performs some operation on this input, and generates a result on standard output (STDOUT) or in one or more files. By connecting multiple programs via pipes, the output of one filter acts as the input of the next filter (see Fig. 1). Data transferred through the filters is often in ASCII format (but does not have to be).

Fig. 1. Composing programs with filters and pipes.
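To give an impression of such a composition, a minimal sketch of a pipeline could look as follows (the log file name, its format, and the extracted field are purely illustrative assumptions, not taken from the tutorial):

  cat access.log | grep '404' | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -10

Here cat emits the file, grep keeps only lines containing 404, cut extracts the seventh space-separated field, and the remaining filters count the distinct values, rank them by frequency, and keep the ten most frequent ones.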

Technically, the pipe is represented by the vertical bar character (|). Additionally, it is possible to redirect data from a file to the standard input channel of a filter, as well as to redirect the standard output to a file. This redirection is typically done with the ‘<’ and ‘>’ symbols. Besides STDIN and STDOUT, there is also the standard error output channel (STDERR), which is used in particular for error messages and debug output. The idea of composing complex programs from small, well-defined components allows rapid prototyping, incremental iteration, and easy experimentation [3].
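As a hedged illustration of these redirections (the file and pattern names are assumptions), the following command reads its input from input.txt, writes matching lines to hits.txt, and sends error messages to errors.txt:

  grep 'pattern' < input.txt > hits.txt 2> errors.txt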

3 Classification of Commands

A variety of tools have been developed for analyzing, composing, and transforming data streams and files. These can be classified as follows:

File Inspection:

This category includes programs such as less, head, and tail, which allow users to view and inspect files of any size. While head and tail only show the beginning or the end of a file, less allows the user to browse and search the file interactively. The less command is also an exception in that it is the only program among those presented here that accepts user input.
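A few illustrative invocations (the file names are assumptions, not taken from the tutorial):

  head -n 20 city.csv    # show the first 20 lines
  tail -n 20 city.csv    # show the last 20 lines
  less city.csv          # browse and search interactively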

Filtering:

There exist tools for line- or column-wise filtering of data. If the input is comma- or tab-separated, the tool cut can be used to extract single or multiple columns (projection) and write them to the standard output (awk, which will be described later, also has this ability). The grep, sed, and awk command-line utilities allow line-by-line filtering of the input data. In addition to the typical comparison operators, regular expressions can also be used, which makes these tools very powerful. grep can also be applied to unstructured data (text), where arbitrary patterns can be specified and only the lines matching the pattern are returned.
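A sketch of typical filtering operations (column positions, patterns, and the file name city.csv are assumptions for illustration):

  cut -d',' -f1,3 city.csv             # project the first and third comma-separated column
  grep -i 'germany' city.csv           # keep lines matching a pattern, case-insensitively
  awk -F',' '$4 > 1000000' city.csv    # keep lines whose fourth column exceeds 1,000,000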

Sorting:

The sort command allows the specification of one or more complex sort keys on which an input file is sorted line by line. Internally, sort is implemented as a file-based merge sort, which allows even huge files to be sorted without requiring much main memory. Some other programs, such as comm, join, and uniq, require sorted input, and thus sort is needed quite often.
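For instance, a multi-key sort over a comma-separated file (the column positions and file name are assumptions) can be sketched as:

  sort -t',' -k1,1 -k3,3nr city.csv    # sort by column 1 ascending, then by column 3 numerically descending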

Substitution Commands:

sed and tr support the transformation of the input data. Whereas tr acts on the character level, sed allows the specification of complex substitution rules using regular expressions.
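Two minimal sketches of such substitutions (the patterns are illustrative; input is read from STDIN):

  tr 'A-Z' 'a-z'                 # map upper-case characters to lower case
  sed 's/[0-9][0-9]*/NUM/g'      # replace every digit sequence with the token NUM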

Composition and Splitting:

Operations in this category allow the column- or row-wise composition or splitting of data. Additionally, the join command allows column-wise composition based on identical join fields. Figure 2 compares the functionality of the different operations.

Fig. 2. Composition of input streams/files using cat, split, cut, and paste.
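A hedged sketch of these operations (file names and the join field are assumptions for illustration):

  cat part1.csv part2.csv > all.csv            # row-wise concatenation
  split -l 100000 all.csv chunk_               # split into pieces of 100000 lines each
  paste -d',' names.csv numbers.csv            # column-wise composition
  join -t',' -1 1 -2 1 city.csv country.csv    # join two sorted files on their first columns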

Aggregation:

uniq and wc fall into this category. While wc counts the number of input lines, words, or characters, uniq handles duplicates in a file or stream: it reports or omits repeated adjacent lines. Since only adjacent repetitions are detected, uniq is most frequently used in conjunction with a preceding sort command.
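For instance (names.txt being an illustrative file name):

  sort names.txt | uniq -c | sort -rn    # frequency of each distinct line, most frequent first
  wc -l names.txt                        # number of lines in the file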

Comparison:

diff and comm are representative of this category. Both programs compare two files line by line. While diff reports the differences found, comm requires the two input files to be sorted and works with set semantics: as a result, it reports entries that appear only in the first file, only in the second file, or in both. Using this command, set-based intersection and minus (difference) operations can be implemented.
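Assuming two sorted input files a.txt and b.txt (illustrative names), the set operations can be sketched as:

  comm -12 a.txt b.txt    # intersection: lines occurring in both files
  comm -23 a.txt b.txt    # minus: lines occurring only in the first file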

4 Programs sed and awk

sed stands for Stream Editor. In contrast to a “normal” editor, sed has no interaction with a user; instead, insert, update, substitute, and delete commands must be formulated in a command file or on the command line. This allows the automation of recurring (and also complex) editing tasks. As a very powerful feature, sed allows the specification of addresses (ranges of lines) to which the operations to be performed are restricted. Addresses can be specified as line numbers, strings, regular expressions, or a combination of them. For example, the following command removes all JavaScript sections (from <script to </script>) from an HTML file:

figure a
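A minimal sketch of such a range deletion, assuming the opening and closing tags start on lines of their own (page.html is an illustrative file name; the original listing may differ):

  sed '/<script/,/<\/script>/d' page.html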

awk, in contrast, is a complete programming language, supporting loops, conditionals, arrays, and dictionaries (associative arrays). It is mainly used on structured input data and for generating reports. Both sed and awk support regular expressions.
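As a small, hedged illustration of these language features (the file name, field separator, and column positions are assumptions), the following awk program sums a numeric column per group and prints a report using an associative array:

  awk -F',' '{pop[$2] += $4} END {for (c in pop) print c, pop[c]}' city.csv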

5 Examples

As a more complex example of processing structured data, the re-implementation of a complex SQL select statement using pipes and filters will be shown. Consider as input a relational table (table city), resp. a CSV file (city.csv), consisting of city records. The two statements, compared in Table 1, return all countries which have more than 100 cities in the database, together with the number of related cities. The output is sorted by decreasing number of cities per country.

Table 1. Comparison between SQL and piped UNIX commands
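A hedged sketch of what the two formulations compared in Table 1 might look like (assuming a column countrycode, resp. the second comma-separated column of city.csv without a header line; the original statements may differ):

  SELECT countrycode, COUNT(*) AS cnt
  FROM city
  GROUP BY countrycode
  HAVING COUNT(*) > 100
  ORDER BY cnt DESC;

  cut -d',' -f2 city.csv | sort | uniq -c | sort -rn | awk '$1 > 100 {print $2, $1}'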

Another example, this time analyzing unstructured data, is the following: a stopword list containing the 20 most frequent words (case-insensitive) is created from all text documents in the current directory and stored in the file stopwords.lst.

figure b
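One possible pipeline with the described effect (a sketch only; it assumes the text documents carry the extension .txt, and the original listing may differ):

  cat *.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | sort -rn | head -20 | awk '{print $2}' > stopwords.lst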

More examples will be presented and discussed in the corresponding tutorial [4].