
1 Introduction

The aim of this paper, and of the corresponding tutorial presented at ICWE 2019 in Daejeon, Korea, is to present the most useful tools, e.g. cat, grep, tr, sed, awk, comm, uniq, join, split, etc., and to give an introduction on how they can be used together. For example, a wide range of queries that would typically be formulated in SQL can also be answered using the aforementioned tools, as will be shown in the tutorial. Likewise, selective data extraction from different webpages and the recombination of this information (mashups) can easily be performed.

2 Filters and Pipes

The underlying architectural pattern of stream-based data processing is called filters and pipes [1], originally suggested by McIlroy et al. [2]. The general idea is to utilize a set of useful programs (filters) that can be glued together using pipes (loose coupling). The programs themselves are called filters, but besides filtering, all sorts of operations (e.g. sorting, aggregation, substitution, merging, etc.) can be implemented. A typical filter takes its input from standard input (STDIN) and/or one or more files, performs some operation on this input, and generates a result on standard output (STDOUT) or in one or more files. By connecting multiple programs via pipes, the output of one filter acts as the input of the next filter (see Fig. 1). Data transferred through the filters is often in ASCII format (but does not have to be).

Fig. 1. Composing programs with filters and pipes.
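To give an impression of such a composition, a minimal sketch of a pipeline could look as follows (the log file name, its format, and the extracted field are purely illustrative assumptions, not taken from the tutorial):

  cat access.log | grep '404' | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -10

Here cat emits the file, grep keeps only lines containing 404, cut extracts the seventh space-separated field, and the remaining filters count the distinct values, rank them by frequency, and keep the ten most frequent ones.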

Technically, the pipe is represented by the vertical bar character (|). Additionally, it is possible to redirect data from a file to the standard input channel of a filter, as well as to redirect the standard output to a file. This redirection is typically done with the ‘<’ and ‘>’ symbols. Besides STDIN and STDOUT, there is also the standard error output channel (STDERR), which is used in particular for error messages and debug output. The idea of composing complex programs from small, well-defined components allows rapid prototyping, incremental iteration, and easy experimentation [3].
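As a hedged illustration of these redirections (the file and pattern names are assumptions), the following command reads its input from input.txt, writes matching lines to hits.txt, and sends error messages to errors.txt:

  grep 'pattern' < input.txt > hits.txt 2> errors.txt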

3 Classification of Commands

A variety of tools have been developed for analyzing, composing, and transforming data streams and files. These can be classified as follows:

File Inspection:

This category includes programs such as less, head, and tail, which allow users to view and inspect files of any size. While head and tail only show the beginning or the end of a file, less allows the user to browse and search the file interactively. The less command is also an exception in that it is the only program among those presented here that accepts user input.
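A few illustrative invocations (the file names are assumptions, not taken from the tutorial):

  head -n 20 city.csv    # show the first 20 lines
  tail -n 20 city.csv    # show the last 20 lines
  less city.csv          # browse and search interactively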

Filtering:

There exist tools for line- or column-wise filtering of data. If the input is comma- or tab-separated, the tool cut can be used to extract single or multiple columns (projection) and write them to the standard output (awk, which will be described later, also has this ability). The grep, sed, and awk command-line utilities allow line-by-line filtering of the input data. In addition to the typical comparison operators, regular expressions can also be used, which makes these tools very powerful. grep can also be applied to unstructured data (text), where arbitrary patterns can be specified and only the lines matching the pattern are returned.
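A sketch of typical filtering operations (column positions, patterns, and the file name city.csv are assumptions for illustration):

  cut -d',' -f1,3 city.csv             # project the first and third comma-separated column
  grep -i 'germany' city.csv           # keep lines matching a pattern, case-insensitively
  awk -F',' '$4 > 1000000' city.csv    # keep lines whose fourth column exceeds 1,000,000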

Sorting:

The sort command allows the specification of one or more complex sort keys on which an input file is sorted line by line. Internally, sort is implemented as a file-based merge sort, which allows even huge files to be sorted without requiring much main memory. Some other programs, such as comm, join, and uniq, require sorted input, and thus sort is needed quite often.
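For instance, a multi-key sort over a comma-separated file (the column positions and file name are assumptions) can be sketched as:

  sort -t',' -k1,1 -k3,3nr city.csv    # sort by column 1 ascending, then by column 3 numerically descending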

Substitution Commands:

sed and tr support the transformation of the input data. Whereas tr acts on the character level, sed allows the specification of complex substitution rules using regular expressions.
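Two minimal sketches of such substitutions (the patterns are illustrative; input is read from STDIN):

  tr 'A-Z' 'a-z'                 # map upper-case characters to lower case
  sed 's/[0-9][0-9]*/NUM/g'      # replace every digit sequence with the token NUM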

Composition and Splitting:

Operations in this category allow the column- or row-wise composition or splitting of data. Additionally, the join command allows column-wise composition based on identical join fields. Figure 2 compares the functionality of the different operations.

Fig. 2. Composition of input streams/files using cat, split, cut, and paste.
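A hedged sketch of these operations (file names and the join field are assumptions for illustration):

  cat part1.csv part2.csv > all.csv            # row-wise concatenation
  split -l 100000 all.csv chunk_               # split into pieces of 100000 lines each
  paste -d',' names.csv numbers.csv            # column-wise composition
  join -t',' -1 1 -2 1 city.csv country.csv    # join two sorted files on their first columns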

Aggregation:

uniq and wc fall into this category. While wc counts the number of input lines, words, or characters, uniq handles duplicates in a file or stream: it reports or omits repeated adjacent lines. Since only adjacent repetitions are detected, uniq is most frequently used in conjunction with a preceding sort command.
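For instance (names.txt being an illustrative file name):

  sort names.txt | uniq -c | sort -rn    # frequency of each distinct line, most frequent first
  wc -l names.txt                        # number of lines in the file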

Comparison:

diff and comm are representative of this category. Both programs compare two files line by line. While diff reports the differences found, comm requires the two input files to be sorted and works with set semantics: as a result, it reports entries that appear only in the first file, only in the second file, or in both. Using this command, set-based intersection and minus (difference) operations can be implemented.
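Assuming two sorted input files a.txt and b.txt (illustrative names), the set operations can be sketched as:

  comm -12 a.txt b.txt    # intersection: lines occurring in both files
  comm -23 a.txt b.txt    # minus: lines occurring only in the first file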

4 Programs sed and awk

sed stands for Stream Editor. In contrast to a “normal” editor, sed has no interaction with a user; instead, insert, update, substitute, and delete commands must be formulated in a command file or on the command line. This allows the automation of recurring (and also complex) editing tasks. As a very powerful feature, sed allows the specification of addresses (ranges of lines) to which the operations to be performed are restricted. Addresses can be specified as line numbers, strings, regular expressions, or a combination of them. For example, the following command removes all JavaScript sections (from <script to </script>) from an HTML file:

figure a
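A minimal sketch of such a range deletion, assuming the opening and closing tags start on lines of their own (page.html is an illustrative file name; the original listing may differ):

  sed '/<script/,/<\/script>/d' page.html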

awk, in contrast, is a complete programming language, supporting loops, conditionals, arrays, and dictionaries (associative arrays). It is mainly used on structured input data and for generating reports. Both sed and awk support regular expressions.
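As a small, hedged illustration of these language features (the file name, field separator, and column positions are assumptions), the following awk program sums a numeric column per group and prints a report using an associative array:

  awk -F',' '{pop[$2] += $4} END {for (c in pop) print c, pop[c]}' city.csv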

5 Examples

As a more complex example of processing structured data, the re-implementation of a complex SQL select statement using pipes and filters will be shown. Consider as input a relational table (table city), resp. a CSV file (city.csv), consisting of city records. The two statements, compared in Table 1, return all countries which have more than 100 cities in the database, together with the number of related cities. The output is sorted by decreasing number of cities per country.

Table 1. Comparison between SQL and piped UNIX commands
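A hedged sketch of what the two formulations compared in Table 1 might look like (assuming a column countrycode, resp. the second comma-separated column of city.csv without a header line; the original statements may differ):

  SELECT countrycode, COUNT(*) AS cnt
  FROM city
  GROUP BY countrycode
  HAVING COUNT(*) > 100
  ORDER BY cnt DESC;

  cut -d',' -f2 city.csv | sort | uniq -c | sort -rn | awk '$1 > 100 {print $2, $1}'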

Another example, this time analyzing unstructured data, is the following: a stopword list containing the 20 most frequent words (case-insensitive) is created from all text documents in the current directory and stored in the file stopwords.lst.

figure b
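One possible pipeline with the described effect (a sketch only; it assumes the text documents carry the extension .txt, and the original listing may differ):

  cat *.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | sort -rn | head -20 | awk '{print $2}' > stopwords.lst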

More examples will be presented and discussed in the corresponding tutorial [4].