Tuesday, January 26, 2016

Running bash commands in parallel

Introduction


A modern server is typically multi-core, perhaps even multi-CPU. That is plenty of computing power to unleash on a given job. However, a job that runs serially keeps only one core busy at a time, so unless you run it in parallel, most of that power sits idle.


Below are some typical everyday operations we can speed up using parallel computing:

  1. Back up files from multiple source directories to a removable disk.
  2. Resize image files in a directory.
  3. Compress files in a directory.

To execute a job in parallel, you can use any of the following commands:

  • ppss
  • pexec
  • GNU parallel

This post focuses on the GNU parallel command.

Installation of GNU parallel

To install GNU parallel on a Debian/Ubuntu system, run the following command:

$ sudo apt-get install parallel

General Usage

The GNU parallel program provides many options for customizing its behavior; its man page describes them in detail. In this post, I limit the discussion to the following scenario.

My objective is to run a shell command in parallel on a single multi-core machine. The command can take multiple options, but only one of them varies. You run concurrent instances of the command, each with a different value for that one option. The values are fed to GNU parallel, one per line, via the standard input.

The rest of this post shows how GNU parallel can back up multiple source directories by running rsync in parallel.

Parallel backup

The following command backs up 2 directories in parallel: /home/peter and /data.

$ echo -e '/home/peter\n/data' | parallel -j-2 -k --eta rsync -R -av {} /media/myBKUP

Standard input

The echo command assembles the 2 source directory locations, separated by a newline character (\n), and pipes them to GNU parallel.
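
A slightly more portable way to build the same input is printf, because the handling of echo -e differs between shells. The sketch below also adds --dry-run, which makes GNU parallel print each command it would run instead of executing it, so the rsync invocations can be checked before anything is written to the backup disk. The guard around the parallel call is my own addition; it only keeps the snippet harmless on a machine where GNU parallel is not installed.

```shell
# One source directory per line; printf avoids the shell-dependent behavior of echo -e.
dirs=$(printf '%s\n' /home/peter /data)

# Preview only: --dry-run prints the generated commands and runs nothing.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  preview=$(printf '%s\n' "$dirs" | parallel -k --dry-run rsync -R -av {} /media/myBKUP)
  printf '%s\n' "$preview"
fi
```

Once the printed commands look right, dropping --dry-run runs the real backup.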

How many jobs?

By default, GNU parallel deploys 1 job per core. You can override the default using the -j option.

-j specifies the maximum number of parallel jobs that GNU parallel can deploy. The maximum number can be specified in 1 of several ways:

  • -j followed by a number

    -j2 means that up to 2 jobs can run in parallel.

  • -j+ followed by a number

    -j+2 means that the maximum number of jobs is the number of cores plus 2.

  • -j- followed by a number

    -j-2 means that the maximum number of jobs is the number of cores minus 2.
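
The arithmetic behind the three forms can be sketched in plain shell. effective_jobs is a hypothetical helper written for this post, not part of GNU parallel, and it only models the three forms listed above (not other forms such as -j0 or percentages). The clamp to a minimum of 1 reflects my reading of the man page, so treat it as an assumption.

```shell
# Hypothetical helper: print the job limit that a -j value yields for a given core count.
# Usage: effective_jobs <value after -j> <number of cores>
effective_jobs() {
  spec=$1
  cores=$2
  case $spec in
    +*) n=$((cores + ${spec#+})) ;;   # -j+N: cores plus N
    -*) n=$((cores - ${spec#-})) ;;   # -j-N: cores minus N
    *)  n=$spec ;;                    # -jN: exactly N
  esac
  # Assumption: a result below 1 is treated as 1.
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}

effective_jobs 2 8    # -j2  on 8 cores
effective_jobs +2 8   # -j+2 on 8 cores
effective_jobs -2 8   # -j-2 on 8 cores
effective_jobs -2 2   # -j-2 on 2 cores
```

On the 8-core machine used in this post, the -j-2 in the backup command therefore allows up to 6 concurrent rsync jobs, although only 2 run because there are only 2 input lines.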

If you don't know how many cores the machine has, run the command below:

$ parallel --number-of-cores
8

Keeping output order

Each job may output lines to the standard output. When multiple jobs are run in parallel, the default behavior is that a job's output is displayed as soon as the job finishes. You may find this confusing because the output order may be different from the input order. The -k option keeps the output sequence the same as the input sequence.
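
A small experiment makes the difference visible. In the sketch below, each input line is a sleep duration, so the jobs finish in a different order than they were started. The sleep/echo command is just a stand-in for a real job, and the guard is my own addition so that the snippet does nothing on a machine without GNU parallel.

```shell
# Inputs are sleep durations, so the jobs finish in a different order than they start.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # Without -k: lines appear in completion order (typically 0, 1, 2).
  printf '%s\n' 2 0 1 | parallel -j3 'sleep {}; echo {}'

  # With -k: lines appear in input order (2, 0, 1), even though job "2" finishes last.
  ordered=$(printf '%s\n' 2 0 1 | parallel -j3 -k 'sleep {}; echo {}')
  printf '%s\n' "$ordered"
fi
```

Note that -k only affects the order in which output is printed; the jobs still run concurrently.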

Showing progress

The --eta option reports progress while GNU parallel executes, including the estimated remaining time (in seconds).

Input place-holder

GNU parallel replaces the {} placeholder with the next line from the standard input.

Each input line is a directory location, e.g., /home/peter. Instead of the full location, you can specify other parameters in order to extract a portion thereof - e.g., the directory name (/home) and the basename (peter). Please refer to the man page for details.
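
As a quick illustration, GNU parallel's {//} and {/} replacement strings yield the directory name and the basename of an input line, much like the dirname and basename commands do. The sketch below shows both side by side; the variable names and the guard around the parallel call are my own, added so the snippet also runs where GNU parallel is not installed.

```shell
path=/home/peter

# Plain-shell equivalents of the two portions.
dir=$(dirname "$path")     # /home
base=$(basename "$path")   # peter
echo "dir=$dir base=$base"

# The same split done by GNU parallel: {} is the full line, {//} the directory, {/} the basename.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  split=$(printf '%s\n' "$path" | parallel echo 'full={} dir={//} base={/}')
  printf '%s\n' "$split"
fi
```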

Summary

GNU parallel is a tool that Linux administrators should add to their repertoire. Running a job in parallel can shorten its total run time considerably, provided the work is not bottlenecked on a single shared resource such as one disk. If you are already familiar with xargs, you will find the syntax familiar. Even if you are new to the command, there is a wealth of online help on the GNU parallel website.