Introduction
A modern server is typically multi-core, perhaps even multi-CPU. That is plenty of computing power to unleash on a given job. However, unless you run a job in parallel, you are not maximizing the use of all that power.
Below are some typical everyday operations we can speed up using parallel computing:
- Back up files from multiple source directories to a removable disk.
- Resize image files in a directory.
- Compress files in a directory.
To execute a job in parallel, you can use any of the following commands:
- ppss
- pexec
- GNU parallel
This post focuses on the GNU parallel command.
Installation of GNU parallel
To install GNU parallel on a Debian/Ubuntu system, run the following command:
$ sudo apt-get install parallel
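After installing, it may be worth confirming that the parallel command on your PATH is really GNU parallel, since an unrelated program could be installed under the same name. The snippet below is a minimal sketch, assuming that GNU parallel's --version banner starts with the words "GNU parallel"; the status variable and its three values are names made up for this illustration.

```shell
#!/bin/sh
# Classify whatever "parallel" resolves to on this machine.
# "status" and its values (missing, gnu, other) are illustrative names only.
if ! command -v parallel >/dev/null 2>&1; then
    status="missing"
elif parallel --version </dev/null 2>/dev/null | head -n 1 | grep -q '^GNU parallel'; then
    status="gnu"
else
    status="other"
fi
echo "parallel: $status"
```

If the result is anything but gnu, the examples in the rest of this post may not behave as described.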
General Usage
The GNU parallel program provides many options for customizing its behavior. Interested readers can consult its man page to learn how each option is used. In this post, I will narrow the use of GNU parallel to the following scenario.
My objective is to run a shell command in parallel on a single multi-core machine. The command can take multiple options, but only one of them varies. Specifically, you run concurrent instances of the command, each with a different value for that one option. The different values are fed to GNU parallel via standard input, one per line.
The rest of this post shows how GNU parallel can back up multiple source directories by running rsync in parallel.
Parallel backup
The following command backs up 2 directories in parallel: /home/peter and /data.
$ echo -e '/home/peter\n/data' | parallel -j-2 -k --eta rsync -R -av {} /media/myBKUP
Standard input
The echo command assembles the 2 source directory locations, separated by a newline character (\n), and pipes them to GNU parallel.
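Standard input does not have to come from echo. As a minimal sketch, the same directory list can be kept in a file and redirected into GNU parallel. The example uses a temporary file created by mktemp (a choice made for this illustration) and adds --dry-run, which prints the commands GNU parallel would run instead of running them, so nothing is copied.

```shell
#!/bin/sh
# Hypothetical list file: one source directory per line.
list=$(mktemp)
printf '%s\n' /home/peter /data > "$list"

# --dry-run only prints the generated commands; rsync is never started.
# -k keeps the printed commands in the same order as the input lines.
preview=$(parallel --dry-run -k rsync -R -av {} /media/myBKUP < "$list")
echo "$preview"
# rsync -R -av /home/peter /media/myBKUP
# rsync -R -av /data /media/myBKUP

rm -f "$list"
```

Previewing the commands this way is a cheap check before letting rsync loose on a real backup disk.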
How many jobs?
By default, GNU parallel deploys 1 job per core. You can override the default using the -j option.
-j specifies the maximum number of parallel jobs that GNU parallel can deploy. The maximum number can be specified in one of several ways:
- -j followed by a number: -j2 means that up to 2 jobs can run in parallel.
- -j+ followed by a number: -j+2 means that the maximum number of jobs is the number of cores plus 2.
- -j- followed by a number: -j-2 means that the maximum number of jobs is the number of cores minus 2.
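The arithmetic behind the three forms can be sketched in a few lines of shell. Note that effective_jobs is a hypothetical helper written for this post, not part of GNU parallel, and it covers only the three forms listed above. It also assumes the documented floor: when the subtraction yields less than 1, GNU parallel still runs 1 job.

```shell
#!/bin/sh
# effective_jobs is a hypothetical helper, not a GNU parallel command.
# Usage: effective_jobs CORES SPEC
#   CORES is the number of cores; SPEC is the text after -j (e.g. 2, +2, -2).
effective_jobs() {
    cores=$1
    spec=$2
    case $spec in
        +*) n=$((cores + ${spec#+})) ;;   # -j+N: cores plus N
        -*) n=$((cores - ${spec#-})) ;;   # -j-N: cores minus N
        *)  n=$spec ;;                    # -jN:  exactly N
    esac
    # Never go below one job.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

effective_jobs 8 2     # prints 2
effective_jobs 8 +2    # prints 10
effective_jobs 8 -2    # prints 6
effective_jobs 2 -2    # prints 1
```

On the 8-core machine used in this post, -j-2 therefore allows up to 6 concurrent rsync jobs, which is more than enough for 2 directories.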
If you don't know how many cores the machine has, run the command below:
$ parallel --number-of-cores
8
Keeping output order
Each job may output lines to the standard output. When multiple jobs are run in parallel, the default behavior is that a job's output is displayed as soon as the job finishes. You may find this confusing because the output order may be different from the input order. The -k option keeps the output sequence the same as the input sequence.
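A small experiment makes the effect of -k visible. In this sketch the input lines are sleep durations in seconds (an arrangement invented for the demonstration), so the second job finishes before the first. With -k, the output still follows the input order.

```shell
#!/bin/sh
# Each input line is a sleep duration, so the job for "1" finishes first.
# -j2 lets both jobs run at once, even on a single-core machine.
ordered=$(printf '%s\n' 2 1 | parallel -j2 -k 'sleep {}; echo "slept {}"')
echo "$ordered"
# slept 2
# slept 1
```

Dropping -k from the command should print the two lines in completion order instead, with "slept 1" first.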
Showing progress
The --eta option reports progress while GNU parallel executes, including the estimated remaining time (in seconds).
Input place-holder
GNU parallel replaces the {} placeholder with the next line from the standard input.
Each input line is a directory location, e.g., /home/peter. Instead of the full location, you can specify other placeholders to extract a portion of it, e.g., the directory name (/home) and the basename (peter). Please refer to the man page for details.
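As a quick illustration of those partial placeholders, the sketch below uses {//} for the directory part and {/} for the basename, as described in the GNU parallel man page. The labels full=, dir= and base= are just text chosen for this example.

```shell
#!/bin/sh
# {}   is the whole input line
# {//} is the directory part of the line
# {/}  is the basename of the line
parts=$(echo /home/peter | parallel echo 'full={} dir={//} base={/}')
echo "$parts"
# full=/home/peter dir=/home base=peter
```

The same placeholders can be used anywhere in the command line, for instance to build a per-directory destination for rsync.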
Summary
GNU parallel is a tool that Linux administrators should add to their repertoire. Running independent jobs in parallel can substantially improve one's efficiency on a multi-core machine. If you are already familiar with xargs, you will find the syntax familiar. Even if you are new to the command, there is a wealth of online help on the GNU parallel website.
1 comment:
Hi Peter,
Is this supposed to be faster than just running two rsync instances in two separate terminals? Just wondering why I should really want to use this...