Thursday, April 17, 2008

Use sed or perl to extract every nth line in a text file

I recently blogged about the use of sed to extract lines in a text file.

As examples, I showed some simple cases of using sed to extract a single line and a block of lines in a file.

An anonymous reader asked how one would extract every nth line from a large file.

Suppose somefile contains the following lines:
$ cat > somefile
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10

Below, I show two ways to extract every 4th line (lines 4 and 8) from somefile.
  1. sed
    $ sed -n '0~4p' somefile
    line 4
    line 8

    0~4 is a first~step address (a GNU sed extension): it selects every 4th line, counting from line 0.

    There is no line 0, so the first printed line is line 4.

    -n suppresses automatic printing, so only lines explicitly printed with p appear in the output.

  2. perl
    $ perl -ne 'print ((0 == $. % 4) ? $_ : "")' somefile
    line 4
    line 8

    $. is the current input line number.

    % is the remainder operator.

    $_ is the current line.

    The above perl statement prints a line only if its line number
    is evenly divisible by 4 (remainder = 0).

    An equivalent, shorter form uses unless:

    $ perl -ne 'print unless (0 != $. % 4)' somefile
    line 4
    line 8



tsilver said...

Thank you. I don't use SED enough and this was a good reminder.

Anonymous said...

Note that your last perl example (already much more readable than the 1st) can be further simplified to

perl -ne 'print unless ($. % 4)' somefile

Since in perl, 0 is false in a boolean context, the "0 != " test is redundant.

Anonymous said...

I am sure the author knows that. Adding "0 != " adds clarity to the code and makes it readable, and it doesn't cost any extra machine cycles FYI!

Anonymous said...

I am currently using exactly what you suggest in your sed example. My problem is that my file is quite large: almost 5 million lines. I also need certain blocks of lines, e.g. every other set of, say, 10 lines. So I wrote a bash script for it, but it is taking a very long time. I wonder whether that is because, although -n suppresses the output of most lines, sed still traverses them all. I don't know if this is true.
In any case, would you be able to suggest a more efficient way of doing what I am trying to do?

Gopi said...

sed -n '3~3p'

The above command is not working. It says: Unrecognized command:3~3P

Can you please help me with this?

Unknown said...

Gopi, it is working for me.

[root@dachis-centos ~]# sed -n '3~3p'
for i in {1..3}