Saturday, April 19, 2008

Extracting columns and fields from a text file

I have previously posted about extracting lines from a text file ([1], [2]).

Enough about lines for now. Let's turn our attention to extracting columns and delimited fields from a text file. For instance, one task is to extract columns 5 to 7 (character positions) of every line in a file. Sometimes, the data you want reside in variable-length fields that are delimited by some character, say ",". A sample task is to extract the second field of each line in a comma-delimited file.

As usual, there is more than one way to accomplish these tasks. The tools that we will use are cut, awk, and perl.

The sample text file is somefile, created here by typing four lines into cat and ending the input with Ctrl-D.

$ cat > somefile
1234567890
1234567890
1234567890
1234567890

To extract fixed columns (say columns 5-7 of a file):

$ cut -c5-7 somefile
567
567
567
567

$ perl -pe '$_ = substr($_, 4, 3) . "\n"' somefile
567
567
567
567

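The current line ($_) is replaced with substr($_, 4, 3), the 3-character substring starting at offset 4. Because perl offsets are 0-based, offset 4 is the 5th column. The "\n" is appended because the substring no longer includes the line's original newline.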
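awk, one of the tools listed earlier, can also pull out fixed columns. The following is a minimal sketch, assuming the same four-line somefile (recreated here with printf so the snippet runs on its own). Note that awk's substr() counts from 1, unlike perl's, so the start position is 5 rather than 4.

```shell
# Recreate the sample file used in this post.
printf '1234567890\n1234567890\n1234567890\n1234567890\n' > somefile

# awk's substr() is 1-based: start at column 5 and take 3 characters.
# $0 is the whole current line; print adds the trailing newline.
awk '{ print substr($0, 5, 3) }' somefile
```

Each of the four lines should come out as 567, matching the cut and perl results.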

To illustrate extracting a particular field, let's use /etc/passwd, a colon-delimited file. Say we want to extract the 6th field (each user's home directory).

$ cut -d: -f6 /etc/passwd

$ awk -F : '{print $6}' /etc/passwd

$ perl -p -e '$_ = (split(/[:\n]/))[5] . "\n"' /etc/passwd

Here, I used the split function to separate out the fields delimited by a colon or a newline. The output of split is a list, and we assign the element at index 5 (the 6th field, since perl is 0-based) to the current line, followed by a newline. Including \n in the delimiter class [:\n] is necessary; otherwise, when you extract the last field, it keeps the line's own newline and the output ends up with an extra blank line.

If you know of a simpler way to do this, please share it with us in the comments.

1 comment:

Anonymous said...

It's a bikeshed, but the shorter, slightly more readable perl expression is:
perl -F: -anle 'print $F[5]' /etc/passwd
and
perl -F'|' -anle 'print @F[4..6]'