Saturday, February 23, 2013

Splitting up is easy for a PDF file

Occasionally, I need to extract some pages from a multi-page pdf document. Suppose you have a 6-page pdf document named myoldfile.pdf, and you want a new pdf file, mynewfile.pdf, containing only pages 1, 2, 4, and 5 from myoldfile.pdf.

I did exactly that using pdftk, a command-line tool.

If pdftk is not already installed, install it like this on a Debian- or Ubuntu-based computer:

$ sudo apt-get update
$ sudo apt-get install pdftk

Then, to make a new pdf with just pages 1, 2, 4, and 5 from the old pdf, do this:

$ pdftk myoldfile.pdf cat 1 2 4 5 output mynewfile.pdf

Note that cat and output are special pdftk keywords. cat specifies the operation to perform on the input file. output signals that what follows is the name of the output pdf file.

You can specify page ranges like this:

$ pdftk myoldfile.pdf cat 1-2 4-5 output mynewfile.pdf
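When a document is long, it can be easier to say which pages to drop than to type the ranges to keep. The following is only a sketch of that idea: ranges_without is a hypothetical bash helper (it is not part of pdftk) that takes the total page count followed by the pages to leave out, and prints the ranges to hand to cat.

```shell
#!/bin/bash
# Hypothetical helper (not part of pdftk): print the page ranges that keep
# every page of a document with $1 pages except the pages listed after it.
ranges_without() {
    local total=$1; shift
    local skip=" $* "        # space-padded list of pages to drop
    local start="" out="" p
    # Walk one step past the last page so a run ending there gets closed.
    for ((p = 1; p <= total + 1; p++)); do
        if ((p <= total)) && [[ $skip != *" $p "* ]]; then
            # Page is kept: open a run if none is open yet.
            if [[ -z $start ]]; then start=$p; fi
        elif [[ -n $start ]]; then
            # The run ended on the previous page: emit it.
            if ((start == p - 1)); then
                out+="$start "
            else
                out+="$start-$((p - 1)) "
            fi
            start=""
        fi
    done
    echo "${out% }"
}

ranges_without 6 3 6    # prints: 1-2 4-5

# Possible use (left unquoted on purpose so each range is a separate word):
#   pdftk myoldfile.pdf cat $(ranges_without 6 3 6) output mynewfile.pdf
```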

pdftk has a few more tricks in its back pocket. For example, you can specify a burst operation to split each page in the input file into a separate output file.

$ pdftk myoldfile.pdf burst 

By default, the output files are named pg_0001.pdf, pg_0002.pdf, etc.
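The default names follow a printf-style pattern, pg_%04d.pdf. As a small illustration, the hypothetical burst_name function below reproduces that pattern in bash so a script can work out which file holds a given page. The closing comment shows how the same kind of pattern can be passed to burst through output; the name page_%02d.pdf there is just an example.

```shell
#!/bin/bash
# Hypothetical helper: print the name burst gives to page $1 by default.
burst_name() {
    printf 'pg_%04d.pdf\n' "$1"
}

burst_name 4    # prints: pg_0004.pdf

# burst also accepts a printf-style pattern after output, for example:
#   pdftk myoldfile.pdf burst output page_%02d.pdf
```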

pdftk is also capable of merging multiple pdf files into one pdf.

$ pdftk pg_0001.pdf pg_0002.pdf pg_0004.pdf pg_0005.pdf output mynewfile.pdf 

That would merge the files corresponding to the first, second, fourth and fifth pages into a single output pdf.
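For more than a handful of pages, typing every file name gets tedious. Here is a rough bash sketch that builds the list from page numbers instead. It assumes the files were produced by burst with the default names, names the cat operation explicitly, and runs pdftk only if it is installed and every page file is present. The variable names pages and files are mine.

```shell
#!/bin/bash
# Pages to keep; the files are assumed to come from a default burst run.
pages=(1 2 4 5)
files=()
for p in "${pages[@]}"; do
    files+=("$(printf 'pg_%04d.pdf' "$p")")
done

# Show the list that will be handed to pdftk.
echo "${files[@]}"

# Merge only if pdftk is installed and every page file exists.
if command -v pdftk >/dev/null 2>&1; then
    missing=0
    for f in "${files[@]}"; do
        if [ ! -f "$f" ]; then missing=1; fi
    done
    if [ "$missing" -eq 0 ]; then
        pdftk "${files[@]}" cat output mynewfile.pdf
    fi
fi
```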

If you know of another easy way to split up pages from a pdf file, please tell us in a comment. Much appreciated.

Two updates (part 2, part 3) are available for this post.


Anand Reddy Pandikunta said...

Oh man... great tutorial. Thank you. keep posting!!

Tarik's Blog said...

Thanks! Straight to the point. Viva Linux!

Anonymous said...

Tried to get free pdf split and merge programs for windows and got warnings from my antivirus that aborted installation.

Linux does it so neatly. Thanks for the excellent post!

Anonymous said...

Great tip. Thanks.

Anonymous said...

thanks for this blog entry. it has proved very useful.

Richard Gravois said...

I split bigfile into pages.
It seems that a big watermark "Sample" shows up in Safari and chrome but not other browsers (mozilla, IE). The watermark is not in bigfile.
What switch adds the watermark?

Anonymous said...

what's the difference from printing to a pdf file and selecting only the desired pages?

ChucklingMcArseoff said...

pdftk looks like a pretty neat tool indeed, but if all you're trying to accomplish is splitting a PDF into separate files per page, then you can just open the PDF in Evince (or your favorite PDF viewer capable of printing) and select File > Print... and tell the print dialog which pages you want then select "Print to file".

Nazim Aghabayov said...

Thanks dude! Your reference is really helpful. I scripted a small file to split a pdf every several pages


#!/bin/bash
#let is a bash builtin, so run this with bash, not plain sh

#first arg is a file name
file=$1

#second argument is pages per file
ppd=$2

pagecount=$(pdfinfo -- "$file" 2> /dev/null | awk '$1 == "Pages:" {print $2}')

echo "document $file has $pagecount pages"
echo "splitting per $ppd pages"

#current page, number of the next output file, last page already written
currentp=1
secn=1
last=0

while [ "$currentp" -le "$pagecount" ]; do

let modl=$currentp%$ppd

#a full group of $ppd pages ends on this page
if [ 0 -eq $modl ]; then
let pbeginning=$currentp-$ppd+1
let pend=$currentp
echo " $pbeginning $pend"
pdftk "$file" cat $pbeginning-$pend output "$file"_"$secn".pdf
let last=$currentp
let secn=$secn+1
fi

#last page: write any leftover pages that did not fill a whole group
if [ $currentp -eq $pagecount ]; then
if [ $last -ne $currentp ]; then
let pbeginning=$last+1
let pend=$currentp
echo "last: $pbeginning $pend"
pdftk "$file" cat $pbeginning-$pend output "$file"_"$secn".pdf
fi
fi

let currentp=$currentp+1

done


Harris Webb said...

Thanks a lot for sharing this. Besides, I found this PDF split resource, but I'm not sure whether it supports Linux?

JR said...

hi! i'm Jose, from Spain

i have tried the Nazim Aghabayov script, but it's like there is a bug...
i saved the script as, and this is what is shown 18: let: not found 20: [: -eq: argument expected 40: let: not found

as far as i know, the message of line 18 is about
let modl=$currentp%$ppd
and the message of line 20 is indeed about $modl

can anybody see where the bug is, if any?

thanks a lot, guys

Anonymous said...

very useful for breaking up pdf books, thanks!

monarch a sadist said...

thanks man, helped a lot... i owe u at least a thanks

Zz Lorreta said...

Here is the link for splitting a pdf document. Hope this gives you a start for your pdf file program on the rasteredge page

Anonymous said...

JUST realized that closing the left side pane containing the thumbnails of each page in the PDF allows for the file to scroll 98-99% smoothly.

Stumbled upon the solution as I was printing PDF files with regards to page ranges and chapters in order to split the book up into smaller file sizes, which was working very goooood too by the way. But simply closing the left side thumb-nails is a lot less work :)

Akom said...

I had to write a script to split the original PDF into pages in order to allow tesseract and imagemagick to handle it without running out of memory, and to overcome the TIFF with alpha channel issues (spp not in set {1,3,4})

Script and write-up are here:

Thanks for the starting point!