Wednesday, January 29, 2014

How to split up PDF files - part 2

In an earlier post, I used the pdftk tool to extract pages from a pdf file. I had no reason to investigate alternative solutions until I encountered the following problem.

I had to extract the first 4 pages of a pdf document. The normally reliable pdftk command generated a Java exception.

$ pdftk T4.pdf cat 1-4 output outputT4.pdf
Unhandled Java Exception:
Unhandled Java Exception:
   at gnu.gcj.runtime.NameFinder.lookup(
   at java.lang.Throwable.getStackTrace(
   at java.lang.Throwable.stackTraceString(
   at java.lang.Throwable.printStackTrace(
   at java.lang.Throwable.printStackTrace(

To troubleshoot the problem, I ran the same pdftk command against a different input pdf file. It worked just fine, so the problem appeared to lie with that specific input file rather than with pdftk in general.

At that point, I started looking for an alternative tool.

gs, aka Ghostscript, is an interpreter and previewer for PostScript and PDF files.

You can direct gs output to various output devices using the -sDEVICE parameter. The pdfwrite device writes the output in PDF file format.

The page range to extract is defined by the -dFirstPage and -dLastPage parameters. The name of the output file is specified using the -sOutputFile parameter. The -dNOPAUSE and -dBATCH parameters make gs run without prompting after each page and exit when it is done.

$ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dFirstPage=1 \
  -dLastPage=4 -sOutputFile=outputT4.pdf T4.pdf
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
Processing pages 1 through 4.
Page 1
Loading NimbusSanL-Regu font from /usr/share/fonts/type1/gsfonts/n019003l.pfb... 4287624 2669241 2475832 1154775 3 done.
Loading NimbusSanL-Bold font from /usr/share/fonts/type1/gsfonts/n019004l.pfb... 4328616 2778664 2516200 1192102 3 done.
Loading NimbusMonL-Regu font from /usr/share/fonts/type1/gsfonts/n022003l.pfb... 4371912 2946486 2677672 1350807 3 done.
Page 2
Loading NimbusSanL-BoldItal font from /usr/share/fonts/type1/gsfonts/n019024l.pfb... 4431472 2877228 2738224 1120988 3 done.
Loading NimbusSanL-ReguItal font from /usr/share/fonts/type1/gsfonts/n019023l.pfb... 4471488 2998784 2758408 1209901 3 done.
Page 3
Page 4
**** This file had errors that were repaired or ignored.
**** The file was produced by: 
**** >>>> iText 1.4.5 (by <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

The above output messages provided a clue as to why the input pdf file was problematic: the file does not "conform to Adobe's published PDF specification." To its credit, gs "repaired or ignored" the errors and went on to extract the pages successfully. In this particular case, gs proved more error-tolerant than its counterpart, pdftk.
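
If you hit this kind of file regularly, the two commands can be combined so that gs is only used when pdftk fails. The following is a minimal sketch, assuming both tools report failure through a non-zero exit status. The function name extract_pages and its argument order are made up for illustration; they are not part of pdftk or Ghostscript.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftk or Ghostscript).
# Usage: extract_pages INPUT.pdf FIRST LAST OUTPUT.pdf
# Prints the name of the tool that produced the output, or "none".
extract_pages() {
    src=$1
    first=$2
    last=$3
    dst=$4

    # Try pdftk first; hide its stack trace if it fails.
    if command -v pdftk >/dev/null 2>&1 &&
       pdftk "$src" cat "$first-$last" output "$dst" >/dev/null 2>&1; then
        echo "pdftk"
        return 0
    fi

    # Fall back to gs with the same parameters as the command above.
    # -q only suppresses the startup banner.
    if command -v gs >/dev/null 2>&1 &&
       gs -q -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
          -dFirstPage="$first" -dLastPage="$last" \
          -sOutputFile="$dst" "$src" >/dev/null 2>&1; then
        echo "gs"
        return 0
    fi

    # Neither tool is installed, or both failed on this file.
    echo "none"
    return 1
}
```

With the file from this post, the call would be extract_pages T4.pdf 1 4 outputT4.pdf. Note that the sketch throws away the gs warnings, so you lose the hint about the non-conforming file; drop the redirection on the gs line if you want to see them.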

P.S. You can also use ImageMagick to divide pdf files. See my post.


Anonymous said...

thanks, you saved my day with your idea to replace pdftk by gs

Dominic Raferd said...

Thanks for your tip about pdftk. I am now using it so that people in our office can easily split up multi-page pdf documents, which have been created by our scanner, into individual page-by-page documents. It will save them a lot of time.

Michael said...

Thanks a lot. I've been having issues with an upload/download limit per file for a while now on a specific server. With this, I can split and merge the files while keeping each file under the limit.

Tomas said...

Thanks for this. I found this which would make things even more convenient:
-o Option