Monday, March 17, 2008

Converting PDF to bitmaps

I've been trying to use ImageMagick's "convert" utility to convert a PDF image into constituent bitmap images -- for OCR and other purposes. If I just run:
convert file.pdf file.tiff
I end up with a really large mostly blank page with a very small bitmap in the lower left corner. That tiny bitmap is also very blocky when the size is blown up. So here is how to fix this problem:

The Difficult Way
  1. Use the -density option to specify the DPI of the resulting bitmap. This will get rid of the blockiness, but will also make the entire image very large -- including the large blank areas.
  2. Use the -crop option to crop out the tiny portion of the image that we actually want. We'll need to do a little math to get the exact numbers. The origin for the page is the top left corner. Positive numbers shift right and down. So we need to compute the Y offset to get just the bottom of the page. I haven't seen a way to reorient the coordinate system of the page to make this easier.
In my example, the original image was 2900x3800. You can use ImageMagick's "identify" utility to find out what the native image sizes are within the PDF. When I used convert -density 300 without any cropping, the resulting image size was 12083x15833. I just want the lower 3800 pixels, so I have to offset by 15833-3800, or 12033 pixels. So the conversion command is:
convert -density 300 -crop 2900x3800+0+12033 file.pdf target.tif
Well, I tried this, and it doesn't work exactly, but it is close. It seems that the conversion is using some scaling factor that I can't find. Fortunately, there is a better way.
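As an aside, the sizes quoted above are consistent with a scale of 300/72, on the assumption that the 2900x3800 figure was reported at ImageMagick's default density of 72 DPI. That is only a guess at the scaling factor, not something verified here. A minimal sketch of the arithmetic in shell, with variable names made up for illustration:

```shell
# Numbers from the example above; the 72 DPI default is an assumption.
native_width=2900
native_height=3800
default_dpi=72
target_dpi=300

# Page size after rasterizing at the target density (integer arithmetic).
scaled_width=$((native_width * target_dpi / default_dpi))
scaled_height=$((native_height * target_dpi / default_dpi))

# Y offset needed to keep only the bottom native_height pixels.
y_offset=$((scaled_height - native_height))

# Geometry string in the WIDTHxHEIGHT+X+Y form that -crop expects.
geometry="${native_width}x${native_height}+0+${y_offset}"

echo "scaled page: ${scaled_width}x${scaled_height}"
echo "crop geometry: ${geometry}"
```

If the guess about the scale is right, the region of interest would also grow by the same factor, which might explain why a fixed 2900x3800 crop only comes close.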

The Easy Way

Use the -density option along with the -page option to specify page size. This crops the page correctly if you use a page size (e.g., letter, A4, etc.) whose aspect ratio matches your page. In this case, the final conversion command is:
convert -density 300 -page letter file.pdf target.tif
We can also use the "-compress lzw" option to compress the file when we are using TIFF for our bitmap format.
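Putting the options together might look like the following. This is a sketch rather than a tested recipe: the file names are the placeholders used above, the check for the tool and the input file is my own addition, and placing -compress after the input file is just one arrangement.

```shell
# Placeholder file names, matching the examples above.
input="file.pdf"
output="target.tif"

# Only attempt the conversion if ImageMagick and the input are present.
if command -v convert >/dev/null 2>&1 && [ -f "$input" ]; then
    # 300 DPI rendering, letter page size, LZW-compressed TIFF output.
    convert -density 300 -page letter "$input" -compress lzw "$output"
else
    echo "skipping: need ImageMagick's convert and $input" >&2
fi
```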