Monday, March 31, 2008

Batch extraction and conversion of PDF to txt and jpg

I found this great open source tool to extract the text and images from a PDF file. The tool provides a set of command line utilities to extract various components from the PDF. There are Linux and Win32 versions of the tool.

The command line is very easy to use.
Extract Text
pdftotext [options] <PDF-file> [<text-file>]

Options include:
-layout : maintain original physical layout
-htmlmeta : generate a simple HTML file, including the meta information
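For the batch case, a small shell loop around pdftotext is usually enough. The following is a minimal sketch, not part of the tool itself: the function name extract_text_all and the input/output directory arguments are made up for illustration, and it assumes pdftotext is on the PATH.

```shell
# Command to run; override PDFTOTEXT (e.g. with "echo") for a dry run.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"

# extract_text_all IN_DIR OUT_DIR  (hypothetical helper)
# Writes OUT_DIR/<name>.txt for every IN_DIR/<name>.pdf.
extract_text_all() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue            # glob matched nothing
        name=$(basename "$pdf" .pdf)
        # -layout keeps the original physical layout of the page
        "$PDFTOTEXT" -layout "$pdf" "$out_dir/$name.txt"
    done
}
```

After sourcing the snippet, something like extract_text_all ./pdfs ./txt would process every PDF in ./pdfs. Dropping -layout gives plain reading-order text instead.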

Extract Images
pdfimages [options] <PDF-file> <image-root>

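This extracts all the images as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files. JPEG files are written only when the -j option is given, and only for images that are already stored in the PDF in DCT (JPEG) format. The tool doesn't convert any PPM or PBM files into JPG, so you'll need a separate utility for that.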
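One way to get JPG files for everything is to run pdfimages with -j and then convert whatever PPM or PBM files are left over. The sketch below assumes ImageMagick's convert command as the separate utility; that choice is mine, not something the tool ships with, and any PPM-to-JPEG converter would do. The function name extract_images_as_jpg is also hypothetical.

```shell
# Commands to run; override with "echo" for a dry run.
PDFIMAGES="${PDFIMAGES:-pdfimages}"
CONVERT="${CONVERT:-convert}"        # assumption: ImageMagick is installed

# extract_images_as_jpg PDF_FILE OUT_DIR  (hypothetical helper)
extract_images_as_jpg() {
    pdf="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    root="$out_dir/$(basename "$pdf" .pdf)"
    # -j saves DCT-encoded images directly as .jpg; the rest become .ppm/.pbm
    "$PDFIMAGES" -j "$pdf" "$root"
    # pdfimages names its output <image-root>-NNN.<ext>
    for img in "$root"-*.ppm "$root"-*.pbm; do
        [ -e "$img" ] || continue    # glob matched nothing
        "$CONVERT" "$img" "${img%.*}.jpg"
    done
}
```

Calling extract_images_as_jpg report.pdf ./images would leave report-000.jpg, report-001.jpg and so on in ./images, alongside the original PPM/PBM files, which can be deleted once the conversion looks right.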
