[consulting] From Highly Formatted Print to Web -- Best Solution?

Ted ted-drupalists at webfirst.com
Thu Apr 7 18:01:29 UTC 2011


On 4/7/2011 1:06 PM, Shai Gluskin wrote:
> Here is my idea that I want feedback on: They should take a screen 
> shot of the whole page (the pages are small, like 5in. x 7in.) for 
> each day and upload the image. In order to get search-engine traffic 
> for those pages, I thought they should just copy and paste the text 
> from the pdf into a text field, losing all the formatting. Using css, 
> I'll make sure the text field displays off-screen. This way they get 
> the cleanest/easiest data entry and best-looking presentation while 
> still allowing search engines to drive traffic to the page based on 
> the text contents of a field that will be offscreen.
>
> Does that make sense?
>
> Any other ideas?
>

We've used a PDF-based toolchain to emulate the Google Docs PDF 
Quick View function, including "highlighting" and copying text.

pdftoxml (http://pdftoxml.sourceforge.net/) - Gives you the coordinates 
and dimensions of each piece of text
pdftoppm - Exports each page to a PNG image (with appropriate arguments)
convert - ImageMagick tool to convert PNGs to indexed format, make 
thumbnails, etc.
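
As a rough sketch of the image half of that toolchain, the Python below 
only builds the command lines. The flags are ones I believe exist 
(pdftoppm -png and -r, ImageMagick's -colors, -thumbnail and the PNG8: 
output prefix), but check them against your installed versions. The 
helper names, file names, DPI and thumbnail width are made up for 
illustration.

```python
import subprocess


def pdftoppm_cmd(pdf, out_root, dpi=100):
    """Command to render every page of `pdf` to PNG files named after `out_root`."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out_root)]


def indexed_cmd(png, out_png, colors=256):
    """Command to reduce a PNG to an indexed (palette) PNG with ImageMagick."""
    return ["convert", str(png), "-colors", str(colors), f"PNG8:{out_png}"]


def thumbnail_cmd(png, out_png, width=150):
    """Command to make a thumbnail `width` pixels wide, keeping aspect ratio."""
    return ["convert", str(png), "-thumbnail", f"{width}x", str(out_png)]


def run(cmd):
    """Run one command and raise if the tool exits with an error."""
    subprocess.run(cmd, check=True)
```

Something like run(pdftoppm_cmd("book.pdf", "page")) would then write one 
PNG per page. The exact output file names depend on the page count, so 
glob for them rather than guessing.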

You'll most likely want to use some XSLT to cut down the size of the XML 
file from pdftoxml. These tools can be used to automate your idea above 
(with or without support for highlighting/copying the book text).
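
Here is a minimal sketch of the text half, using Python's ElementTree 
instead of XSLT to do the trimming. It assumes the pdftoxml output has 
PAGE elements with a "number" attribute and TOKEN elements with x, y, 
width and height attributes, with no XML namespace and with coordinates 
in points measured from the top-left corner. Verify that against your 
own output before relying on it. The function names and the scale 
parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def extract_tokens(xml_text):
    """Reduce pdftoxml-style XML to {page number: [text box dicts]}."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("PAGE"):  # assumed element name
        boxes = []
        for tok in page.iter("TOKEN"):  # assumed element name
            boxes.append({
                "text": (tok.text or "").strip(),
                "x": float(tok.get("x")),
                "y": float(tok.get("y")),
                "width": float(tok.get("width")),
                "height": float(tok.get("height")),
            })
        pages[int(page.get("number"))] = boxes
    return pages


def overlay_html(boxes, scale=1.0):
    """Emit one absolutely positioned span per text box.

    `scale` converts PDF units to image pixels, e.g. dpi / 72 if the
    coordinates are in points.
    """
    spans = []
    for b in boxes:
        style = (
            "position:absolute;"
            "left:{:.0f}px;top:{:.0f}px;width:{:.0f}px;height:{:.0f}px"
        ).format(b["x"] * scale, b["y"] * scale,
                 b["width"] * scale, b["height"] * scale)
        spans.append('<span style="{}">{}</span>'.format(style, escape(b["text"])))
    return "\n".join(spans)
```

The spans would sit in a position:relative container on top of the page 
image. With transparent text they can give you selectable, copyable text 
over the picture, and the same text can feed the plain-text field from 
the original idea.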

Ted


