by wjw on November 9, 2011

Can anyone recommend a good .pdf to .doc and/or .rtf translator?

I’ve tried a number of them, and they all load the resulting file with massive amounts of formatting clutter, which makes it (among other things) impossible to produce a decent result while working in e-formats.

Friend of the blog TJIC managed to fix Hardwired when it was afflicted with this problem, but (1) I’d hate to keep bugging him with one ms after another, and (2) I’d rather the problem never exist in the first place.

So far I’ve used Smart PDF Converter, DocSmartz Platinum, Able2Extract, Nitro, and some free online services, all of which had similar problems.

Anyone know anything else?

Zora November 9, 2011 at 6:13 am

I’d run it through ABBYY FineReader to a .txt output. I would have to restore any italics and underlining, but I wouldn’t get formatting junk. Unfortunately, ABBYY isn’t free.

James R. Strickland November 9, 2011 at 6:46 am

I’d suggest feeding it to Adobe Acrobat and using the OCR functionality.

Ingvar November 9, 2011 at 7:25 am

Unfortunately, getting something sane out of PDF seems to border on “requires human levels of cognition”. There are (usually) no long chunks of text inside the PDF (there may be word fragments, but usually only between kerning points), so a lot of what’s done is (essentially) OCR.

DataPacRat November 9, 2011 at 10:55 am

For any e-text file that has at least somewhat reasonably sane page-formatting, including at least some PDFs, my first choice is to try using Calibre, http://calibre-ebook.com/ , to convert it to the desired file-format.

TJIC November 9, 2011 at 12:51 pm

Tweaking Hardwired took me about three minutes total, so I really don’t mind doing the rest of the oeuvre.

wjw November 10, 2011 at 1:50 am

So what seems to have happened is that Adobe created a format that is absolutely unusable.

And now everyone uses it.

Turns out the easiest thing to do with this particular book was to go back to the original files from 1991, written on a freeware word processor no one uses any more, and which doesn’t translate particularly well, but at least translates with =some= formatting intact.

“At least it isn’t Adobe,” I keep thinking to myself.

TJIC, I may take you up on that, by an’ by.

Michael_gr November 14, 2011 at 7:11 pm

Walter, don’t be too harsh on Adobe / PDF. Your problems with the format occur because you are not using it for its original purpose. PDF is a *formatting* file format, not a text format. It was designed to specify the exact placement of elements on a page, so that a file prints precisely as you set it on your machine no matter what. Each word processor, graphics app, form-producing program or what have you has its own text layout engine, and there is no universal standard of text layout that would hyphenate, break lines, and handle kerning exactly the same way. So the application has to break the text into little bits and place them on the page using sets of coordinates.
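To make the “little bits placed by coordinates” point concrete, here is a minimal sketch of what a converter has to do to get readable text back. It is not how any of the tools mentioned in this thread actually work; the `Fragment` type, the `rebuild_lines` function, the tolerance values, and the sample coordinates are all made up for illustration. It assumes each bit of text comes with a left edge, a baseline height, and a width.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A hypothetical positioned bit of text, as a converter might see it."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline height on the page
    width: float  # horizontal extent of the fragment


def rebuild_lines(fragments, line_tolerance=2.0, space_gap=1.5):
    """Group fragments into lines by baseline, then join them left to right.

    Both thresholds are illustrative guesses, not values from any real tool.
    """
    lines = []  # each entry: (baseline_y, [fragments on that line])
    # PDF page coordinates grow upward, so the largest y is the top line.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        pieces = [frags[0].text]
        for prev, cur in zip(frags, frags[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap wider than space_gap is treated as a word break;
            # anything smaller is assumed to be a kerning adjustment.
            if gap > space_gap:
                pieces.append(" ")
            pieces.append(cur.text)
        result.append("".join(pieces))
    return result


# Made-up fragments, deliberately out of reading order.
fragments = [
    Fragment("wired", 118.0, 700.0, 28.0),
    Fragment("Hard", 96.0, 700.0, 21.5),
    Fragment("A", 96.0, 686.0, 7.0),
    Fragment("no", 106.0, 686.0, 11.0),
    Fragment("vel", 117.2, 686.0, 14.0),
]
print(rebuild_lines(fragments))
```

Even this toy version has to guess where words and lines begin and end, and it knows nothing about paragraphs, italics, columns, or hyphenation. That guessing is plausibly where much of the formatting clutter in converted files comes from.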

There is a subset of PDF that was designed for archival use. It’s called PDF/A-1. Most word processors will allow you to save in this format (sometimes it’s called “tagged PDF”). This subformat retains the paragraph structure and you can recover the formatted text from such a file. But most people don’t use it or know about it. Of course, many people don’t want you to be able to recover the text and are using PDF exactly for that reason.

Steinar Bang November 17, 2011 at 10:43 pm

Ran across this one called “pdfmasher”, today:

No idea how good it is.

Personally I’ve tried using calibre for this, with… varying degrees of success.

