Best workflow for PDF to ePub processing and editing (large project)
Colleagues have been asked to convert a large PDF archive (about 150,000 items) to ePub and Mobi (keeping the PDFs, too, of course).
I ran a few of the newer ones through Calibre to find the pain points: TOCs need to be deleted and recreated, tables dissolve into lines of text, and extra line spaces become flush-left periods. And of course the older PDFs are scanned images, so there's no point to the conversion. Might as well retype...
If they still have to go ahead after all of these issues have been reported, are there any known ways to automate any of the tasks involved? If each document needs individualized editing, the project looks to me like it will take years. Anyone interested?
BTW, I'm the only designer in the group; the others are archivists and data processors. So only I use InDesign (OS X); they use Windows 7 (I believe) and MS Office. No XML pros either, as far as I know.
I actually found an answer to the question not long ago that hasn't yet been mentioned, so I'm answering my own question. It's a simple program that rejiggers a larger PDF to fit mobile/iPad screen sizes and outputs a new PDF at the dimensions you choose. It works for scanned PDFs as well.
Quote from the forum where I heard about it: "It reworks PDFs to a format that is Kindle-friendly (optimized for a 6in wide screen, with an option for 4in wide phone screen). The files remain PDFs. See the attached files for comparison (the one with _k2opt is the revised PDF). It even works on scanned PDFs, as you will see. On top of all that, it will convert a whole folder of PDFs at a time."
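For anyone who wants to script the folder run rather than rely on the program's own batch mode, here is a minimal sketch in Python. It assumes the converter can be called from the command line; the post doesn't give its command syntax, so `command_template` is a hypothetical placeholder you would fill in yourself. The `_k2opt` suffix is taken from the quote above; the function names are my own.

```python
import subprocess
from pathlib import Path

SUFFIX = "_k2opt"  # suffix seen on the revised files in the quoted post


def plan_outputs(folder, suffix=SUFFIX):
    """Return (source, target) pairs for PDFs that still need converting."""
    pairs = []
    for src in sorted(Path(folder).rglob("*.pdf")):
        if src.stem.endswith(suffix):
            continue  # this file is itself a converted copy
        dst = src.with_name(src.stem + suffix + ".pdf")
        if not dst.exists():  # skip work that has already been done
            pairs.append((src, dst))
    return pairs


def run_batch(folder, command_template):
    """Run the converter once per pending PDF and return the sources that failed.

    command_template is a list of arguments in which "{src}" and "{dst}"
    are replaced per file. The real arguments depend on the converter and
    are an assumption here, not something taken from the post.
    """
    failures = []
    for src, dst in plan_outputs(folder):
        cmd = [arg.format(src=src, dst=dst) for arg in command_template]
        if subprocess.run(cmd).returncode != 0:
            failures.append(src)
    return failures
```

Because already-converted files are skipped, the run can be stopped and restarted, which matters with an archive this size. The original PDFs are never touched.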
Using this, we could prepare the needed formats with less effort than with any other method. We haven't committed to doing this, but we now have a way to do it in less than the 10 years I estimated it might take one person.
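To put the 10-year figure in perspective, here is a rough back-of-the-envelope calculation. The 150,000 items and 10 years come from this thread; the 250 working days per year and 8-hour day are my assumptions, so treat the result as a ballpark only.

```python
ITEMS = 150_000               # archive size from the question
YEARS = 10                    # one-person estimate from the answer
WORKING_DAYS_PER_YEAR = 250   # assumption
HOURS_PER_DAY = 8             # assumption


def manual_pace(items=ITEMS, years=YEARS,
                days_per_year=WORKING_DAYS_PER_YEAR,
                hours_per_day=HOURS_PER_DAY):
    """Return (documents per working day, minutes available per document)."""
    per_day = items / (years * days_per_year)
    minutes_per_item = hours_per_day * 60 / per_day
    return per_day, minutes_per_item


per_day, minutes = manual_pace()
print(f"{per_day:.0f} documents per day, {minutes:.0f} minutes each")
# prints "60 documents per day, 8 minutes each"
```

Under those assumptions, one person would have about 8 minutes per document, which leaves little room for rebuilding TOCs and tables by hand.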