03-19-2023, 10:51 AM
I do not think you will have much luck processing 400+ pages.
Do you have a better image? The one you posted is 743 x 921 pixels, which leaves text of that size pixelated.
Quote:Remove all single pixel marks. The document has in the background some single pixels (over the white background). Removing all those would be a large improvement.
Despeckle does not really work on that image; it also degrades the text. I fall back on a simple solution: colour-select the white background with the selection threshold raised to about 50. That excludes the text and includes the speckles. Then bucket fill the selection with white and turn the selection off.
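The select-and-fill idea can be sketched outside GIMP as well. This is only a rough illustration, assuming an 8-bit greyscale page held as a list of rows. The function name `whiten_background` is made up for the example, and reading the threshold as "within 50 levels of pure white" is my assumption, not GIMP's exact selection algorithm.

```python
def whiten_background(pixels, threshold=50, white=255):
    """Return a copy where every pixel within `threshold` of white becomes white.

    Assumption: `pixels` is a list of rows of 8-bit greyscale values (0-255).
    """
    cutoff = white - threshold
    return [[white if value >= cutoff else value for value in row] for row in pixels]


page = [
    [255, 250, 30],   # clean background, faint speckle, dark text
    [210, 255, 0],    # light grey speckle, background, black text
    [255, 120, 255],  # mid-grey mark, darker than the cutoff, so left alone
]
cleaned = whiten_background(page)
```

Light speckles get absorbed into the background while anything darker than the cutoff, including the text, is untouched. Dark speckles would survive this, just as they would in GIMP with the same threshold.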
Quote:Straighten out the pages. Some pages are not perfectly "vertical", so detecting the pages that are at an angle and rotating them would also be a huge improvement.
Some screenshots: https://i.imgur.com/eeaueUM.jpg (1) (2) (3)
For that sample page it takes more than just straightening. Throw in some guides and use the Unified Transform tool to line up that left-side edge. (1)
Quote:PDF to individual images
For a decent but large PDF I scaled up to US Letter size (8.5" wide @ 300 ppi). That gave an image size of 2550 x 3161 pixels, a large image. I sharpened that with the G'MIC plugin, which has some good tools. (2)
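For anyone checking the numbers, the pixel size follows from the print width and resolution, with the height scaled to keep the aspect ratio of the 743 x 921 original. A minimal sketch of that arithmetic (the function name `scaled_size` is just for illustration):

```python
def scaled_size(src_width, src_height, target_width_in, ppi):
    """Pixel size after scaling to a print width, keeping the aspect ratio."""
    target_width = round(target_width_in * ppi)
    target_height = round(src_height * target_width / src_width)
    return target_width, target_height


# 743 x 921 source scaled to 8.5 inches wide at 300 ppi
width, height = scaled_size(743, 921, 8.5, 300)
print(width, height)
```

That is a scale factor of roughly 3.4 in each direction, which is why the result counts as a large image.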
GIMP makes large PDFs, so I exported a much-reduced-quality JPEG (quality 45, chroma halved) and used that with ImageMagick to make a 500 KB PDF that fits this forum's maximum file size. See the attached PDF.
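The JPEG-to-PDF step is easy to script for many pages. A hedged sketch, assuming ImageMagick is installed and that the page files are named as below (the file names and both helper functions are hypothetical). ImageMagick takes the input images first and the output file last; version 7 installs the command as `magick`, older versions as `convert`.

```python
import shutil
import subprocess


def build_pdf_command(jpeg_paths, pdf_path, executable="convert"):
    """Build the ImageMagick argument list: input images first, output PDF last."""
    return [executable, *jpeg_paths, pdf_path]


def make_pdf(jpeg_paths, pdf_path):
    """Run ImageMagick to combine the JPEG pages into a single PDF."""
    # Prefer `magick` (ImageMagick 7), fall back to `convert` (ImageMagick 6).
    executable = shutil.which("magick") or shutil.which("convert")
    if executable is None:
        raise RuntimeError("ImageMagick not found on PATH")
    subprocess.run(build_pdf_command(jpeg_paths, pdf_path, executable), check=True)


# Hypothetical file names; nothing is executed until make_pdf() is called.
command = build_pdf_command(["page-001.jpg", "page-002.jpg"], "book.pdf")
```

The JPEGs here are assumed to be already exported at reduced quality, so the script only wraps them into the PDF.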
Quote:Use OCR to read the text and replace the somehow "broken" characters with newly rendered ones of the same font and size.
Well, do you know a matching font? Since you use Linux, Tesseract and a front end for it should be available. Tesseract is very good, but it does depend on the image quality. The whole point is to get text that will run through a spell checker without your intervention. (3) is a comparison between the small image (not so good) and the scaled-up image (better).
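If a front end is not to hand, the Tesseract command line can be driven from a small script. This is a sketch only, assuming the `tesseract` binary is on the PATH and the English language data is installed. The file names and the helper functions `build_ocr_command` and `run_ocr` are hypothetical.

```python
import subprocess


def build_ocr_command(image_path, output_base, language="eng"):
    """Argument list for the tesseract CLI: image, output base name, language."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run tesseract and return the path of the text file it writes."""
    subprocess.run(build_ocr_command(image_path, output_base, language), check=True)
    # tesseract appends .txt to the output base name by default
    return output_base + ".txt"


# Hypothetical file names; nothing is executed until run_ocr() is called.
command = build_ocr_command("page-001.png", "page-001")
```

Feeding it the scaled-up page rather than the small original should give the better result, as in comparison (3).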
Best of luck.