Improving quality of scan document
#1
Hey all, this is my first post here, although I have used Gimp for a long time for simple things.

I have a problem that I think Gimp could help me solve, but I'm not sure how. I have a scan of an old service manual. It is in very good shape in general, but it would be great if I could bring it close to perfect. I think there are three things that could improve the current quality from 80% to 95% (subjective numbers). These things are:
  • Remove all single-pixel marks. The document has some stray single pixels on the white background. Removing all of those would be a large improvement.
  • Straighten out the pages. Some pages are not perfectly "vertical", so detecting the pages that are at an angle and rotating them would also be a huge improvement.
  • Use OCR to read the text and replace the somehow "broken" characters with newly rendered ones of the same font and size.
I think the last one is the hardest. Another challenge is that the document is 400+ pages long, so whatever I do should be scripted, something like:
  • PDF to individual images
  • Process each image
  • List of images to PDF
Does anyone have an idea of how I could tackle this? If I should use another tool instead of Gimp, I'm also open to suggestions.

Attached is an example of a page.


#2
I do not think you will have much luck processing 400+ pages ;)

Do you have a better image? The one you posted is 743 x 921 pixels, which leaves text of that size pixelated.

Quote:Remove all single-pixel marks. The document has some stray single pixels on the white background. Removing all of those would be a large improvement.

Despeckle does not really work on that image, and it also degrades the text. I fall back on a simple solution: colour-select the white background with the select threshold upped to about 50. That excludes the text and includes the speckles. Then bucket fill with white and turn the selection off.
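
For anyone who wants to script that step, here is a minimal sketch of the same idea in plain Python. It assumes a greyscale page held as a list of rows of 0-255 values, and it assumes the speckles are lighter than the text (which is what makes the threshold trick work). The function name `clear_near_white` is made up for illustration, and mapping GIMP's threshold of 50 onto "within 50 of pure white" is my own approximation, not GIMP's exact selection rule.

```python
WHITE = 255

def clear_near_white(pixels, threshold=50):
    """Return a copy of a greyscale image with near-white pixels set to white.

    `pixels` is a list of rows, each a list of 0-255 grey values.
    A pixel counts as "near white" when it is within `threshold` of 255,
    which roughly mimics selecting the background by colour and then
    bucket filling the selection with white.
    """
    cutoff = WHITE - threshold
    return [[WHITE if v >= cutoff else v for v in row] for row in pixels]

# Tiny hypothetical page: two light speckles (230, 210) and two dark text pixels.
page = [
    [255, 255, 230, 255],
    [255,  20,  15, 255],
    [210, 255, 255, 255],
]
cleaned = clear_near_white(page)
```

The dark text pixels survive because they sit far below the cutoff. Speckles that are as dark as the text would survive too, so this is not a general despeckle.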

Quote:Straighten out the pages. Some pages are not perfectly "vertical", so detecting the pages that are at an angle and rotating them would also be a huge improvement.

Some screenshots: https://i.imgur.com/eeaueUM.jpg (1) (2) (3)

For that sample page it takes more than just straightening. Throw in some guides and use the Unified Transform tool to line up that left-side edge. (1)
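
If you want to at least detect which pages are tilted before touching them, one possible approach (not what I did by hand above) is to measure the slope of the left edge of the text. The sketch below assumes a greyscale page as a list of rows of 0-255 values and a roughly straight left margin. It fits a least-squares line through the first dark pixel of each row. The names `left_edge_skew_degrees` and `synthetic_page` are hypothetical, and whether a positive angle means clockwise depends on the tool you rotate with.

```python
import math

def left_edge_skew_degrees(pixels, dark_below=128):
    """Estimate page skew from the left edge of the dark content.

    For every row containing a dark pixel, record the column of the first
    dark pixel, then fit column = a + b * row by least squares.
    atan(b) is the tilt of that edge away from vertical, in degrees.
    """
    points = []
    for y, row in enumerate(pixels):
        for x, v in enumerate(row):
            if v < dark_below:
                points.append((y, x))
                break
    if len(points) < 2:
        return 0.0  # not enough content to estimate anything
    n = len(points)
    mean_y = sum(y for y, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in points)
    cov = sum((y - mean_y) * (x - mean_x) for y, x in points)
    return math.degrees(math.atan(cov / var_y))

def synthetic_page(rows, cols, edge):
    """Build a test page that is dark to the right of column edge(y)."""
    return [[0 if x >= edge(y) else 255 for x in range(cols)] for y in range(rows)]

tilted = synthetic_page(40, 80, lambda y: 10 + y)   # edge drifts one column per row
straight = synthetic_page(40, 80, lambda y: 10)     # perfectly vertical edge
```

Indented lines and page furniture will pull the fit off, so treat the number as a flag for "this page needs attention" rather than an exact rotation. It also only measures rotation, not the perspective distortion in the sample page.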

Quote:PDF to individual images

For a decent but large PDF I scaled up to US Letter size (8.5" wide @ 300 ppi). That gave an image size of 2550 x 3161 pixels, which is a large image. I sharpened that with the G'MIC plugin, which has some good tools. (2)
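
The arithmetic behind those numbers, as a small sketch: the width is just inches times ppi, and the height follows from keeping the aspect ratio of the 743 x 921 sample. The helper name `scaled_size` is made up for illustration.

```python
def scaled_size(src_w, src_h, target_width_in, ppi):
    """Pixel size after scaling to a target print width, keeping aspect ratio."""
    new_w = round(target_width_in * ppi)
    new_h = round(src_h * new_w / src_w)
    return new_w, new_h

# The 743 x 921 sample scaled to 8.5 inches wide at 300 ppi.
print(scaled_size(743, 921, 8.5, 300))  # (2550, 3161)
```

That is a scale factor of about 3.4 in each direction, so the scaled image has no more real detail than the original; it just gives the sharpening and OCR steps more pixels to work with.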

Gimp makes large PDFs, so I exported a much reduced-quality JPEG (quality 45, chroma halved) and used that with ImageMagick to make the 500 KB PDF that fits this forum's maximum file size. See the attached PDF.
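
For the scripted "PDF to images, process, images to PDF" loop, a rough sketch could look like the following. It assumes `pdftoppm` (from poppler-utils) and ImageMagick's `convert` are installed; on ImageMagick 7 the command is `magick` instead. The function names, file names and the `process_page` callback are all hypothetical placeholders, and this is not the exact set of commands I ran.

```python
import subprocess
from pathlib import Path

def split_command(pdf, out_prefix, ppi=300):
    """pdftoppm call that renders every page of `pdf` to <out_prefix>-N.png."""
    return ["pdftoppm", "-r", str(ppi), "-png", str(pdf), str(out_prefix)]

def join_command(images, out_pdf):
    """ImageMagick call that wraps a list of page images into one PDF."""
    return ["convert", *[str(p) for p in images], str(out_pdf)]

def run_pipeline(pdf, workdir, out_pdf, process_page):
    """Split the PDF, call process_page(path) on every page image, rebuild the PDF."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(split_command(pdf, workdir / "page"), check=True)
    # Assumes pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    pages = sorted(workdir.glob("page-*.png"))
    for page in pages:
        process_page(page)  # placeholder: clean-up, deskew, etc., edited in place
    subprocess.run(join_command(pages, out_pdf), check=True)
```

Something like `run_pipeline("manual.pdf", "pages", "manual-clean.pdf", my_cleanup)` would then drive the whole thing, with `my_cleanup` being whatever per-page processing you settle on. Expect 400+ pages at 300 ppi to take a while and a fair amount of disk space.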

Quote:Use OCR to read the text and replace the somehow "broken" characters with newly rendered ones of the same font and size.

Well, do you know a matching font? Since you use Linux, Tesseract and a front end for it should be available. Tesseract is very good, but it does depend on the image quality. The whole point is to get text that will run through a spell checker without your intervention. (3) is a comparison between the small image (not so good) and the scaled-up image (better).
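
One way to put a number on "not so good" versus "better" is to OCR both versions and count how many words a word list recognises. The sketch below is only an illustration: `ocr_command` builds the standard `tesseract <image> <output base> -l <lang>` call, while `known_word_ratio`, the tiny vocabulary and the two sample strings are invented for the example and are not real OCR output from the manual.

```python
import re

def ocr_command(image, out_base, lang="eng"):
    """Tesseract call that writes the recognised text to <out_base>.txt."""
    return ["tesseract", str(image), str(out_base), "-l", lang]

def known_word_ratio(text, vocabulary):
    """Fraction of words in `text` found in `vocabulary` (case-insensitive)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)]
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)

# Hypothetical word list and OCR results, for illustration only.
vocabulary = {"remove", "the", "cover", "screws"}
small_scan = "Remove the c0ver screws"
scaled_scan = "Remove the cover screws"
print(known_word_ratio(small_scan, vocabulary))   # 0.75
print(known_word_ratio(scaled_scan, vocabulary))  # 1.0
```

With a real dictionary file in place of the toy vocabulary, a low ratio on a page is a hint that the scan needs more clean-up before the OCR text can be trusted.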


best of luck.


Attached Files
.pdf   page-12-98.pdf (Size: 481.35 KB / Downloads: 185)