02-12-2018, 01:13 AM
Hi,
I wrote a C-based GIMP plugin to clean up text scans made with a camera.
Some of these scans, especially of books, tend to have darker and lighter areas on the page, and these don't work well when you try to use a thresholding tool on your text.
The plugin finds the background level dynamically for each small part of the picture by dividing the picture into squares whose radius is the inner_size parameter. For each square, the level is calculated over a larger square whose radius is the kernel_size parameter (which is larger than inner_size). A histogram is built of all the pixel values within the kernel area, and the most frequent value is assumed to be the background level.
The next darker histogram peak is assumed to be the text color, and any pixel brighter than the text brightness (plus the threshold adjust) is squashed to white.
The result is the original text in its original brightness, plus a full white background, suitable for printing.
I forgot to add that the filter first converts everything to greyscale.
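To make the description above concrete, here is a minimal sketch in C of the per-window analysis. It is not the plugin's actual source: the names window_levels, window_histogram, find_levels and clean_pixel are made up for illustration, the way the "next darker peak" is located (walk down the dark slope of the background peak, then take the strongest remaining bin) is only one plausible reading of the description, and whitening everything when no text peak is found is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Grey levels found in one kernel window (hypothetical helper type). */
typedef struct {
    int background; /* most frequent level in the window */
    int text;       /* strongest peak darker than the background, -1 if none */
} window_levels;

/* Histogram of the square window of the given radius around (cx, cy),
 * clipped to the image borders. img is 8-bit greyscale, row-major. */
static void window_histogram(const uint8_t *img, int width, int height,
                             int cx, int cy, int radius, unsigned hist[256])
{
    assert(img != NULL && width > 0 && height > 0 && radius >= 0);
    for (int i = 0; i < 256; i++)
        hist[i] = 0;

    int x0 = cx - radius < 0 ? 0 : cx - radius;
    int y0 = cy - radius < 0 ? 0 : cy - radius;
    int x1 = cx + radius >= width ? width - 1 : cx + radius;
    int y1 = cy + radius >= height ? height - 1 : cy + radius;

    for (int y = y0; y <= y1; y++)
        for (int x = x0; x <= x1; x++)
            hist[img[(size_t)y * width + x]]++;
}

/* Background = most frequent level. Text = strongest bin found after
 * walking down the dark side of the background peak (an assumption). */
static window_levels find_levels(const unsigned hist[256])
{
    window_levels r = { 0, -1 };
    for (int v = 1; v < 256; v++)
        if (hist[v] > hist[r.background])
            r.background = v;

    /* Descend towards black while the counts keep falling or stay flat. */
    int v = r.background;
    while (v > 0 && hist[v - 1] <= hist[v])
        v--;

    /* Everything darker than the valley floor is a text candidate. */
    unsigned best = 0;
    for (int t = v - 1; t >= 0; t--) {
        if (hist[t] > best) {
            best = hist[t];
            r.text = t;
        }
    }
    return r;
}

/* Pixels brighter than (text + threshold_adjust) are squashed to white;
 * darker pixels keep their original value. */
static uint8_t clean_pixel(uint8_t value, window_levels lv, int threshold_adjust)
{
    if (lv.text < 0)
        return 255; /* assumption: no text peak means background only */
    int cutoff = lv.text + threshold_adjust;
    return value > cutoff ? 255 : value;
}
```

In a full filter, window_histogram and find_levels would run once per inner square, and clean_pixel would then be applied to every pixel of that square. The sign convention of the adjustment is taken literally from the description and may differ in the real plugin.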
The plugin is in: Gimp-clean-text-photos
(Sorry, the binaries are Windows-only (x86), but full source code is provided.)
Demo picture taken from here:
https://pxhere.com/en/photo/745068
Original Photo
Using the GIMP Threshold tool:
You can easily see that even if you let part of the picture become black, some other section is still too bright.
No single global threshold level can separate all the text from the background across the whole picture.
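The reason a global threshold fails can be checked with a few lines of C. This is only an illustrative model, not part of the plugin: each region of the page is reduced to one background level and one text level (the region_levels type and the numbers in the usage note are invented), and a single cutoff T works only if it sits at or above every region's text level and below every region's background level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One region of the page under its own lighting (hypothetical model). */
typedef struct {
    int background; /* typical paper level in this region */
    int text;       /* typical ink level in this region */
} region_levels;

/* Returns true if one cutoff T exists with text <= T < background in
 * every region, where pixels brighter than T are turned white. */
static bool global_threshold_exists(const region_levels *regions, size_t count)
{
    assert(regions != NULL && count > 0);
    int darkest_background = 255;
    int brightest_text = 0;
    for (size_t i = 0; i < count; i++) {
        if (regions[i].background < darkest_background)
            darkest_background = regions[i].background;
        if (regions[i].text > brightest_text)
            brightest_text = regions[i].text;
    }
    /* A valid T must fit between the brightest ink and the darkest paper. */
    return brightest_text < darkest_background;
}
```

For example, with a well-lit region (background 220, text 90) and a shadowed region (background 80, text 30), the shadowed paper is darker than the well-lit ink, so the function returns false. That is the situation in the demo picture, and it is why the level has to be found locally.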
Using my plugin (kernel size 40, inner size 3, threshold adjust -12):
Increasing kernel_size increases the area over which the averaging is done. You want it to be at least as large as the height of a text row, so that the dark letters never outnumber the background pixels.
Increasing inner_size makes things faster, since each calculated level is then applied to more pixels.
Changing threshold_adjust controls the offset from the second histogram peak, effectively turning it into a brightness control.
I used it to clean up hundreds of text scans of open books, taken with a simple pocket camera, where some of the pages are darker or have shadows. The objective was to get pictures with a white background suitable for printing on a B/W laser printer, without losing text clarity.
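A rough way to see the speed effect of inner_size is to count how many kernel histograms are needed. The sketch below is an estimate only, not code from the plugin: it assumes a "radius" r means a square of side 2r + 1, that one histogram is built per inner square, and it ignores clipping at the image borders, so the read count is an upper bound. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Number of inner squares (and so kernel histograms) needed to cover
 * the image, assuming each square has side 2 * inner_size + 1. */
static uint64_t count_windows(int width, int height, int inner_size)
{
    assert(width > 0 && height > 0 && inner_size >= 0);
    uint64_t side = 2u * (uint64_t)inner_size + 1u;
    uint64_t cols = ((uint64_t)width + side - 1) / side;  /* round up */
    uint64_t rows = ((uint64_t)height + side - 1) / side; /* round up */
    return cols * rows;
}

/* Upper bound on pixel reads spent building histograms: one full
 * kernel window of side 2 * kernel_size + 1 per inner square. */
static uint64_t histogram_reads(int width, int height,
                                int inner_size, int kernel_size)
{
    assert(kernel_size > inner_size);
    uint64_t kernel_side = 2u * (uint64_t)kernel_size + 1u;
    return count_windows(width, height, inner_size) * kernel_side * kernel_side;
}
```

Under these assumptions, a 1000 x 1000 picture with inner size 3 needs 143 x 143 = 20449 histograms, while inner size 10 needs only 48 x 48 = 2304, at the cost of a coarser background map.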
Hope you find this useful.