Gimp-Forum.net
New camera text scan cleanup plugin - Printable Version


New camera text scan cleanup plugin - udif - 02-12-2018

Hi,

I wrote a C-based GIMP plugin to clean up text scans done with a camera.
Some of these scans, especially of books, tend to have darker and lighter areas on the page, and these don't work well when you try to use a thresholding tool on your text.
The plugin finds the background level dynamically for each small part of the picture by dividing the picture into squares whose radius is the inner_size parameter. For each square, the level is calculated over a larger square whose radius is the kernel_size parameter (which is larger than inner_size). A histogram is made of all the pixel values within the kernel area, and the most popular one is assumed to be the background level.
The next darker histogram peak is assumed to be the text color, and any pixel brighter than the text brightness (plus the threshold adjust) is squashed to white.
The result is the original text in its original brightness, plus a full white background, suitable for printing.
I forgot to add that the filter first converts everything to grayscale.
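
The description above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the plugin's C code. It assumes that a "radius" r means a (2r+1)-pixel square and that kernel_size is at least inner_size, and it uses a hypothetical min_gap parameter in place of real peak detection when looking for the text level.

```python
def clean_text_scan(gray, kernel_size, inner_size, threshold_adjust=0, min_gap=20):
    """Whiten the background of a grayscale image given as rows of 0..255 ints.

    Assumptions: a "radius" r means a (2r+1)-pixel square, kernel_size >= inner_size,
    and min_gap is a hypothetical stand-in for real histogram peak detection.
    """
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    step = 2 * inner_size + 1
    for top in range(0, height, step):
        for left in range(0, width, step):
            cy, cx = top + inner_size, left + inner_size
            # Histogram of the kernel window around the centre of this square.
            hist = [0] * 256
            for y in range(max(0, cy - kernel_size), min(height, cy + kernel_size + 1)):
                for x in range(max(0, cx - kernel_size), min(width, cx + kernel_size + 1)):
                    hist[gray[y][x]] += 1
            # The most popular level is taken as the background.
            background = max(range(256), key=hist.__getitem__)
            # The most popular level more than min_gap darker is taken as the text.
            limit = max(0, background - min_gap)
            text = max(range(limit), key=hist.__getitem__, default=None)
            if text is None or hist[text] == 0:
                threshold = limit  # no darker peak found in this window
            else:
                threshold = text + threshold_adjust
            # Squash everything brighter than the threshold to white; keep the rest.
            for y in range(top, min(height, top + step)):
                for x in range(left, min(width, left + step)):
                    if gray[y][x] > threshold:
                        out[y][x] = 255
    return out


# Synthetic page: lit left half (background 200, text 120), shadowed right half
# (background 110, text 40), with a "text" pixel on every 4th row and column.
page = [[(120 if x < 12 else 40) if (y % 4 == 1 and x % 4 == 1)
         else (200 if x < 12 else 110)
         for x in range(24)] for y in range(12)]
cleaned = clean_text_scan(page, kernel_size=3, inner_size=1)
```

In this synthetic example the text pixels away from the lit/shadow boundary keep their original values while the surrounding background becomes 255. A sharp lighting edge like the one in the test page can still confuse the histogram near the boundary; real shadows are usually more gradual.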

The plugin is in: Gimp-clean-text-photos
(Sorry, binaries are Windows-only (x86), but full source code is provided.)

Demo picture taken from here:
https://pxhere.com/en/photo/745068

Original Photo

Using the GIMP threshold tool:
You can easily see that even if you let part of the picture become black, some other section is still too bright.
No global threshold level across all the picture can separate all the text from the background.
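
A tiny worked example of why this happens, using made-up pixel levels (not measured from the demo photo): if the text in the lit area is brighter than the background in the shadowed area, no single level can turn all the text black and all the background white.

```python
def global_threshold(gray, level):
    """Binary threshold: pixels at or above `level` become white, the rest black."""
    return [[255 if v >= level else 0 for v in row] for row in gray]


# Hypothetical levels: lit background, lit text, shadowed background, shadowed text.
row = [[200, 120, 110, 40]]
wanted = [[255, 0, 255, 0]]  # background white, text black, in both areas

# Try every possible level and keep the ones that give the wanted result.
working_levels = [level for level in range(256)
                  if global_threshold(row, level) == wanted]
```

Here working_levels comes out empty: the lit text (120) needs a level above 120, while the shadowed background (110) needs a level of 110 or less.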

Using my plugin (with a kernel size of 40, an inner size of 3, and a threshold adjust of -12):

Increasing the kernel size increases the area over which the histogram is made. You want it to be at least as large as the height of the text rows, so that the dark letters will never become the majority pixels instead of the background.
Increasing the inner_size makes things faster, since each per-square calculation then covers more pixels.
Changing the threshold_adjust controls the offset from the 2nd histogram peak, effectively turning it into a brightness control.
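
To illustrate the kernel size advice, here is a small sketch with a made-up patch (a thick dark stroke on a light background). With a window smaller than the stroke, the most popular level is the text; with a window taller than the stroke, it is the background again. The helper name most_common_level is only for this example.

```python
def most_common_level(gray, cy, cx, radius):
    """Most frequent level in the (2*radius+1) square centred on (cy, cx)."""
    height, width = len(gray), len(gray[0])
    hist = [0] * 256
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            hist[gray[y][x]] += 1
    return max(range(256), key=hist.__getitem__)


# 21x21 patch: background 200 with a 5-pixel-thick horizontal stroke (level 40).
patch = [[40 if 8 <= y <= 12 else 200 for x in range(21)] for y in range(21)]

small = most_common_level(patch, 10, 10, 2)  # window lies entirely inside the stroke
large = most_common_level(patch, 10, 10, 8)  # window is taller than the stroke
```

With the small window the "background" estimate is the text level (40); with the large one it is the true background (200).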

I used it to clean up hundreds of text scans of open books, done using a simple pocket camera, where some of the pages are darker or have shadows. The objective was to get pictures that have a white background suitable for printing on a B/W laser printer, without losing text clarity.

Hope you find this useful.


RE: New camera text scan cleanup plugin - Ofnuts - 02-12-2018

(02-12-2018, 01:13 AM)udif Wrote: Using the GIMP threshold tool:
You can easily see that even if you let part of the picture become black, some other section is still too bright.
No global threshold level across all the picture can separate all the text from the background.

Canonical technique:

- Duplicate the layer, then blur the top copy heavily so that the text disappears
- Set the top copy's mode to Grain extract
- Layer > New from Visible
- Apply Threshold

[attachment=1450]
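
The steps above can be sketched in pure Python as follows. This is only an approximation of what GIMP does: a box blur stands in for GIMP's blur filter, and grain extract is taken to be lower layer minus upper layer plus 128, clamped to 0..255. The test page and the threshold level of 80 are made up for the example.

```python
def box_blur(gray, radius):
    """Mean over a (2*radius+1) square, clipped at the image edges."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        ys = range(max(0, y - radius), min(height, y + radius + 1))
        row = []
        for x in range(width):
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            total = sum(gray[j][i] for j in ys for i in xs)
            row.append(total // (len(ys) * len(xs)))
        out.append(row)
    return out


def grain_extract(lower, upper):
    """Assumed formula: lower - upper + 128, clamped to 0..255."""
    return [[min(255, max(0, a - b + 128)) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(lower, upper)]


def threshold(gray, level):
    """Pixels at or above `level` become white, the rest black."""
    return [[255 if v >= level else 0 for v in row] for row in gray]


# Synthetic page: lit left half (background 200, text 120), shadowed right half
# (background 110, text 40), with a "text" pixel on every 4th row and column.
def is_text(y, x):
    return y % 4 == 1 and x % 4 == 1

page = [[(120 if x < 12 else 40) if is_text(y, x) else (200 if x < 12 else 110)
         for x in range(24)] for y in range(12)]

blurred = box_blur(page, 3)             # heavily blurred top copy
flattened = grain_extract(page, blurred)  # lighting differences cancel out
result = threshold(flattened, 80)        # one global level now works
```

After the grain extract step the background sits near 128 everywhere and the text is well below it, so a single threshold level separates them in this example.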


Personal remark: there are two kinds of "camera scans": those taken with even lighting and good equipment (copy stand, DSLR...) and those shot handheld with a smartphone under available light. In the former, the picture is clean and algorithms work well. In the latter, the picture is mercilessly postprocessed by the camera, and the generated artifacts, although not always visible, will often trip up algorithms (on the lesser smartphones, this is a pixel soup...). So, better test your thing with real-life photos.



RE: New camera text scan cleanup plugin - udif - 02-12-2018

Thanks, every day you learn something new :-)

Regarding your remark, I wrote the plugin after I had to process a few hundred pictures shot by a (not so bad) pocket camera in non-ideal lighting conditions, where the resulting documents are not evenly lit. (Pictures not taken by me).
Had I known this technique, I might not have bothered writing this filter ...
Still, your suggestion does 2-level thresholding with a local threshold based on average background levels.
My plugin behaves slightly differently. Anything above the threshold is turned into white, but anything below is kept as-is, so it has some antialiasing effect.
Maybe I could try doing something similar with the curve tool.
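
The difference between the two behaviours can be shown on a single row of pixels. The levels below are made up to represent the soft edge of a letter; the function names are just for this sketch.

```python
def binary_threshold(value, level):
    """2-level result: white above the threshold, black otherwise."""
    return 255 if value > level else 0


def squash_to_white(value, level):
    """White above the threshold, original value otherwise."""
    return 255 if value > level else value


# Hypothetical soft edge of a letter, from dark stroke to light background.
edge = [40, 70, 100, 130, 200]
level = 110

two_level = [binary_threshold(v, level) for v in edge]
kept_grey = [squash_to_white(v, level) for v in edge]
```

two_level is [0, 0, 0, 255, 255], while kept_grey is [40, 70, 100, 255, 255]: the intermediate greys at the edge of the stroke survive, which is where the antialiasing effect comes from.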