Erase / remove borders from a table (scanned pdf)
#1
Hello,
I use GIMP 2.8.22 on Linux Mint 19. I recently discovered software for OCR of a scanned PDF file: Scan2pdf and Tesseract. I would like to use OCR, but the scanned PDF is a table with borders, and some data are missing. It seems to me that GIMP could remove / erase the borders automatically with the appropriate tools (thresholds, contrast, light...). Could you please help me do this?
Cheers
Seb


Attached Files
.xcf   how to remove borders.xcf (Size: 918.32 KB / Downloads: 343)
#2
I used the Eraser Tool only.
Create a new layer, fill it with the same colour as the document (white), and set it as the background (at the bottom of the layer stack).
With the Shift key held down, it is possible to draw straight lines.

When you erase the border lines, the transparency is automatically filled by the white of the bottom layer.
[Image: tS1JbxX.png]
at https://i.imgur.com/tS1JbxX.png
#3
Is your example one page of a whole series? Do you want to batch process the PDFs in some way to remove the table lines? There is no easy way that I can think of unless each page has the table lines in exactly the same place, which is most unlikely. Removing lines is just as likely to ruin the text for OCR.

I usually scan straight into Gimp using the xsane plugin. There is nothing wrong with scanning to PDF, as long as you understand matching the ppi in the open PDF dialogue. Once in Gimp it is just another graphic. As far as I know, Tesseract does not retain any formatting other than new lines *, although there might be a page segmentation mode for blocks of text. Your image is (nearly) A4 @ 300 ppi and OCRs very well without modification.
The only correction needed is painting over the vertical line between the column of numbers and the text, otherwise it is OCR'd as a '|'. The other table lines are ignored.

This is the output I get, showing Gimp 2.8 and a front-end (gImageReader) for Tesseract (4.0.0-beta.1). Character recognition is good but the formatting is lost.

   

* Expanding on that:
tesseract howtoremoveborders.png out -c preserve_interword_spaces=1 is supposed to preserve formatting, but I can't get it to work. Or, more probably, I am using the wrong command ;)
Another attempt, tesseract howtoremoveborders.png out -psm 6 (a page segmentation mode), is also not working.
Maybe I need to update tesseract.
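
A minimal sketch of how the two options above might be combined when calling Tesseract from Python. As far as I know, Tesseract 4 spells the page segmentation option with two dashes (--psm), so the single-dash form may be part of the problem. The function names build_tesseract_command and run_ocr are made up for illustration, and nothing here guarantees that the table layout survives OCR.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, psm=6, keep_spaces=True):
    """Build an argument list for the tesseract CLI (assumes Tesseract 4 or later)."""
    cmd = ["tesseract", image_path, output_base, "--psm", str(psm)]
    if keep_spaces:
        # Config variable that asks Tesseract to keep runs of spaces between words.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image_path, output_base, **options):
    """Run tesseract if it is installed and return the name of the text file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, **options),
                   check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # Only prints the command; call run_ocr() to actually execute it.
    print(" ".join(build_tesseract_command("howtoremoveborders.png", "out")))
```

Building the command as a list keeps file names with spaces (such as "how to remove borders.png") intact without shell quoting.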
#4
Here is a technique using one of my scripts, ofn-path-filter-strokes, which you will find there.

  1. Do a color selection on the text (using black of course). This selects the text and the table.
  2. Select>To path. This creates a path where each "stroke" encloses an area of white or black.
  3. When you look at your page, the areas inside the strokes generated by the grid are much larger than the areas of the strokes generated by the text.
  4. Enter the script, which can filter strokes on various size criteria, one of them being area.
  5. Your text is about 50px high, so if you filter on an area of 2500 square pixels, you eliminate the strokes from the text and only keep the ones from the grid.
  6. The script creates a new path with the small strokes removed. In the Path list, right click on it and Path to selection. This should create a selection on the table only.
  7. Select>Grow that selection by one pixel.
  8. Bucket-fill with white.
Come to think of it I could even make a script from this, but I'm currently smack in the center of Alex's path, so I have other things to worry about right now :)
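
The idea behind the steps above can be tried outside GIMP on a small grid of grey values. This sketch is not ofn-path-filter-strokes: it swaps the path-stroke filtering for plain connected-component labelling, and uses the bounding-box area of each dark component as a stand-in for the area enclosed by a stroke. The function names and the dark/white values are assumptions made for illustration.

```python
from collections import deque


def label_components(image, dark=128):
    """Return the 4-connected dark components as lists of (row, col) pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= dark:
                continue
            # Breadth-first flood fill starting from this unvisited dark pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] < dark):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_large_components(image, min_area, white=255):
    """Paint white every dark component whose bounding box covers >= min_area pixels."""
    result = [row[:] for row in image]
    for pixels in label_components(image):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        box_area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
        if box_area >= min_area:
            for y, x in pixels:
                result[y][x] = white
    return result
```

With text about 50px high, a min_area around 2500 would play the role of the area filter in step 5: letters fall below it and are kept, while the grid, being one large connected shape, is painted over.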
#5
OK, so I made the script. Still fairly experimental and not fully documented
  • Get ofn-remove-grid from here (warning: not the same location as the other script) and install as indicated at the bottom of the page.
  • The script is reached by Filters>Enhance>Remove table/grid around text
  • The script works as above:
    • Does a color selection (uses the foreground color)
    • Creates a path from the selection
    • Removes the areas below a given size from that path
    • Creates a selection from the result
    • Bucket-fills the selection with the background color.
  • The parameters:
    • Grow selection before: amount to grow the selection after the color selection. Can be used to compensate for small discontinuities in the grid. Don't push that too much, as it can make large blocks of letters contiguous and bigger than your smallest grid rectangle.
    • Area threshold: the area under which the selection parts are removed. A good way to estimate: slightly smaller than the smallest rectangle in your grid (the 1.01 in yours). How to find out:
      • Make a rectangle selection slightly inside this area
      • Open the histogram (Windows>Dockable dialogs>Histogram) and look at the Pixels value, which is the number of pixels included in the selection (about 18000 in your image)
    • Grow selection After: amount to grow the selection before the bucket fill. Add 1 or 2 pixels to cover pixels that may be missed due to path inaccuracies.
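
To illustrate what the two "grow" parameters and the Pixels value do, here is a rough sketch on a boolean mask. It is not taken from ofn-remove-grid: grow() is a plain 4-neighbour dilation, which only approximates Select>Grow, and selected_pixels() simply counts selected pixels the way the histogram reports them. Both names are made up for this example.

```python
def grow(mask, amount=1):
    """Grow a boolean selection mask by `amount` pixels (4-neighbour dilation)."""
    rows, cols = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(amount):
        grown = [row[:] for row in current]
        for y in range(rows):
            for x in range(cols):
                if current[y][x]:
                    # Select the pixels directly above, below, left and right.
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            grown[ny][nx] = True
        current = grown
    return current


def selected_pixels(mask):
    """Count the selected pixels, like the Pixels value in the histogram dialog."""
    return sum(1 for row in mask for value in row if value)


if __name__ == "__main__":
    # A horizontal grid line with a one-pixel break in the middle.
    line = [[False] * 5, [True, True, False, True, True], [False] * 5]
    print(grow(line, 1)[1])  # the break is bridged after growing by one pixel
```

The same bridging that repairs a broken grid line also joins neighbouring letters if the amount is too large, which is why the first grow value should stay small.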

Before:
   

After:
   


Enjoy.

Now trying to get some sleep before the roof flies off :)
#6
Very good!
I had already removed the borders with the tips from the previous post using ofn-path-filter-strokes.
Now, with the tips on how to calculate the area threshold, it just got even easier.
Ofn-remove-grid worked well on my Gimp 2.10.21 win10 64.
Thx ofnuts.
#7
ofn-remove-grid now has its own thread.

Made a last minor improvement: if there is a selection, its area is used as the area threshold. In other words, I automated the hint above, so there is no need to explicitly use the histogram and copy values.
#8
The ofn-remove-grid.py plugin works really well.  It answers the topic question on removing that table. Using a more usual scan, greyscale 300 ppi

   

1. Scan straight into Gimp. Not great, uneven colour, slightly skewed. A typical scan.
2. Apply levels to get even colour
3. A bit faint so apply threshold to get more contrast
4. Apply the plugin (grow selection = 1) and the dividing lines are gone. A few speckles remain.
5. Export that to a png and run through Tesseract - text is recognised
but
6. Run the original scan through Tesseract and get a better result. Tesseract ignores the lines.
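
Step 3 can be pictured as a simple per-pixel cutoff. This is only a sketch of the idea on a list of grey values; in GIMP the Threshold tool does this interactively, and the cutoff of 128 is an arbitrary assumption, not a value from the scan above.

```python
def threshold(image, cutoff=128):
    """Map every grey value to pure black (0) or pure white (255)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in image]


if __name__ == "__main__":
    faint_scan = [[200, 120, 90, 230]]  # hypothetical faint grey values
    print(threshold(faint_scan))
```

A cutoff that is too high thickens strokes and speckles, and one that is too low thins the text, which may be one reason the untouched scan sometimes OCRs better.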

I am gradually adding to my kubuntu 20.04 desktop. I think I will give Tesseract 5 a go and see if YAGF works with that version. There are PPAs for Tesseract.
#9
(10-02-2020, 08:38 AM)rich2005 Wrote: The ofn-remove-grid.py plugin works really well. [...]

IMHO you have been lucky, because ofn-remove-grid looks at areas, and the area of the single lines cannot be very big (at least not much bigger than other features). However, looking at your image, the sheer width of the feature could be a good criterion. I will consider this for the next version.
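
A rough sketch of what a width criterion could look like, on a grid of grey values. This is not code from ofn-remove-grid, only an illustration of the idea: any horizontal run of dark pixels at least min_width long is treated as a dividing line and painted white. The function name and parameters are hypothetical.

```python
def remove_long_horizontal_runs(image, min_width, dark=128, white=255):
    """Paint white every horizontal run of dark pixels at least min_width long."""
    result = [row[:] for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] < dark:
                start = x
                # Advance to the end of this run of dark pixels.
                while x < len(row) and row[x] < dark:
                    x += 1
                if x - start >= min_width:
                    for i in range(start, x):
                        result[y][i] = white
            else:
                x += 1
    return result
```

A skewed scan breaks a line into shorter runs on each pixel row, so this would probably only work after deskewing, and it also cuts through any letter that touches the line.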
#10
Hello everybody,
I apologize for my late reply. I read the first message and wanted to test before answering (and I was a long time ago in a far away galaxy...), and so now I can read a lot of interesting methods.
Thanks a lot!
@rich2005: it is the first page of a 65-page PDF file. The table lines are in almost exactly the same place. After OCR with Tesseract, I struggled in LibreOffice Calc with the " | ", which sometimes changed to other symbols ($, ]...).
@ofnuts: thanks for your time and work on the script. I don't know how to use scripts yet, but the result is so awesome that I must learn how.

I'm a newbie with this software and I need to learn the good hints that you gave.
Thanks again a lot. You rock :)
Cheers