Erase / remove borders from a table (scanned pdf) - Gimp-Forum.net (https://www.gimp-forum.net), Forum: General questions
Erase / remove borders from a table (scanned pdf) - odeonseb - 10-01-2020

Hello, I use GIMP 2.8.22 on Linux Mint 19. I discovered new software for OCR from a scanned pdf file: Scan2pdf and Tesseract. I would like to use OCR, but the scanned pdf file is a table with borders, and some data are missing. It seems to me that the borders could be removed / erased automatically in Gimp with the appropriate tools (thresholds, contrast, light...). Could you please help me do this? Cheers, Seb

RE: Erase / remove borders from a table (scanned pdf) - Krikor - 10-01-2020

I used the Eraser Tool only. Create a new layer, fill it with the same color as the document (white) and put it at the bottom of the layer stack as a background. With the Shift key held down it is possible to draw straight lines. As you erase the border lines, the resulting transparency is automatically filled by the white of the bottom layer. Example at https://i.imgur.com/tS1JbxX.png

RE: Erase / remove borders from a table (scanned pdf) - rich2005 - 10-01-2020

Is your example one page of a whole series? Do you want to batch process the PDFs in some way to remove the table lines? No easy way that I can think of unless each page has the table lines in exactly the same place, which is most unlikely. Removing lines is just as likely to ruin the text for OCR.

I usually scan straight into Gimp using the xsane plugin. Nothing wrong with scanning to PDF as long as you understand matching the ppi in the open pdf dialogue. Once in Gimp it is just another graphic.

As far as I know Tesseract does not retain any formatting other than new lines *, although there might be a page segmentation mode for blocks of text.

Your image is (nearly) A4 @ 300 ppi and OCRs very well without modification. The only correction needed is painting over the vertical line between the column of numbers and the text, otherwise it is OCR'd as a '|'. The other table lines are ignored.
This is the output I get, showing Gimp 2.8 and a front-end (gImageReader) for Tesseract (4.0.0-beta.1). Character recognition is good but formatting is lost.

[attachment=5009]

* Expanding on that:
tesseract howtoremoveborders.png out -c preserve_interword_spaces=1
is supposed to preserve formatting, but I can't get it to work. Or more probably I am using the wrong command. Another attempt, setting the segmentation mode with
tesseract howtoremoveborders.png out -psm 6
is also not working. Maybe I need to update tesseract.

RE: Erase / remove borders from a table (scanned pdf) - Ofnuts - 10-01-2020

A technique with one of my scripts: ofn-path-filter-strokes that you will find there.
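For readers who want to script the OCR step discussed above, here is a minimal Python sketch. It rests on assumptions: the helper names `build_tesseract_cmd` and `strip_stray_bars` are made up for illustration, and the `--psm` spelling (two dashes) is, as far as I know, what Tesseract 4 and later expect, while the single-dash `-psm` belongs to the 3.x series. That difference may or may not be why the second command above fails; it has not been checked against 4.0.0-beta.1. The second helper simply blanks out the '|' characters that a leftover vertical table line tends to produce.

```python
def build_tesseract_cmd(image, out_base, psm=6, keep_spaces=True):
    """Return an argument list for a Tesseract 4+ style command line.

    Hypothetical helper: it only assembles the arguments, it does not
    run anything.
    """
    cmd = ["tesseract", image, out_base, "--psm", str(psm)]
    if keep_spaces:
        # Ask Tesseract to keep runs of spaces between words.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def strip_stray_bars(text):
    """Replace '|' (OCR'd vertical table lines) with a space, line by line."""
    cleaned = []
    for line in text.splitlines():
        # A space instead of deletion keeps the column positions intact.
        cleaned.append(line.replace("|", " ").rstrip())
    return "\n".join(cleaned)


if __name__ == "__main__":
    cmd = build_tesseract_cmd("howtoremoveborders.png", "out")
    print(" ".join(cmd))
    # To actually run it (needs tesseract installed and on PATH):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
    print(strip_stray_bars("12 | Paris\n7 | Lyon"))
```

Replacing the bar with a space rather than deleting it is a deliberate choice: if the interword spacing was preserved, the columns stay aligned for a later import into a spreadsheet.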
RE: Erase / remove borders from a table (scanned pdf) - Ofnuts - 10-01-2020

OK, so I made the script. Still fairly experimental and not fully documented.
Before:
[attachment=5014]
After:
[attachment=5015]
Enjoy. Now trying to get some sleep before the roof flies off.

RE: Erase / remove borders from a table (scanned pdf) - Krikor - 10-02-2020

Very good! I had already removed the borders with the tips from the previous post using ofn-path-filter-strokes. Now, with the tips on how to calculate the threshold area, it just got even easier. Ofn-remove-grid worked well on my Gimp 2.10.21 win10 64. Thx ofnuts.

RE: Erase / remove borders from a table (scanned pdf) - Ofnuts - 10-02-2020

ofn-remove-grid now has its own thread. Made a last minor improvement: if there is a selection, its area is used as the area threshold. In other words, I automated the hint above: no need to explicitly use the histogram and copy values.

RE: Erase / remove borders from a table (scanned pdf) - rich2005 - 10-02-2020

The ofn-remove-grid.py plugin works really well. It answers the topic question on removing that table.

Using a more usual scan, greyscale 300 ppi:

[attachment=5016]

1. Scan straight into Gimp. Not great, uneven colour, slightly skewed. A typical scan.
2. Apply levels to get even colour.
3. A bit faint, so apply threshold to get more contrast.
4. Apply the plugin (grow selection = 1) and the dividing lines are gone. A few speckles.
5. Export that to a png and run through Tesseract - text is recognised but
6. Run the original scan through Tesseract and get a better result. Tesseract ignores the lines.

I am gradually adding to my kubuntu 20.04 desktop. I think I will give Tesseract 5 a go, see if YAGF works with that version. There are PPAs for Tesseract.

RE: Erase / remove borders from a table (scanned pdf) - Ofnuts - 10-02-2020

(10-02-2020, 08:38 AM)rich2005 Wrote: The ofn-remove-grid.py plugin works really well. It answers the topic question on removing that table. Using a more usual scan, greyscale 300 ppi

IMHO you have been lucky because ofn-remove-grid looks at areas, and the area of the single lines must not be very big (at least not much bigger than other features).
However, looking at your image, the sheer width of the feature could be a good criterion. I will consider this for the next version.

RE: Erase / remove borders from a table (scanned pdf) - odeonseb - 10-12-2020

Hello everybody, I apologize for my late reply. I read the first message and wanted to test before answering (and I was a long time ago in a far away galaxy...), and as a result I can now read about a lot of interesting methods. Thanks a lot!

@rich2005: it is the first page of a 65-page pdf file. The table lines are in almost exactly the same place. After OCR with Tesseract, I struggled in LibreOffice Calc with the " | ", which was sometimes replaced by other symbols ($, ]...).

@ofnuts: thanks for your time and work on the script. Currently I don't know how to use scripts, but the result is so awesome that I must learn how to. I'm a newbie with this software and I need to learn the good hints that you gave. Thanks again a lot. You rock! Cheers
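To illustrate the area-versus-width point Ofnuts makes above, here is a small standalone Python sketch. It is not how ofn-remove-grid is implemented (the plugin works inside GIMP, and its code is not shown in this thread). It only shows the general idea on a toy black-and-white grid: group connected dark pixels into features, then erase any feature whose pixel count or horizontal extent exceeds a limit. The function names and the `max_area` / `max_width` parameters are invented for this example.

```python
from collections import deque


def find_components(grid):
    """Yield a list of (row, col) for each 4-connected group of 1-pixels.

    Assumes a non-empty rectangular grid of 0/1 values.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield pixels


def remove_large_features(grid, max_area=None, max_width=None):
    """Return a copy of grid with features over either limit set to 0."""
    result = [row[:] for row in grid]
    for pixels in find_components(grid):
        xs = [x for _, x in pixels]
        width = max(xs) - min(xs) + 1
        too_big = max_area is not None and len(pixels) > max_area
        too_wide = max_width is not None and width > max_width
        if too_big or too_wide:
            for y, x in pixels:
                result[y][x] = 0
    return result


if __name__ == "__main__":
    page = [
        [1, 1, 1, 1, 1, 1],  # a thin table line: small area, large width
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 1, 0, 1],  # small "text" features
        [0, 0, 0, 0, 0, 0],
    ]
    for row in remove_large_features(page, max_width=4):
        print(row)
```

In the toy page, the thin line has an area of only 6 pixels, so an area limit of 6 leaves it in place, while a width limit of 4 removes it and keeps the small features. That is the situation described above: a single thin line may not stand out by area, but it does by width.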