Erase / remove borders from a table (scanned pdf)
#3
Is your example one page of a whole series? Do you want to batch process the PDFs in some way to remove the table lines? There is no easy way that I can think of unless each page has the table lines in exactly the same place, which is most unlikely. Removing lines is just as likely to ruin the text for OCR.

I usually scan straight into Gimp using the xsane plugin. There is nothing wrong with scanning to PDF as long as you understand matching the ppi in the open-PDF dialogue. Once in Gimp it is just another graphic. As far as I know Tesseract does not retain any formatting other than new lines *, although there might be a page segmentation mode for blocks of text. Your image is (nearly) A4 @ 300 ppi and OCRs very well without modification.
The only correction needed is painting over the vertical line between the column of numbers and the text, otherwise it is OCR'd as a '|'. The other table lines are ignored.
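For anyone who would rather script that one correction than paint it by hand, here is a minimal sketch in plain Python. It is not what I did in Gimp, and everything in it is an assumption for illustration: the function name paint_over_vertical_lines, the thresholds, and the idea that the image is already a grayscale grid of 0-255 values (a list of rows). It treats any column that is dark over most of the page height as a table line and paints it white. Real scans are rarely perfectly straight, so a skewed line would need deskewing first.

```python
def paint_over_vertical_lines(pixels, dark_below=128, min_coverage=0.8, white=255):
    """Return a copy of a grayscale image with long vertical lines painted white.

    pixels is a list of rows, each row a list of grey values (0 = black, 255 = white).
    A column counts as a line when at least min_coverage of its pixels are dark.
    All names and thresholds here are illustrative assumptions.
    """
    height = len(pixels)
    if height == 0:
        return []
    width = len(pixels[0])

    # Find columns that are dark along most of the image height.
    line_columns = [
        x for x in range(width)
        if sum(1 for row in pixels if row[x] < dark_below) / height >= min_coverage
    ]

    # Work on a copy so the original image data is left untouched.
    cleaned = [list(row) for row in pixels]
    for row in cleaned:
        for x in line_columns:
            row[x] = white
    return cleaned


# Tiny synthetic "page": a full-height dark line in column 2, one text pixel in column 4.
page = [[255] * 6 for _ in range(5)]
for row in page:
    row[2] = 0
page[1][4] = 0

cleaned_page = paint_over_vertical_lines(page)
```

Short strokes such as letters stay below the coverage threshold, so only the long vertical rule is erased. With a real scan you would load the pixel grid through an image library first, and text that touches the line would still lose a few pixels.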

This is the output I get, showing Gimp 2.8 and a front-end (gImageReader) for Tesseract (4.0.0-beta.1). Character recognition is good but the formatting is lost.

   

* Expanding on that:
tesseract howtoremoveborders.png out -c preserve_interword_spaces=1 is supposed to preserve formatting, but I can't get it to work. More probably I am using the wrong command ;)
Another attempt, tesseract howtoremoveborders.png out -psm 6 (a page segmentation mode), is also not working.
Maybe I need to update Tesseract.
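A possible reason for the second failure, offered as a guess rather than a diagnosis: as far as I know the Tesseract 4.x command line spells the page segmentation option --psm with two dashes, while -psm is the older 3.x spelling. The sketch below is a small Python wrapper that builds the command with both options together. The helper names tesseract_command and run_ocr are made up for illustration, and whether this combination actually keeps the table layout is untested here.

```python
import shutil
import subprocess


def tesseract_command(image, output_base, psm=None, preserve_spaces=False):
    """Build the argument list for a Tesseract run (hypothetical helper)."""
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        # Assumption: Tesseract 4.x spelling, two dashes.
        cmd += ["--psm", str(psm)]
    if preserve_spaces:
        # -c sets a config variable; this one is meant to keep runs of spaces.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


def run_ocr(image, output_base, **options):
    """Run Tesseract if it is installed; return True when it exits cleanly."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(tesseract_command(image, output_base, **options))
    return result.returncode == 0


command = tesseract_command(
    "howtoremoveborders.png", "out", psm=6, preserve_spaces=True
)
print(" ".join(command))
```

The printed line can be pasted straight into a terminal. Calling run_ocr("howtoremoveborders.png", "out", psm=6, preserve_spaces=True) would write out.txt next to the script, assuming Tesseract is on the PATH.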


Messages In This Thread
RE: Erase / remove borders from a table (scanned pdf) - by rich2005 - 10-01-2020, 01:33 PM
