A script solution requested: conversion from text-only pdf to pagewise image files
#1
I have a text-only pdf.

Can a GIMP batch script be designed to open one page at a time, in increasing page order, and save each page as a low-resolution image (say 50 dpi), named Filename-PageNumber with a chosen file extension, in a single colour, i.e., black on a transparent background?

Moreover, advice is also requested on
(1) Whether such files could be saved with .svg extension?
(2) The benefit of svg extension over bmp in this particular case please?

This question has been posted on both gimp-forum.net and gimpchat.com.

Please post all your suggestions on gimp-forum.net, as most of you are probably subscribed to both fora, to avoid duplication of effort.

Apologies for posting on both fora. It is only for wider visibility.

Best wishes
Reply
#2
Is the text real text (PDF produced as output of a word processor) or just a bitmap image of text (scanner output)? In the first case, don't use Gimp, use a real PDF editor (LibreOffice Writer can do things, IIRC).

"Low resolution" is just metadata; do you mean to also scale the image down?

If you have only one image to process, you can:
  • Load it in Gimp with the appropriate options (one layer per page, transparency, maybe even the right definition)
  • Scale it down & change the print definition (if not done in previous step)
  • Run a script to export all the layers individually (see for instance ofn-export-layers)
Reply
#3
(07-01-2023, 07:11 AM)Ofnuts Wrote: ...
Is the text real text (PDF produced as output of word processor ...

Of course, a multi-page Unicode text file with rich formatting.

(07-01-2023, 07:11 AM)Ofnuts Wrote: ... If you have only one image to process, ...

Fortunately Sir,though I am a woke from the perspective of a programmer, but I am not that woke! Does your nation have such wokes who could act similarly? I am doubtful if such persons could even use a computer, especially Linux systems, intelligibly.

(07-01-2023, 07:11 AM)Ofnuts Wrote: ... use a real PDF editor (LibreOffice Writer can do things, IIRC). ...
No Sir. LibreOffice Draw (not writer) can't faithfully edit a pdf file. The Layout is definitely altered. I need the pages of the pdf to stay as WYSIWYG as in a pdf viewer and remain unaltered.

I need to generate images of 50 dpi for each page of the pdf document. That only GIMP can do, manually, page-wise. I can do it. But highly monotonous. Import, set the dpi then convert each page and save them with a specific page number individually.
Reply
#4
Quote:Moreover, advice is also requested on
(1) Whether such files could be saved with .svg extension?

Remember Gimp is a bitmap (raster) editor. You cannot export an image in vector (SVG) format.

When Gimp opens a PDF, any vector 'objects' including text are rasterised. The image is now composed of pixels.

Quote:I need to generate images of 50 dpi for each page of the pdf document. That only GIMP can do, manually, page-wise. I can do it. But highly monotonous. Import, set the dpi then convert each page and save them with a specific page number individually.

For a faster method use ImageMagick (IM). As with Gimp, the pages are rasterised. You are using Linux, so IM might already be installed.

For a single pdf say, imagename.pdf

Code:
convert -density 50 imagename.pdf imagename.jpg

produces a set of files imagename-0.jpg imagename-1.jpg ....imagename-n.jpg
All are at 50 ppi with correspondingly smaller pixel dimensions; an A4 page becomes 413x585 pixels. You can of course render to formats other than JPEG by changing the extension in the command line.
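The 413x585 figure follows from the page size in inches multiplied by the density. A minimal sketch of that arithmetic, assuming an A4 page of 210 x 297 mm and simple rounding to the nearest pixel (the helper name page_pixels is made up for illustration, and IM's own rounding may differ by a pixel):

```shell
# Pixel dimensions of a page rendered at a given density (pixels per inch).
# Arguments: width_mm height_mm ppi
# 25.4 mm = 1 inch; adding 0.5 before truncation rounds to the nearest pixel.
page_pixels() {
    awk -v w="$1" -v h="$2" -v ppi="$3" \
        'BEGIN { printf "%dx%d\n", w / 25.4 * ppi + 0.5, h / 25.4 * ppi + 0.5 }'
}

page_pixels 210 297 50   # A4 at 50 ppi
```

The same helper can be used to estimate the output size for other densities before rendering a long document.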

EDIT: If the PDF is text only, as the subject line says, you can get a smaller file size using grayscale rendering. Just a slight addition to the IM command.
Code:
convert -density 50 imagename.pdf -colorspace Gray imagename-gray.jpg

No difference with what you see when the jpeg is opened in an image viewer. These are low quality images (50 ppi), do not expect anything wonderful but depending on the original text size, should be legible.
Reply
#5
Thank you, Mr. rich2005 for replying to my email. You remain my emancipator.
(07-01-2023, 08:25 AM)rich2005 Wrote:
Quote:Moreover, advice is also requested on
(1) Whether such files could be saved with .svg extension?

Remember Gimp is a bitmap (raster) editor. You cannot export an image in vector (SVG) format.

When Gimp opens a PDF, any vector 'objects' including text are rasterised. The image is now composed of pixels. ...
Yes, my bad. I lost my rational senses. True. How can a raster system create a .svg file format? So how could I post that request. I apologise.


(07-01-2023, 08:25 AM)rich2005 Wrote: ... Get a smaller file size using grayscale rendering. Just a slight addition to the IM command.
Code:
convert -density 50 imagename.pdf -colorspace Gray imagename-gray.jpg

No difference with what you see when the jpeg is opened in an image viewer. These are low quality images (50 ppi), do not expect anything wonderful but depending on the original text size, should be legible. ...
As usual you whetted my appetite. You made me find that ImageMagick or img2pdf (a separate tool, not part of poppler-utils) could be used to join those 50 downgraded images into a PDF file, each image occupying one page.
Either,
Code:
convert image*.jpg output.pdf
Or
Code:
img2pdf *.jpg -o output.pdf
Reply
#6
OK, some comparisons which might affect your workflow.

A 3-page PDF produced by the DTP program Scribus. Scribus makes good PDFs. This one is all text, in various fonts, and comes with the font metrics embedded. The size is 196 kb.

Note: for text PDF's without embedded fonts, expect the non-installed font replaced by the system default font. That is usually the cause of your scrambled LibreOffice imported PDF.

You can check using your PDF viewer as shown (evince or maybe okular).

   

I do use IM to join several raster images and put into a PDF wrapper. It makes smaller PDF's than Gimp and a PDF file size slightly larger than the combined size of the jpg files.  Using the grayscale jpgs for the 3page example 323 kb - however worse quality and larger than the 'pure' PDF.

   

Bottom line, just saying, be aware of what you are converting, be it with Gimp or IM.
Reply
#7
As always, Mr. rich2005, you never expect a thank you, but having observed the world from a society limited by a 1000 years of oppressions and exploitations, my 'thank you' comes automatically. Dear moderators, this is not 'politics', but history. Look at Marseilles today and you will have an idea. So please, if needed, delete the contentious sentences, but please spare the rest post.

Now to your post:
(07-01-2023, 01:18 PM)rich2005 Wrote: OK, some comparisons which might affect your workflow.

A 3page pdf produced by the DTP program Scribus. Scribus makes good pdf's This one is all text, various fonts and comes with the font metrics embedded. The size is 196 kb ...

Yes, but my advantage is that my file was created with LibreOffice Writer and I embedded the font with my LibreOffice text file. So the created pdf file also has the fonts embedded by default.

But before we proceed further, there is another issue that needs to be addressed. With the code line:
Code:
convert -density 92 imagename.pdf -colorspace Gray imagename-gray.jpg

The image files that are created begin from 1,2,3, ...,11, 12, 13 and so on. Not 001, 002, 003, ..., 011, 012, 013, ...

So what happens is that after 1 come 10, 11, 12, ..., 19, then 2, 20, 21, and so on, when I use:
Code:
convert image*.jpg output.pdf
to rejoin the image files into one pdf file. In other words, the ordering is not as intended.

So, if the image creation could be modulated for imagemagick to create files from 000 and count up to 999, this problem could be addressed.
Reply
#8
For IM try

Code:
convert  -density 50 name.pdf -colorspace Gray  name-gray-%03d.jpg
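The %03d in the output name is a printf-style pattern, so each page number is zero-padded to three digits and a plain *.jpg glob then sorts in page order. A small sketch of both ideas in plain shell follows; the helper names pad3 and natural_order are made up for illustration, and natural_order assumes GNU sort (for -V) and file names without spaces or newlines:

```shell
# Zero-pad a page number to three digits, as %03d does in the output pattern.
# Pass a plain decimal number without leading zeros.
pad3() {
    printf '%03d\n' "$1"
}

# List file names in natural (numeric-aware) order, one per line.
# Useful for files that were already written without padding.
natural_order() {
    printf '%s\n' "$@" | sort -V
}

pad3 7
natural_order name-gray-1.jpg name-gray-10.jpg name-gray-2.jpg
```

For pages that already exist with unpadded numbers, something like convert $(natural_order image-*.jpg) output.pdf should join them in page order without re-rendering, under the same no-spaces assumption.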
Reply
#9
(07-01-2023, 03:04 PM)rich2005 Wrote: For IM try ...
I marvel at your level of expertise!
How did you accomplish this extraordinary height of excellence? Simply reading the man page of imagemagick and experimenting? Or skimming the binary for keywords and the way how they interconnect? How could I not ask to such a percipient individual!
One last question: if I want to use Navy Blue rather than Black, would I have to use gimp to convert the black into Navy Blue, or could this be accomplished by imagemagick itself?
Reply
#10
I am trying for a preliminary solution with a transparent background:
Code:
convert -density 100 -quality 10 -fill "navy" textfile.pdf imagefile%03d.jpeg
To have a white background:
Code:
convert -density 100 -quality 10  -background "white" -fill "navy" -alpha remove textfile.pdf imagefile%03d.jpeg

I will do an extensive check and report back. Thank you,Mr. rich2005. You rock!
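Two things are worth checking in the commands above. JPEG cannot store transparency, so a transparent background needs a format such as PNG. Also, -fill only sets a colour for later operators and does not recolour existing pixels by itself. One possible variant for the white-background case is sketched below; it is untested against this particular PDF. It assumes ImageMagick 6 syntax (with ImageMagick 7, magick replaces convert) and uses +level-colors to map black to navy and white to white. The helper name pdf_to_navy_jpegs and the DRY_RUN switch are made up for illustration.

```shell
# Hypothetical helper: render every page of a PDF as a navy-on-white JPEG.
# Arguments: input.pdf output-prefix
# Set DRY_RUN=1 to print the command instead of running it.
pdf_to_navy_jpegs() {
    pdf=$1
    prefix=$2
    # -density is a read setting, so it goes before the input file;
    # the operators that change pixels come after the pages are loaded.
    set -- convert -density 100 "$pdf" \
        -background white -alpha remove -alpha off \
        +level-colors navy,white \
        -quality 10 "${prefix}%03d.jpeg"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling pdf_to_navy_jpegs textfile.pdf imagefile would write imagefile000.jpeg, imagefile001.jpeg and so on. For navy text on a transparent background, writing PNG files and dropping the -alpha options would be the place to start.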
Reply

