PDF to TXT
#1
Question 
In short, I scan a document to PDF and then try to edit it as text.

My problem is that I can only do this with Microsoft Word, not with LibreOffice Writer or OpenOffice Writer.

With LibreOffice the file opens but does not allow me to edit it.

With OpenOffice version 4.1.7 (currently the latest), editing is allowed, but the text appears encoded. I've tried several Character Set options, but I always get a page of ASCII codes.
[Image: N8duLGJ.png]
I would rather not depend on Word.

Does anyone know how to use either LibreOffice or OpenOffice to obtain the scanned PDF result in an editable text file?

I thank you for your help.
#2
(10-01-2020, 06:05 PM)Krikor Wrote: In short, I scan a PDF document and then try to edit it as text.

My problem is that I can only do this using Microsoft Word, not with Libre Office Writer or OpenOffice Writer.

With LibreOffice the file opens but does not allow me to edit it.

With OpenOffice version 4.1.7 (currently the most current), editing is allowed, but the text appears encoded. I've tried several Character Set options, but a page with ascii codes always appears.
[Image: N8duLGJ.png]
I didn't want to depend on having to use Word.

Does anyone know how to use either LibreOffice or OpenOffice to obtain the scanned PDF result in an editable text file?

I thank you for your help.

The "JFIF" at the beginning hints that there is a JPEG.
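That hint can be checked by looking at the first bytes of the file. A minimal sketch; the function name `describe_header` is made up for illustration, and it only recognises the two signatures relevant here:

```python
def describe_header(data: bytes) -> str:
    """Guess the file type from its leading bytes (magic numbers)."""
    # A PDF file starts with the ASCII text "%PDF-" followed by a version.
    if data.startswith(b"%PDF-"):
        return "pdf"
    # A JPEG starts with the SOI marker FF D8, followed by another FF marker.
    if data[:3] == b"\xff\xd8\xff":
        # In a JFIF file the first segment carries the ASCII tag "JFIF" at offset 6.
        if data[6:10] == b"JFIF":
            return "jpeg (JFIF)"
        return "jpeg"
    return "unknown"
```

To try it, read the first 16 bytes of the scanned file in binary mode and pass them to `describe_header`.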

That said, the right LibreOffice app for opening a PDF is Draw, not Writer. It can locate text, but it treats each line as an independent text object, so don't expect paragraphs to reflow.

Now, if you scan documents, the scanner output is an image, not text, so your PDF is just a bunch of JPEGs, and MS Office is doing some OCR. In that case you need an OCR program, which in the FOSS world is usually Tesseract.
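As a rough sketch of what that looks like from a script: Tesseract does not read PDFs directly, so the pages first have to be turned into images. The example below assumes the `pdftoppm` tool (from Poppler) and the `tesseract` command-line program are both installed and on the PATH; the function names and the `page` file prefix are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path: str, prefix: str, dpi: int = 300) -> list[str]:
    """Command that renders each PDF page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> list[str]:
    """Command that OCRs one image and writes <output_base>.txt."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_scanned_pdf(pdf_path: str, work_dir: str, lang: str = "eng") -> str:
    """Render every page to PNG, OCR each one, and return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, str(work / "page")), check=True)
    pages = []
    # pdftoppm zero-pads page numbers, so sorting by name keeps page order.
    for image in sorted(work.glob("page-*.png")):
        base = image.with_suffix("")  # e.g. work/page-1
        subprocess.run(tesseract_command(str(image), str(base), lang), check=True)
        pages.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

The language code passed with `-l` has to match an installed Tesseract language pack, for example `eng` for English or `por` for Portuguese.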
#3
(10-01-2020, 06:43 PM)Ofnuts Wrote: The "JFIF" at the beginning hints that there is a JPEG.

This said the right LibreOffice app to open PDF is Draw, not Writer. It can locate text parts, but considers each line as an independent text part, so don't expect to reflow paragraphs.

Now, If you scan documents, the scanner output is an image, not text, so your PDF is just a bunch of JPEGs, and MS-Office is doing some OCR. In which case you need an OCR program, which in the FOSS world is usually Tesseract.

In fact, several times the file opened directly in Draw, even though I selected Writer. But it edits the page as an image, not as text.
How could I be generating an image?
I configured the scanner to generate a .pdf
[Image: bIXhKVf.jpg]


Quote: Now, If you scan documents, the scanner output is an image, not text, so your PDF is just a bunch of JPEGs, and MS-Office is doing some OCR. In which case you need an OCR program, which in the FOSS world is usually Tesseract.

Hmm, OK...
I remember trying to use Tesseract before, but I couldn't get it to work; either I didn't know how or there was some other problem. I'll download it and try again.

Thx Ofnuts!
#4
(10-01-2020, 06:58 PM)Krikor Wrote: In fact, several times the file opened directly in Draw, even though I selected Writer. But he edits as an image, not as text.
How could I be generating an image?
I configured the scanner to generate a .pdf
A scanner only produces images. PDF is a format for text and images; if the scanner creates a PDF, it is just a PDF with images in it. The scanner has no OCR capability by itself.
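One way to confirm that a given PDF holds only images is to ask a text extractor for its text and see whether anything comes back. A small sketch, assuming the `pdftotext` tool (from Poppler) is installed and on the PATH; the helper names are hypothetical:

```python
import subprocess


def text_is_empty(extracted: str) -> bool:
    """True if the extracted text holds only whitespace or page breaks."""
    # str.strip() also removes the form feed characters used as page separators.
    return extracted.strip() == ""


def pdf_has_text_layer(pdf_path: str) -> bool:
    """Run pdftotext and report whether the PDF contains any real text."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to standard output
        capture_output=True,
        text=True,
        check=True,
    )
    return not text_is_empty(result.stdout)
```

If `pdf_has_text_layer` returns False for a scan, the file needs OCR before any word processor can edit it as text.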
#5
A scanner has no ability to recognise text. It scans an image and then saves it in whatever format you ask for. A scanned PDF is just the scanner output embedded inside a PDF, which in your case is an image. You will now need Optical Character Recognition (OCR) software such as OmniPage or ABBYY FineReader. There are a few online services too.
#6
I downloaded gImageReader and am trying it with Tesseract.
Thanks to this video, https://youtu.be/GMAZtpWQF0U (Extracting text from images with gImageReader and Tesseract OCR on Windows), I was able to download and install the software and take the first steps with it.
The OCR conversion is good, but the result lacks formatting.
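On the formatting point, the Tesseract command-line program can write formats other than plain text, and some of them keep layout information. This is only a sketch, assuming a standalone `tesseract` executable is on the PATH (not necessarily the copy used by gImageReader); the helper name is made up:

```python
def tesseract_layout_command(
    image_path: str,
    output_base: str,
    output_format: str = "pdf",
    lang: str = "eng",
) -> list[str]:
    """Build a tesseract command for one of its standard output formats."""
    # txt:  plain text, no layout
    # pdf:  searchable PDF that keeps the scanned page image
    # hocr: HTML with the position of every recognised word
    # tsv:  table of words with coordinates and confidence values
    allowed = {"txt", "pdf", "hocr", "tsv"}
    if output_format not in allowed:
        raise ValueError(f"unsupported output format: {output_format}")
    return ["tesseract", image_path, output_base, "-l", lang, output_format]
```

Passing the resulting list to `subprocess.run(..., check=True)` would write a file such as `page.pdf` or `page.hocr` next to the input, depending on the chosen format.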

I don't know whether the version of Tesseract used by this application is the latest, or whether a newer version can be downloaded and used with it, the way we use plugins in GIMP.
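For a standalone Tesseract install, the version can be read from `tesseract --version`. A sketch, assuming the executable is on the PATH (the copy bundled with gImageReader may not be) and that the first line of output looks like `tesseract 4.1.1` or `tesseract v5.0.0-alpha...`; the function names are hypothetical:

```python
import re
import subprocess
from typing import Optional


def parse_tesseract_version(output: str) -> Optional[str]:
    """Pull the numeric version out of `tesseract --version` output."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)+)", output)
    return match.group(1) if match else None


def installed_tesseract_version() -> Optional[str]:
    """Ask the tesseract executable on the PATH for its version."""
    result = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the version banner to stderr, so check both streams.
    return parse_tesseract_version(result.stdout + result.stderr)
```

Comparing that number with the latest release listed on the Tesseract project page shows whether the install is current.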

I found good OCR apps for cell phones, but I would have to build some kind of tripod to hold the phone's camera steady if I ever need to process a large number of pages.
The advantage of the phone OCR apps is better text formatting than I have found so far in gImageReader.

