Do you have a PDF document that you want to extract all the text from? Or scanned document images that you want to convert to editable text? These are some of the most common problems I have run into at work when dealing with files.
In this article, I will go over several different ways to extract text from a PDF file or an image. The results will vary depending on the type and quality of the text in the PDF file or image. They will also depend on which tool you use, so it’s best to try several of the options below and see which one gives the best results.
Extract text from image or PDF
The easiest and quickest way to get started is to try an online PDF text extraction service. They are usually free and can give you exactly what you are looking for without having to install anything on your computer. Here are two that I have used with very good to excellent results:
ExtractPDF
ExtractPDF is a free tool to extract images, text and fonts from a PDF file. The only limitation is that the maximum PDF file size is 10 MB. That’s a little low, so if you have a larger file, try one of the other methods below. Select the file and click the Send File button. Results are usually very fast, and you should see a preview of the text when you click on the Text tab.
Another nice side benefit is that it also extracts the images from the PDF file, in case you need them! Overall, the online tool works great, but I came across a couple of PDFs that gave me a funny result. The text was extracted fine, but for some reason there was a line break after each word! That’s not a big problem for a short PDF file, but it definitely is for files with a lot of text. If this happens to you, try the next tool.
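If you would rather clean up text you have already extracted, the stray line breaks can also be removed with a few lines of Python. This is only a minimal sketch, not something any of these tools provide: the function name `rejoin_words` is made up for this example, and it assumes that paragraphs in the extracted text are still separated by a blank line. If the extraction lost the blank lines too, everything ends up in one paragraph.

```python
import re

def rejoin_words(text: str) -> str:
    """Turn single line breaks into spaces; keep blank lines as paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a blank line (two or more newlines) marks a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse every run of whitespace to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Skip paragraphs that are empty after cleaning.
    return "\n\n".join(p for p in cleaned if p)

if __name__ == "__main__":
    broken = "The\nquick\nbrown\nfox\n\nNext\nparagraph\n"
    print(rejoin_words(broken))
```

Paste the extracted text into a file, read it in, and run it through the function before copying it anywhere else.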
Online OCR
Online OCR usually works on the documents that ExtractPDF did not convert correctly, so it’s a good idea to try both services and see which one gives you the better output. Online OCR also has some nice features that are useful for anyone with a large PDF file who only needs to convert the text on a few pages rather than the entire document.
The first thing you need to do is create a free account. It’s a little annoying, but if you don’t create a free account, it only converts part of your PDF, not the entire document. Also, instead of being limited to a 5 MB upload, you can upload up to 100 MB per file with an account.
First select a language and then select the type of output formats you want to use for the converted file. You have several options, and you can choose more than one if you like. In the Multi-Page Document section, you can select Page Numbers and then select only the pages you want to convert. Then you select the file and click “Convert”!
After the conversion, you will be taken to the “Documents” section (if you are logged in), where you can see how many free pages you have left and links to download the converted files. You only seem to have 25 free pages per day, so if you need more, you’ll either have to wait a bit or buy more pages.
Online OCR did a great job converting my PDFs because it was able to preserve the actual layout of the text. In my test, I took a Word document that used bullets, different font sizes, etc., and converted it to PDF. Then I used Online OCR to convert it back to Word format, and it was about 95% the same as the original. I was impressed.
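The 95% figure above is just my impression from comparing the two documents by eye. If you want a rough number for your own round-trip test, one option is to save both versions as plain text and compare them with Python’s standard `difflib` module. This is a sketch under a few assumptions: the function name `similarity` is made up for this example, it compares words only, and it says nothing about layout, bullets or font sizes.

```python
from difflib import SequenceMatcher

def similarity(original: str, converted: str) -> float:
    """Return a ratio from 0.0 to 1.0 describing how closely two texts match."""
    # Compare word by word so differences in spacing and line breaks are ignored.
    a = original.split()
    b = converted.split()
    # autojunk=False turns off difflib's heuristic for long, repetitive input.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

if __name__ == "__main__":
    before = "Bullets and different font sizes were used in this document"
    after = "Bullets and different font sizes were used in this docurnent"
    print(f"{similarity(before, after):.0%} similar")
```

A single misread word in a short sentence moves the number a lot, so this is more useful on whole pages than on single lines.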
Plus, if you want to convert an image to text, Online OCR can do it as easily as extracting text from PDF files.
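To recap the upload limits quoted so far, here is a small Python sketch that tells you which of the two services should accept a file of a given size. The limits are simply the ones mentioned in this article and may have changed since; the function names are made up for this example, and treating 1 MB as 1,048,576 bytes is an assumption about how the sites count.

```python
import os

MB = 1024 * 1024  # assumption: the services count 1 MB as 1,048,576 bytes

# Upload limits as quoted in this article; the services may have changed them.
LIMITS = {
    "ExtractPDF": 10 * MB,
    "Online OCR (no account)": 5 * MB,
    "Online OCR (free account)": 100 * MB,
}

def services_for_size(size_bytes: int) -> list[str]:
    """List the services whose quoted limit is at least size_bytes."""
    return [name for name, limit in LIMITS.items() if size_bytes <= limit]

def services_for_file(path: str) -> list[str]:
    """Same check, using the size of a file on disk."""
    return services_for_size(os.path.getsize(path))

if __name__ == "__main__":
    # A 7 MB file is too big for Online OCR without an account.
    print(services_for_size(7 * MB))
```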
Free Online OCR
While we’re on the subject of converting images to text with OCR, let me mention another good website that works really well with images. Free Online OCR was very good and very accurate when extracting text from my test images. I took a couple of photos with my iPhone of pages from books, brochures, etc., and was surprised at how well it was able to convert the text.
Select the file and click the Upload button. The next screen has several options and an image preview. You can crop the image if you don’t want everything in it to be recognized. Then just click the OCR button and the converted text will appear below the image preview. It also has no restrictions, which is very nice.
In addition to online services, there are two free PDF converters that I want to mention if you need software running locally on your computer to perform conversions. With online services, you always need an internet connection, and this may not be available to everyone. However, I noticed that the conversion quality of the freeware programs was significantly worse than that of the websites.
A-PDF Text Extractor
A-PDF Text Extractor is a free program that does a pretty good job of extracting text from PDF files. After downloading and installing, click the “Open” button to select the PDF file. Then click on Extract Text to start the process.
You will be prompted for a location to store the output text file, and then the extraction will start. You can also click the Options button, which allows you to select only specific pages to extract and the type of extraction. The second option is the interesting one because it extracts the text in different layouts, and it’s worth trying all three to see which one gives the best result.
PDF2Text Pilot
PDF2Text Pilot does a passable job of extracting text. It has no options; you just add files or folders, convert and hope for the best. It worked well with some PDFs, but most of them had a lot of problems.
Just click on Add Files and then Convert. After the conversion is complete, click Browse to open the file. Your mileage will vary when using this program, so don’t expect too much.
It is also worth mentioning that if you work in a corporate environment or can get a copy of Adobe Acrobat from work, you can get much better results. Acrobat is obviously not free, but it can convert PDF to Word, Excel, and HTML format. It also does the best job of preserving the structure of the original document and converting complex text.