Sometimes you will find PDF files whose text is heavily surrounded by images. In my case, I usually see this pattern in PDF ebooks about photography. While some images provide useful visual explanations, others are only there to make the content look more appealing. Some pages even place the text inside a large image even though there is no real need for it. For pages like these, you might want to extract the text from the image. If you don’t feel like installing anything just for the extraction, you can use Google Docs as an alternative. If you already have a Gmail account, you can access Google Docs directly.
How to extract text from an image or PDF file with Google Docs
- We use the upload tool for the extraction. Click the upload button and select “Settings”. Make sure that both “Convert uploaded files to Google Docs format” and “Convert text from uploaded PDF and image files” are checked.
- Click the upload button again and select “Files..”. Select the file and click the Open button to start uploading it to Google Docs.
- The upload settings will be displayed again before the upload and conversion start. Uncheck “Confirm settings before each upload” if you don’t want to see this window every time you upload a file. Click the start upload button and the conversion will begin immediately.
- Once the file is converted, click the file name in the “upload complete” section that appears in the bottom right corner. The file will open in a new tab.
Note that you will get the original files as well as the converted ones. The originals will be on the first pages, followed by the extracted text.
Here is the result I got after converting an image. The first image is the original and the second is the extracted text.
- Instead of converting the whole PDF document, you can first convert the pages you want into images using image converter software, then upload those images to Google Docs for conversion. This way you extract text only from specific PDF pages. Google Docs isn’t a dedicated OCR tool, so it has no option for choosing which pages of a PDF file to convert. Keep in mind that the file size is limited to around 2MB.
- You can select multiple images or PDF files in a single upload to save time. The files will still be processed one by one.
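The first tip leaves the choice of image converter open. As one possible sketch, assuming the Poppler command-line tool `pdftoppm` is installed (the article does not name a specific converter), the Python snippet below builds and runs a command that renders only a chosen page range to PNG files. The function names, the file names in the usage note, and the 150 DPI default are hypothetical and chosen only for illustration.

```python
import subprocess


def build_pdftoppm_command(pdf_path, first_page, last_page, out_prefix, dpi=150):
    """Build a pdftoppm command that renders pages first_page..last_page as PNG."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return [
        "pdftoppm",
        "-png",                  # write PNG images
        "-r", str(dpi),          # resolution in DPI (assumed default, not from the article)
        "-f", str(first_page),   # first page to render
        "-l", str(last_page),    # last page to render
        pdf_path,
        out_prefix,              # output file names start with this prefix
    ]


def render_pages(pdf_path, first_page, last_page, out_prefix, dpi=150):
    """Run pdftoppm; raises CalledProcessError if the tool reports a failure."""
    cmd = build_pdftoppm_command(pdf_path, first_page, last_page, out_prefix, dpi)
    subprocess.run(cmd, check=True)
```

For example, `render_pages("ebook.pdf", 12, 14, "page")` would write one PNG per page for pages 12 to 14, ready to upload. A lower DPI produces smaller files, which helps with the size limit, though very low values may make the text harder to recognize.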
Because of its file size limit, you cannot use Google Docs to extract text from an image or PDF file larger than 2MB. If your file exceeds the limit, you can look for other online tools or use dedicated OCR software.
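To avoid failed uploads, you could check file sizes before uploading. The sketch below is a minimal Python example that splits a list of files into those within the limit and those over it. It assumes the limit is exactly 2 × 1024 × 1024 bytes, whereas the limit is only known to be around 2MB, so treat the constant as an approximation. The function and constant names are hypothetical.

```python
import os

# Assumption: "around 2MB" is treated as 2 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 2 * 1024 * 1024


def split_by_size(paths, limit=MAX_UPLOAD_BYTES):
    """Return (within_limit, too_large) lists of paths based on file size."""
    within_limit, too_large = [], []
    for path in paths:
        if os.path.getsize(path) <= limit:
            within_limit.append(path)
        else:
            too_large.append(path)
    return within_limit, too_large
```

Files that land in the second list are candidates for another online tool or dedicated OCR software, or for being split into smaller page images first.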