
Ever tried extracting data from PDFs? Copying plain text out of a PDF is easy enough, but extracting tables is far more complicated and cumbersome!


Organisational workflows today largely involve the exchange of PDF documents. And most data-rich business documents present complex information in tables.

You can find tables in financial documents such as invoices, receipts, insurance documents, bills of lading, bank statements and reports.

Businesses often look for solutions to convert tabular data stored in such PDFs into editable tables.

The manual approach of copy-pasting rarely preserves the table structure. Rows and columns collapse into runs of text, and a lot of verification and reformatting is needed to restore the data to its original organized form.
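To give a feel for that reformatting work, here is a minimal Python sketch. It assumes the pasted lines still separate columns with runs of two or more spaces, which is common but by no means guaranteed; the sample text and the function name `split_pasted_rows` are made up for illustration.

```python
import re


def split_pasted_rows(text, min_gap=2):
    """Split each non-empty line into cells wherever min_gap or more spaces occur."""
    gap = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows


# Hypothetical text as it might look after copy-pasting a small table.
pasted = """
Item          Qty   Unit price
Blue widget   2     4.50
Red widget    10    3.25
"""

rows = split_pasted_rows(pasted)
for row in rows:
    print(row)
```

The heuristic falls apart as soon as a cell is empty or the paste collapses the gaps to single spaces, which is exactly why manual verification tends to be unavoidable with this approach.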

Fortunately, there are various tools, like Nanonets, that can extract tables from PDF documents efficiently.


While they all perform the same function, these tools use fundamentally different techniques that have their own pros and cons.

In this article, we will review various solutions to extract tables from PDFs and compare their pros and cons to select the best fit for specific use cases.

Want to extract tabular data from invoices, receipts or any other type of document? Check out Nanonets’ PDF table extractor to extract tabular data. Schedule a demo to learn more about Nanonets’ table extraction feature.

Here are some of the most popular solutions to extract data from PDFs to tables:

1. Online PDF to Excel converters: basic extraction
2. Tabula: works best on simple tables
3. Camelot or Excalibur: customisable table extraction
4. PDFTables: secure & scalable table extraction API
5. Docparser: cloud-based table parser
6. Nanonets: no-code automated table extraction


Online PDF to Excel converters

Online PDF to Excel converters like smallpdf and cometdocs among others offer the most basic PDF table extraction capabilities.

These simple utility tools are free to use, but might require a mandatory sign up. Just upload a PDF and download the output.

Unlike the more advanced alternatives below, such tools typically convert the entire PDF to XML or CSV files. This often results in jumbled output that needs a fair amount of editing and clean-up.
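As a rough idea of what that clean-up can involve, the Python sketch below tidies a converter's CSV output using only the standard library. The sample data and the function name `clean_csv` are hypothetical, and real converter output may need quite different fixes.

```python
import csv
import io


def clean_csv(raw_csv):
    """Strip whitespace, drop empty rows, and drop columns that are empty in every row."""
    rows = [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(raw_csv))]
    rows = [row for row in rows if any(row)]
    if not rows:
        return []
    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]
    keep = [i for i in range(width) if any(row[i] for row in rows)]
    return [[row[i] for i in keep] for row in rows]


# Hypothetical converter output: a stray empty column, an empty row,
# padded cells and a trailing comma.
raw = (
    "Date,,Description,Amount\n"
    ",,,\n"
    " 2021-04-01 ,,Opening balance, 100.00\n"
    "2021-04-02,,Card payment,-20.00,\n"
)

cleaned = clean_csv(raw)
for row in cleaned:
    print(row)
```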


Pros

  • Simple drag-and-drop interface.

Cons

  • Can’t handle PDF files with complex table structures.
  • Doesn’t support batch processing. You can only work on one document at a time!
  • Sometimes characters or numbers aren’t identified correctly.
  • Limited use.
  • Not an automated process.
  • Can’t be customized.

Need an AI-based online OCR to convert PDF to XML or PDF to database entries, extract data from PDF, extract text from image, or extract text from PDF? Schedule a demo to learn more about Nanonets.


Tabula

Tabula is open-source software built on the tabula-java library and can be downloaded onto Mac, Linux or Windows PCs. Created by a group of journalists, Tabula seeks to “liberate data tables locked inside PDF files”.

Upload a PDF file to Tabula, select a table by drawing a box around it, preview the selected rows and columns, and export the verified table. Tabula works best on small, simple tables.


Pros

  • Tabula works wonderfully on PDF files that are predominantly text-based.
  • It is easy to use, robust and can be embedded into other software.

Cons

  • Tabula only works on text-based PDFs, not scanned images or documents.
  • It often gets tripped up by multi-line or merged cells.
  • Doesn’t support batch processing. You can only work on one document at a time!
  • Sometimes characters or numbers aren’t identified correctly.
  • Can’t support OCR requirements.
  • Not an automated process.
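The box-selection step can be pictured as a filter over positioned text. The Python sketch below is not Tabula's algorithm. It simply assumes that every word in a text-based PDF comes with x/y coordinates (the `Word` type and the sample data are hypothetical), keeps the words inside a selected rectangle, and groups them into rows.

```python
from collections import namedtuple

# Hypothetical representation of a word and its top-left position in points.
Word = namedtuple("Word", "text x y")


def words_in_box(words, left, top, right, bottom):
    """Keep only the words whose position falls inside the selected rectangle."""
    return [w for w in words if left <= w.x <= right and top <= w.y <= bottom]


def group_rows(words, tolerance=2.0):
    """Group words whose y positions lie within `tolerance` of a row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - w.y) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Order each row left to right and keep only the text.
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


words = [
    Word("Quarterly report", 50, 40),
    Word("Region", 50, 100),
    Word("Sales", 200, 100.5),
    Word("North", 50, 120),
    Word("1200", 200, 120),
    Word("Page 1", 50, 700),
]

selected = words_in_box(words, left=40, top=90, right=300, bottom=130)
table = group_rows(selected)
print(table)
```

A scanned page has no word positions to filter in the first place, which is one way to see why tools in this category need text-based PDFs.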

Camelot or Excalibur

Licensed under the MIT License, Camelot is a Python library that enables table extraction from PDFs. It also powers Excalibur, a web interface to extract tabular data from PDF documents.

Unlike other libraries, which tend to either produce accurate output or fail completely, Camelot gives you the power to customize table extraction extensively to get the best results.


Pros

  • Auto-detects tables.
  • Camelot works very well on text-based PDF files.
  • Flexible & customizable to a large extent.
  • Exports tables to multiple formats like CSV, Excel, JSON, HTML & SQLite.
  • Bad tables can be automatically discarded based on metrics like accuracy and whitespace.
  • Each table can be converted to a pandas DataFrame which can be used for further analysis or processing.

Cons

  • Camelot only works on text-based PDFs, not scanned images or documents.
  • Can’t handle complex PDF documents with multi-line tables and merged cells.
  • When using the Stream flavor, the whole page is treated as a single table. This affects the output when there are multiple tables on the same page.
  • Can’t support OCR requirements.
  • Not an automated process.
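Discarding bad tables by accuracy and whitespace can be sketched in a few lines of Python. Camelot attaches a `parsing_report` dictionary to each extracted table, and the filter below reads its `accuracy` and `whitespace` entries. The thresholds, the helper names and the stand-in table objects are assumptions chosen for illustration, so the filter can be tried without a PDF at hand.

```python
from types import SimpleNamespace


def read_tables(pdf_path, pages="1", flavor="lattice"):
    """Run Camelot on a text-based PDF (needs the camelot-py package installed)."""
    import camelot  # imported lazily so the filter below works without it

    return camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing report passes both (arbitrary) thresholds."""
    good = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good


# Stand-in objects that mimic the shape of a Camelot parsing report.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.2, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]

good = keep_good_tables(fake_tables)
print(len(good))

# With a real file (the file name is hypothetical):
#   for table in keep_good_tables(read_tables("invoice.pdf")):
#       print(table.df)            # pandas DataFrame
#       table.to_csv("table.csv")  # or export in another format
```

For tables without ruling lines, passing `flavor="stream"` instead of `"lattice"` is the usual thing to try, bearing in mind the single-table-per-page caveat listed above.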

Does your business deal with data or text recognition in digital documents, PDFs or images? Have you wondered how to extract tabular data, extract text from images, extract data from PDF or extract text from PDF accurately & efficiently?


PDFTables

PDFTables is a secure and scalable PDF to Excel converter and table extraction API. It’s driven completely by internal algorithms with no room for customizations or tweaks. Simply upload your document and download the table output in an Excel, CSV, XML or JSON format.


Pros

  • Works across small and large data sets.
  • Automated table extraction.
  • Exports tables to multiple formats like CSV, Excel, JSON, & XML.
  • Free for up to 25 pages.
  • Handles multiple files at the same time.

Cons

  • Can’t tweak or customize the table extraction algorithm.
  • Doesn’t perform Optical Character Recognition (OCR).
  • Complete reliance on the underlying algorithm for accuracy and performance.
  • Doesn’t support any cloud integration.


Docparser

Docparser is a robust cloud-based parsing app that can extract data & tables from documents, images or PDFs. Like Tabula, it runs on the Tabula-Java library but has more advanced features.

Once you upload a file, you will be required to set parsing rules that teach the software to identify the regions of interest (those containing tables) in your document. The software then remembers these rules and applies them to similar documents in the future.

With built-in OCR capabilities, Docparser can also help automate business workflows to some extent.


Pros

  • Supports batch processing of multiple documents.
  • Built-in OCR.
  • Allows custom parsing rules.
  • Exports tables to multiple formats like CSV, Excel, JSON, & XML.
  • Supports some neat integration options.

Cons

  • Parsing rules can get complicated for complex tables & documents.
  • You need to define the coordinates and boundaries for each table.
  • Runs on a template identification model, so it is not truly automated!
  • Can’t automatically handle new document types & formats.
  • Might require separate parsing rules for tables or data that appear in different regions within the same document.
  • Only works accurately on documents with fixed region formatting or known templates.
  • Might require some level of verification and rework.
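The template-based approach, and why it struggles with new layouts, can be illustrated with a small Python sketch. This is not Docparser's rule format. The `TableRule` type, the layout names and the coordinates are all hypothetical, standing in for "one stored region per known document layout".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRule:
    """Hypothetical parsing rule: the rectangle (in points) where the table sits."""
    left: float
    top: float
    right: float
    bottom: float


# One rule per known layout; every new layout needs a new entry.
RULES = {
    "acme_invoice_v1": TableRule(left=40, top=200, right=560, bottom=600),
    "acme_invoice_v2": TableRule(left=40, top=260, right=560, bottom=640),
}


def rule_for(layout_id):
    """Look up the stored rule, failing loudly for layouts nobody has configured."""
    try:
        return RULES[layout_id]
    except KeyError:
        raise LookupError(f"no parsing rule for layout {layout_id!r}; define one first") from None


def cells_in_region(cells, rule):
    """Keep the text of (text, x, y) cells that fall inside the rule's rectangle."""
    return [
        text
        for text, x, y in cells
        if rule.left <= x <= rule.right and rule.top <= y <= rule.bottom
    ]


cells = [("Invoice #123", 40, 50), ("Widget", 60, 220), ("4.50", 400, 220)]

matched = cells_in_region(cells, rule_for("acme_invoice_v1"))
shifted = cells_in_region(cells, rule_for("acme_invoice_v2"))
print(matched)
print(shifted)
```

The same cells are picked up under the first rule and missed entirely under the second, whose region starts lower on the page. That is the practical meaning of "only works accurately on documents with fixed region formatting".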

Want to scrape data from PDF documents, convert PDF table to Excel or automate table extraction? Find out how Nanonets PDF scraper or PDF parser can power your business to be more productive.


Nanonets

Nanonets is an OCR software that leverages AI & ML capabilities to automatically extract tables from PDF documents, images and scanned files. Unlike other solutions, Nanonets doesn’t require separate rules and templates for each new document type.

Relying on AI-driven cognitive intelligence, Nanonets can handle semi-structured and even unseen documents while improving over time. You can also customize the output to extract only the tables or data entries you are interested in.

It is fast, accurate, easy to use, allows users to build custom OCR models from scratch and has some neat Zapier integrations. Digitize documents, extract tables or data-fields, and integrate with your everyday apps via APIs in a simple, intuitive interface.

The Nanonets algorithm & OCR models learn continuously. They can be trained or retrained multiple times and are very customizable. While offering a great API & documentation for developers, the software is also ideal for organizations with no in-house team of developers.


Pros

  • Cognitive data & table extraction with OCR.
  • High accuracy even on semi-structured or unseen document formats.
  • Automatically detects tables including structured row-column information within its response.
  • Provides a blitz-scaling, modern UI that processes documents up to 10 times faster than other software.
  • Easy to use and set up. Can be integrated and set up in a couple of days.
  • Supports batch processing of multiple documents.
  • Exports tables to multiple formats like CSV, Excel, & JSON.
  • Seamless 2-way integration with multiple accounting software.
  • Almost no post-processing required.
  • Works with non-English or multiple languages.
  • Wide choice of integration options.

Cons

  • Can’t handle very high volume spikes!
  • Only offers 100 free documents/credits per month.

Nanonets has many interesting use cases that could optimize your business performance, save costs and boost growth. Find out how Nanonets’ use cases can apply to your product.

Nanonets offers a pre-trained Table extractor model that runs out-of-the-box. Check out a quick demo:

Nanonets Table Extractor

You can also activate the table extraction feature in the other pre-trained models offered by Nanonets:

  • Invoices
  • Receipts
  • Driver’s license (US)
  • Passports

Just add your files, activate table extraction, test & verify the extracted table data, and export as an Excel or CSV file.

Please note that you will have to sign up for a free trial of the Pro plan to activate the table extraction feature!

How to train your Model for Accurate Table Extraction
The Nanonets Invoice Model performing Table Extraction


Nanonets Documentation

If you’re looking to train your own OCR models to build a PDF to database or PDF to table converter, check out the Nanonets API. In the documentation, you will find ready-to-use code samples in Shell, Ruby, Golang, Java, C# and Python, as well as detailed API specs for the different endpoints.
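Once an API returns a table with row-column information, turning it into a spreadsheet-friendly format takes little code. The Python sketch below assumes each table arrives as a list of cells carrying 1-based `row` and `col` indices plus a `text` value. That shape is an assumption made for illustration, so check the API documentation for the actual response schema.

```python
import csv
import io


def cells_to_grid(cells):
    """Arrange cells with 1-based row/col indices into a rectangular grid of strings."""
    if not cells:
        return []
    n_rows = max(cell["row"] for cell in cells)
    n_cols = max(cell["col"] for cell in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"] - 1][cell["col"] - 1] = cell["text"]
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()


# Hypothetical cells as they might appear in a table prediction.
cells = [
    {"row": 1, "col": 1, "text": "Item"},
    {"row": 1, "col": 2, "text": "Qty"},
    {"row": 2, "col": 1, "text": "Widget"},
    {"row": 2, "col": 2, "text": "2"},
    {"row": 3, "col": 2, "text": "5"},
]

grid = cells_to_grid(cells)
csv_text = grid_to_csv(grid)
print(csv_text)
```

Cells that the response does not mention are left as empty strings, so a merged or missing cell shows up as a visible gap instead of shifting the rest of the row.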

Update December 2021: this post was originally published in April 2021 and has since been updated multiple times.

This table extraction tool was launched on Product Hunt.

