pd3f – PDF Text Extractor

Installation

Mon, 08 Mar 2021 00:00:00 +0000

You need to set up Docker, fetch the pd3f repository, and run a start script:

git clone https://github.com/pd3f/pd3f
cd pd3f
./dev.sh # or ./prod.sh

Adapt the various docker-compose.yml files to your needs.

The first time pd3f starts, it will download (and build) the Docker images. Altogether you need ~8 GB of disk space to store all the required components.

Usage

Mon, 08 Mar 2021 00:00:00 +0000

Using the GUI

Once pd3f is running, access the web-based GUI at http://localhost:1616. The first time you upload a PDF, pd3f will download some large language models.

After uploading a PDF you will get redirected to a web page displaying progress / results of the job.

Using the API

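import time

import requests

# Upload the PDF; 'lang' selects the document language.
with open('/dir/test.pdf', 'rb') as f:
    files = {'pdf': ('test.pdf', f)}
    response = requests.post('http://localhost:1616', files=files, data={'lang': 'de'})
response.raise_for_status()
job_id = response.json()['id']

# Poll until the job has finished or failed.
while True:
    r = requests.get(f"http://localhost:1616/update/{job_id}")
    j = r.json()
    if 'text' in j:
        break
    if 'failed' in j:
        # 'log' is always present and holds the text output of the job.
        raise RuntimeError(j['log'])
    print('waiting...')
    time.sleep(1)
print(j['text'])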

POST parameters:

lang: set the language (options: 'de', 'en', 'es', 'fr')
fast: whether to skip some Parsr steps to speed up processing (default: False)
tables: whether to check for tables (default: False)
experimental: whether to extract text in experimental mode (footnotes to endnotes, deduplicate page header / footer) (default: False)
check_ocr: whether to check first if all pages were OCR'd (default: True, cannot be modified in GUI)

Poll /update/<uuid> to follow the progress. The JSON response describes the status of the processing job.

Fields:

log: always present, text output from the job.
text, tables, and filename: only present when the job finished successfully
position: present if on waiting list, returns position as integer
running: present if job is running
failed: present if job has failed
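
The field list above can be turned into a small status helper. The following is only a sketch: the function name `job_state` and the returned labels are made up for illustration, and it only checks whether a field is present, since the value types of `running` and `failed` are not specified here.

```python
def job_state(payload: dict) -> str:
    """Map a /update/<uuid> JSON payload to a single state label.

    Only the presence of a field is checked; the labels are
    illustrative and not part of the pd3f API.
    """
    if "text" in payload:
        return "finished"
    if "failed" in payload:
        return "failed"
    if "running" in payload:
        return "running"
    if "position" in payload:
        return "waiting"
    return "unknown"


if __name__ == "__main__":
    # Hand-written example payloads, not real server output.
    print(job_state({"log": "", "position": 2}))
    print(job_state({"log": "done", "text": "Hello", "tables": [], "filename": "a.pdf"}))
```

Checking `text` first means a finished job is reported as finished even if other fields happen to be present as well.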

Scaling

You can run more Parsr workers like this:

docker-compose up --scale worker=3

To increase the number of frontend workers, adjust the environment variable WEB_CONCURRENCY in docker-compose.yml:

WEB_CONCURRENCY=2

You can also create an additional docker-compose file to override certain settings. Take a look at docker-compose.prod.yml for an example:

docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --scale worker=2

House Keeping

Docker uses three volumes:

pd3f-data-uploads: input & output files, mapped to ./data/pd3f-data-uploads/
pd3f-data-cache: internal volume, storing data so you don’t have to download model files over and over again
pd3f-data-to-ocr: internal volume, temporary location for PDFs to get OCR'd. Files are deleted, but logs are kept.

By default, results are kept for 24 hours. However, no files are deleted automatically; only the results in the queue (e.g., the extracted text) expire.

Run this command from time to time to schedule jobs that delete files in pd3f-data-uploads and pd3f-data-to-ocr:

docker-compose run --rm worker rqscheduler --host redis --burst

Experimental mode

Sun, 05 May 2019 00:00:00 +0100

Experimental mode tries to refine the extracted text even further. In experimental mode, pd3f may throw certain parts of a PDF away, e.g., duplicate headers, so the output needs to be checked manually.

Features That Are Always Applied

Dehyphenation of Lines

Check if two lines can be joined by removing hyphens ('-').
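
To illustrate the idea, here is a minimal sketch of the decision. It is not pd3f's implementation: pd3f relies on language models, whereas the `toy_score` function and its `KNOWN` word list below are hypothetical stand-ins for such a model, and `join_hyphenated` is a name chosen for this example.

```python
from typing import Callable

# Hypothetical vocabulary standing in for a real language model.
KNOWN = {"a", "good", "example", "well-known", "fact"}


def toy_score(text: str) -> float:
    """Toy plausibility score: fraction of words found in KNOWN."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in KNOWN for w in words) / len(words)


def join_hyphenated(first: str, second: str, score: Callable[[str], float]) -> str:
    """Join two lines, deciding whether a trailing hyphen should be removed.

    `score` rates how plausible a text is (higher is better).
    """
    if not first.endswith("-"):
        return first + " " + second
    removed = first[:-1] + second  # hyphen was only a line-break artifact
    kept = first + second          # hyphen belongs to the word
    return removed if score(removed) >= score(kept) else kept


if __name__ == "__main__":
    print(join_hyphenated("a good exam-", "ple", toy_score))
    print(join_hyphenated("a well-", "known fact", toy_score))
```

Both candidates are scored and the more plausible one wins, so genuine compound words such as "well-known" keep their hyphen.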

Reasonable Joining of Lines

Decide between adding a simple space (' ') or a new line ('\n') when joining lines.

Experimental Mode

Use it with care.

Reverse Page Break

Check if the last paragraph of a page and the first paragraph of the following page can be joined.

Footnotes to Endnotes

In order to join paragraphs (and reverse page breaks), detect footnotes and turn them into endnotes. For now, the footnotes are pulled to the end of a file.

Deduplication of Page Header / Footer

If the header or footer is the same on all pages, it is displayed only once. Headers are pulled to the start of the document and footers to the end. Heuristics based on the similarity of the headers / footers are used (Jaccard distance for the text, plus a comparison of overlapping shapes).

For OCR'd PDFs, it is hard to decide whether headers / footers are duplicates: the text in the header / footer is small, so the OCR text is often faulty.

Special case for OCR'd PDFs: the header / footer with the best Flair score is chosen for display.
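
The text-similarity part of this heuristic can be sketched as follows. This is an illustration under assumptions, not pd3f's code: the whitespace tokenization, the function names, and the `max_distance` threshold of 0.5 are all chosen for this example, and the shape comparison is left out.

```python
from typing import Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the word sets of two strings (0.0 = same words)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def is_repeated(texts: Sequence[str], max_distance: float = 0.5) -> bool:
    """True if every text is close to the first one.

    The threshold is an illustrative assumption.
    """
    return all(jaccard_distance(texts[0], t) <= max_distance for t in texts[1:])


if __name__ == "__main__":
    footers = [
        "Ministry of Finance Page 1",
        "Ministry of Finance Page 2",
        "Ministry of Finance Page 3",
    ]
    print(is_repeated(footers))
```

A distance threshold rather than strict equality tolerates small differences such as page numbers or single OCR errors.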

FAQ

Sun, 05 May 2019 00:00:00 +0100

Parsr can also do OCR, why do you use OCRmyPDF?

OCRmyPDF applies various image preprocessing steps to improve the image quality. This improves the output of Tesseract. Parsr uses raw Tesseract, so its results are worse.

Parsr can also extract text, why is your tool needed?

The text output of Parsr is scrambled, e.g., hyphens are not removed. pd3f improves the overall text quality by reconstructing the original text with language models. See pd3f-core for details.

Overall, Parsr is a great tool, but it still has rough edges. pd3f improves the output with various (opinionated) hand-crafted rules. pd3f mainly focuses on formal letters and official documents for now. Based on this assumption, we can simplify certain things. It was developed mainly for German documents, but it should work for other languages as well.

Running pd3f in production with nginx

Sun, 05 May 2019 00:00:00 +0100

An example config for nginx to run in conjunction with docker-compose.prod.yml:

# prevent guessing of file names / URLs
limit_req_zone $binary_remote_addr zone=limitfiles:10m rate=1r/s;

server {
    server_name changeme.com;
    client_max_body_size 50M;

    if ($request_method !~ ^(GET|HEAD|POST)$ )
    {
        return 405;
    }

    location /files/ {
        limit_req zone=limitfiles burst=2 nodelay;

        add_header Content-disposition "attachment";
        alias /var/pd3f/data/pd3f-data-uploads/;
    }

    location ~ /(update|result)/ {
        limit_req zone=limitfiles burst=2 nodelay;

        proxy_pass http://127.0.0.1:1616;
    }

    location / {
        # simple frontend caching
        expires 10m;
        add_header Cache-Control "public";

        proxy_pass http://127.0.0.1:1616;
    }

    # restrict access to dashboard to IP address (or subnet) + protect with Basic Auth
    location /dashboard/ {
        limit_req zone=limitfiles burst=20 nodelay;

        allow xx.xx.xx.xx;
        deny all;

        auth_basic "Private Area";
        auth_basic_user_file /path/to/.htpasswd;

        proxy_pass http://127.0.0.1:9181;
    }
}

Make sure to set the correct permissions so that nginx can serve the static files (in /var/pd3f/data/pd3f-data-uploads/).

Common Problems

If there are problems with rq (the queue), just remove the Redis container:

docker rm root_redis_1

This will clear all data in Redis.

Release notes

Sun, 05 May 2019 00:00:00 +0100

v0.5.x

make parsr config customizable. check out https://github.com/pd3f/pd3f/blob/master/example_api.py and https://pd3f.github.io/pd3f-core/export.html#pd3f.export.extract

v0.4.x

improve Docker-Compose setup
add Italian support
update to Parsr v1.2.2, OCRmyPDF v11.6.2
make RQ-Dashboard v0.6.3 work by pinning RQ to v1.5.2
add optional timeout for rq jobs to mark started jobs that won't finish as failed

v0.3.x

first public release

Examples

Here are three example outputs to illustrate the performance of pd3f. The text was extracted in 'experimental' mode.

Example #1

Text extracted from an image-based PDF.

Raw extracted text here

Example #2

Text extracted from a digital PDF.

Raw extracted text here

Example #3

Another digital PDF; the footnotes were turned into endnotes.

Raw extracted text here
