<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>pd3f – PDF Text Extractor</title>
    <link>https://pd3f.com/</link>
      <atom:link href="https://pd3f.com/index.xml" rel="self" type="application/rss+xml" />
    <description>pd3f – PDF Text Extractor</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 08 Mar 2021 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://pd3f.com/images/icon_hua3b88bdf4967d9b1eae9a9ccf0ae4ac1_44230_512x512_fill_lanczos_center_3.png</url>
      <title>pd3f – PDF Text Extractor</title>
      <link>https://pd3f.com/</link>
    </image>
    
    <item>
      <title>Installation</title>
      <link>https://pd3f.com/docs/pd3f/installation/</link>
      <pubDate>Mon, 08 Mar 2021 00:00:00 +0000</pubDate>
      <guid>https://pd3f.com/docs/pd3f/installation/</guid>
      <description>&lt;p&gt;You need to set up 
&lt;a href=&#34;https://docs.docker.com/get-docker/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Docker&lt;/a&gt;, clone the &lt;code&gt;pd3f&lt;/code&gt; repository, and run a start script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;git clone https://github.com/pd3f/pd3f
cd pd3f
./dev.sh # or ./prod.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Adapt the different &lt;code&gt;docker-compose.yml&lt;/code&gt; files to your needs.&lt;/p&gt;
&lt;p&gt;The first time &lt;code&gt;pd3f&lt;/code&gt; starts, it downloads (and builds) the Docker images.
Altogether, you need about 8 GB of disk space to store all the required components.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Usage</title>
      <link>https://pd3f.com/docs/pd3f/usage/</link>
      <pubDate>Mon, 08 Mar 2021 00:00:00 +0000</pubDate>
      <guid>https://pd3f.com/docs/pd3f/usage/</guid>
      <description>&lt;h2 id=&#34;using-the-gui&#34;&gt;Using the GUI&lt;/h2&gt;
&lt;p&gt;The first time you upload a PDF, &lt;code&gt;pd3f&lt;/code&gt; will download some large language models.
After it has finished, access the web-based GUI at 
&lt;a href=&#34;http://localhost:1616&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;http://localhost:1616&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After uploading a PDF you will get redirected to a web page displaying progress / results of the job.&lt;/p&gt;
&lt;h2 id=&#34;using-the-api&#34;&gt;Using the API&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import time

import requests

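# Upload the PDF; the JSON response contains the job ID.
with open(&#39;/dir/test.pdf&#39;, &#39;rb&#39;) as pdf:
    files = {&#39;pdf&#39;: (&#39;test.pdf&#39;, pdf)}
    response = requests.post(&#39;http://localhost:1616&#39;, files=files, data={&#39;lang&#39;: &#39;de&#39;})
job_id = response.json()[&#39;id&#39;]

# Poll until the job has finished or failed.
while True:
    r = requests.get(f&amp;quot;http://localhost:1616/update/{job_id}&amp;quot;)
    j = r.json()
    if &#39;text&#39; in j:
        break
    if &#39;failed&#39; in j:
        # `log` is always present and holds the text output of the job.
        raise RuntimeError(j[&#39;log&#39;])
    print(&#39;waiting...&#39;)
    time.sleep(1)
print(j[&#39;text&#39;])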
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Post params:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;lang&lt;/code&gt;: set the language (options: &amp;lsquo;de&amp;rsquo;, &amp;lsquo;en&amp;rsquo;, &amp;lsquo;es&amp;rsquo;, &amp;lsquo;fr&amp;rsquo;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fast&lt;/code&gt;: whether to skip some processing steps to speed up extraction (default: False)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tables&lt;/code&gt;: whether to check for tables (default: False)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;experimental&lt;/code&gt;: whether to extract text in experimental mode (footnotes to endnotes, deduplicate page header / footer) (default: False)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;check_ocr&lt;/code&gt;: whether to check first if all pages were OCRd (default: True, cannot be modified in GUI)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Poll &lt;code&gt;/update/&amp;lt;uuid&amp;gt;&lt;/code&gt; to keep up with the progress. The JSON response reports the status of the processing job.&lt;/p&gt;
&lt;p&gt;Fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;log&lt;/code&gt;: always present, text output from the job.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;text&lt;/code&gt; and &lt;code&gt;tables&lt;/code&gt; and &lt;code&gt;filename&lt;/code&gt;: only present when the job finished successfully&lt;/li&gt;
&lt;li&gt;&lt;code&gt;position&lt;/code&gt;: present if on waiting list, returns position as integer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;running&lt;/code&gt;: present if job is running&lt;/li&gt;
&lt;li&gt;&lt;code&gt;failed&lt;/code&gt;: present if job has failed&lt;/li&gt;
&lt;/ul&gt;
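&lt;p&gt;The fields above can be folded into a single status label, which keeps a polling loop easy to read. This is only a minimal sketch: the function name &lt;code&gt;job_state&lt;/code&gt; and the returned labels are assumptions made for illustration and are not part of the pd3f API.&lt;/p&gt;

```python
def job_state(status):
    """Map the JSON of an /update response (a dict) to a short label.

    The labels are hypothetical; only the field names come from the docs.
    """
    if "text" in status:
        return "finished"
    if "failed" in status:
        return "failed"
    if "running" in status:
        return "running"
    if "position" in status:
        return "queued at position {}".format(status["position"])
    return "unknown"


# Example responses, built by hand from the field list above.
print(job_state({"log": "", "position": 2}))
print(job_state({"log": "done", "text": "Hello", "tables": [], "filename": "test.txt"}))
```

&lt;p&gt;In a polling loop you would stop on &lt;code&gt;finished&lt;/code&gt; or &lt;code&gt;failed&lt;/code&gt; and keep waiting otherwise.&lt;/p&gt;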
&lt;h3 id=&#34;scaling&#34;&gt;Scaling&lt;/h3&gt;
&lt;p&gt;You can also run more parsr workers with this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;docker-compose up --scale worker=3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To increase the number of frontend workers, adjust the environment variable &lt;code&gt;WEB_CONCURRENCY&lt;/code&gt; in &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yml&#34;&gt;WEB_CONCURRENCY=2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also create a new &lt;code&gt;docker-compose.yml&lt;/code&gt; to override certain settings. Take a look at 
&lt;a href=&#34;./docker-compose.prod.yml&#34;&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;docker-compose -f docker-compose.yml -f docker-compose.prod.yml up --scale worker=2
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;house-keeping&#34;&gt;Housekeeping&lt;/h3&gt;
&lt;p&gt;Docker uses three volumes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pd3f-data-uploads&lt;/code&gt;: input &amp;amp; output files, mapped to &lt;code&gt;./data/pd3f-data-uploads/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pd3f-data-cache&lt;/code&gt;: internal volume, storing data so you don&amp;rsquo;t have to download model files over and over again&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pd3f-data-to-ocr&lt;/code&gt;: internal volume, temporary location for PDFs to get OCRd. Files get deleted but logs get kept.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By default, results are kept for 24 hours. However, only the results in the queue (e.g., the extracted text) are deleted automatically; files are not.&lt;/p&gt;
&lt;p&gt;Run this command from time to time to schedule jobs that delete files in &lt;code&gt;pd3f-data-uploads&lt;/code&gt; and &lt;code&gt;pd3f-data-to-ocr&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;docker-compose run --rm worker rqscheduler --host redis --burst
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Experimental mode</title>
      <link>https://pd3f.com/docs/pd3f/experimental/</link>
      <pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
      <guid>https://pd3f.com/docs/pd3f/experimental/</guid>
      <description>&lt;p&gt;The experimental mode tries to refine the output even further.
In experimental mode, pd3f may throw certain parts of a PDF away, e.g., duplicate headers,
so the output needs to be checked manually.&lt;/p&gt;
&lt;h2 id=&#34;features-happen-all-the-time&#34;&gt;Features That Always Apply&lt;/h2&gt;
&lt;h3 id=&#34;dehyphenation-of-lines&#34;&gt;Dehyphenation of Lines&lt;/h3&gt;
&lt;p&gt;Check if two lines can be joined by removing hyphens (&amp;lsquo;-&amp;rsquo;).&lt;/p&gt;
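&lt;p&gt;To illustrate the idea, here is a minimal sketch of dehyphenation. It is not pd3f&amp;rsquo;s implementation: pd3f relies on language models to make this decision, whereas this sketch uses a plain set of known words as a stand-in. The function name &lt;code&gt;join_lines&lt;/code&gt; and the &lt;code&gt;known_words&lt;/code&gt; parameter are assumptions made for illustration.&lt;/p&gt;

```python
def join_lines(first, second, known_words):
    """Join two lines; drop a trailing hyphen if the merged word is known.

    `known_words` is a hypothetical stand-in for a language model.
    """
    if not first.endswith("-"):
        return first + " " + second
    # Split off the word fragment before the hyphen and the one after it.
    head, _, last = first[:-1].rpartition(" ")
    next_word, _, rest = second.partition(" ")
    merged = last + next_word
    if merged.lower() in known_words:
        # The hyphen only marked a line break: drop it.
        parts = [head, merged, rest]
    else:
        # Keep the hyphen, e.g. for compounds such as "well-known".
        parts = [head, last + "-" + next_word, rest]
    return " ".join(p for p in parts if p)


words = {"example"}
print(join_lines("This is an exam-", "ple of text", words))
print(join_lines("a well-", "known fact", words))
```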
&lt;h3 id=&#34;reasonable-joining-of-lines&#34;&gt;Reasonable Joining of Lines&lt;/h3&gt;
&lt;p&gt;Decide between adding a simple space (&amp;lsquo; &amp;rsquo;) or a new line (&amp;lsquo;\n&amp;rsquo;) when joining lines.&lt;/p&gt;
&lt;h2 id=&#34;experimental-mode&#34;&gt;Experimental Mode&lt;/h2&gt;
&lt;p&gt;Use it with care.&lt;/p&gt;
&lt;h3 id=&#34;reverse-page-break&#34;&gt;Reverse Page Break&lt;/h3&gt;
&lt;p&gt;Check if the last paragraph of a page and the first paragraph of the following page can be joined.&lt;/p&gt;
&lt;h3 id=&#34;footnote-to-endnotes&#34;&gt;Footnote to Endnotes&lt;/h3&gt;
&lt;p&gt;In order to join paragraphs (and reverse page breaks), detect footnotes and turn them into endnotes.
For now, the footnotes are pulled to the end of a file.&lt;/p&gt;
&lt;h3 id=&#34;deduplication-of-pager-header--footer&#34;&gt;Deduplication of Page Header / Footer&lt;/h3&gt;
&lt;p&gt;If the header or the footer is the same on all pages, it is only displayed once.
Headers are pulled to the start of the document and footers to the end.
Heuristics based on the similarity of headers / footers are used (Jaccard distance for the text, and a comparison of overlapping shapes).&lt;/p&gt;
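&lt;p&gt;For reference, the Jaccard distance between two texts can be sketched as follows. This is a generic word-level version, not pd3f&amp;rsquo;s code: the function name &lt;code&gt;jaccard_distance&lt;/code&gt; and the choice to compare lower-cased, whitespace-separated words are assumptions made for illustration.&lt;/p&gt;

```python
def jaccard_distance(text_a, text_b):
    """Return 1 - (shared words / all words) for two texts.

    0.0 means identical word sets, 1.0 means no words in common.
    """
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty texts are treated as identical.
        return 0.0
    shared = words_a.intersection(words_b)
    return 1 - len(shared) / len(union)


# Footers that differ only in the page number stay close to each other.
print(jaccard_distance("Page 1 of 10", "Page 2 of 10"))
print(jaccard_distance("Page 1 of 10", "Annual report"))
```

&lt;p&gt;A small distance suggests that two footers are duplicates; where to draw the line is a tuning decision.&lt;/p&gt;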
&lt;p&gt;For OCRd PDFs, it&amp;rsquo;s hard to decide whether headers / footers are duplicates.
The text in the header / footer is small, so the OCR text is often faulty.&lt;/p&gt;
&lt;p&gt;Special case for OCRd PDFs: Choose the Header / Footer with the best Flair score to display.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>FAQ</title>
      <link>https://pd3f.com/docs/pd3f/faq/</link>
      <pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
      <guid>https://pd3f.com/docs/pd3f/faq/</guid>
      <description>&lt;h3 id=&#34;parsr-can-also-do-ocr-why-do-use-ocrmypdf&#34;&gt;Parsr can also do OCR, why do you use OCRmyPDF?&lt;/h3&gt;
&lt;p&gt;OCRmyPDF applies various image preprocessing steps to improve the image quality.
This improves the output of Tesseract.
Parsr uses raw Tesseract so the results are worse.&lt;/p&gt;
&lt;h3 id=&#34;parsr-can-also-extract-text-why-is-your-tool-needed&#34;&gt;Parsr can also extract text, why is your tool needed?&lt;/h3&gt;
&lt;p&gt;The text output of Parsr is scrambled, e.g., hyphens are not removed.
&lt;code&gt;pd3f&lt;/code&gt; improves the overall text quality by reconstructing the original text with language models.
See 
&lt;a href=&#34;https://github.com/pd3f/pd3f-core&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;pd3f-core&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;Overall Parsr is a great tool, but it still has rough edges.
&lt;code&gt;pd3f&lt;/code&gt; improves the output with various (opinionated) hand-crafted rules.
&lt;code&gt;pd3f&lt;/code&gt; mainly focuses on formal letters and official documents for now.
Based on this assumption we can simplify certain things.
It was developed mainly for German documents but it should work for other languages as well.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Running pd3f in production with nginx</title>
      <link>https://pd3f.com/docs/pd3f/prod/</link>
      <pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
      <guid>https://pd3f.com/docs/pd3f/prod/</guid>
      <description>&lt;p&gt;An example config for nginx to run in conjunction with 
&lt;a href=&#34;./docker-compose.prod.yml&#34;&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-nginx&#34;&gt;# prevent guessing of file names / URLs
limit_req_zone $binary_remote_addr zone=limitfiles:10m rate=1r/s;

server {
    server_name changeme.com;
    client_max_body_size 50M;

    if ($request_method !~ ^(GET|HEAD|POST)$ )
    {
        return 405;
    }

    location /files/ {
        limit_req zone=limitfiles burst=2 nodelay;

        add_header Content-disposition &amp;quot;attachment&amp;quot;;
        alias /var/pd3f/data/pd3f-data-uploads/;
    }

    location ~ /(update|result)/ {
        limit_req zone=limitfiles burst=2 nodelay;

        proxy_pass http://127.0.0.1:1616;
    }

    location / {
        # simple frontend caching
        expires 10m;
        add_header Cache-Control &amp;quot;public&amp;quot;;

        proxy_pass http://127.0.0.1:1616;
    }

    # restrict access to dashboard to IP address (or subnet) + protect with Basic Auth
    location /dashboard/ {
        limit_req zone=limitfiles burst=20 nodelay;

        allow xx.xx.xx.xx;
        deny all;

        auth_basic &amp;quot;Private Area&amp;quot;;
        auth_basic_user_file /path/to/.htpasswd;

        proxy_pass http://127.0.0.1:9181;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure to set the correct permissions so that nginx can serve the static files (in &lt;code&gt;/var/pd3f/data/pd3f-data-uploads/&lt;/code&gt;).&lt;/p&gt;
&lt;h2 id=&#34;common-problems&#34;&gt;Common Problems&lt;/h2&gt;
&lt;p&gt;If there are problems with rq (the queue), just remove the Redis container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker rm root_redis_1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will clear all data in Redis.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Release notes</title>
      <link>https://pd3f.com/docs/pd3f/release_notes/</link>
      <pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
      <guid>https://pd3f.com/docs/pd3f/release_notes/</guid>
      <description>&lt;h2 id=&#34;v05x&#34;&gt;v0.5.x&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;make parsr config customizable. check out 
&lt;a href=&#34;https://github.com/pd3f/pd3f/blob/master/example_api.py&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://github.com/pd3f/pd3f/blob/master/example_api.py&lt;/a&gt; and 
&lt;a href=&#34;https://pd3f.github.io/pd3f-core/export.html#pd3f.export.extract&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://pd3f.github.io/pd3f-core/export.html#pd3f.export.extract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;v04x&#34;&gt;v0.4.x&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;improve Docker-Compose setup&lt;/li&gt;
&lt;li&gt;add Italian support&lt;/li&gt;
&lt;li&gt;update to Parsr v1.2.2, OCRmyPDF v11.6.2&lt;/li&gt;
&lt;li&gt;make RQ-Dashboard v0.6.3 work by pinning RQ to v1.5.2&lt;/li&gt;
&lt;li&gt;add optional timeout for rq jobs to mark started jobs that won&amp;rsquo;t finish as failed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;v03x&#34;&gt;v0.3.x&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;first public release&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Examples</title>
      <link>https://pd3f.com/examples/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://pd3f.com/examples/</guid>
      <description>&lt;p&gt;Here are three example outputs to illustrate the performance of pd3f. The text was extracted in &amp;lsquo;experimental&amp;rsquo; mode.&lt;/p&gt;
&lt;h2 id=&#34;example-1&#34;&gt;Example #1&lt;/h2&gt;
&lt;p&gt;Text extracted from an image-based PDF.&lt;/p&gt;
&lt;object data=&#34;/examples/00020_08112014_Stellungnahme_RAK_Koeln_RefE_Bekaempfung_Korruption.pdf&#34; type=&#34;application/pdf&#34; width=&#34;100%&#34; height=&#34;700px&#34;&gt;
    &lt;embed src=&#34;https://pd3f.com/examples/00020_08112014_Stellungnahme_RAK_Koeln_RefE_Bekaempfung_Korruption.pdf&#34;&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&#34;https://pd3f.com/examples/00020_08112014_Stellungnahme_RAK_Koeln_RefE_Bekaempfung_Korruption.pdf&#34;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
    &lt;/embed&gt;
&lt;/object&gt;
&lt;p&gt;
&lt;a href=&#34;https://pd3f.com/examples/00020_08112014_Stellungnahme_RAK_Koeln_RefE_Bekaempfung_Korruption.txt&#34;&gt;Raw extracted text here&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;example-2&#34;&gt;Example #2&lt;/h2&gt;
&lt;p&gt;Text extracted from a digital PDF.&lt;/p&gt;
&lt;object data=&#34;/examples/00004_09212018_bstbk_Unwandlungsgesetz.pdf&#34; type=&#34;application/pdf&#34; width=&#34;100%&#34; height=&#34;700px&#34;&gt;
    &lt;embed src=&#34;https://pd3f.com/examples/00004_09212018_bstbk_Unwandlungsgesetz.pdf&#34;&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&#34;https://pd3f.com/examples/00004_09212018_bstbk_Unwandlungsgesetz.pdf&#34;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
    &lt;/embed&gt;
&lt;/object&gt;
&lt;p&gt;
&lt;a href=&#34;https://pd3f.com/examples/00004_09212018_bstbk_Unwandlungsgesetz.txt&#34;&gt;Raw extracted text here&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;example-3&#34;&gt;Example #3&lt;/h2&gt;
&lt;p&gt;Another digital PDF; the footnotes were turned into endnotes.&lt;/p&gt;
&lt;object data=&#34;/examples/00001_112018_FU_Berlin_Richtlinie_2017_1371.pdf&#34; type=&#34;application/pdf&#34; width=&#34;100%&#34; height=&#34;700px&#34;&gt;
    &lt;embed src=&#34;https://pd3f.com/examples/00001_112018_FU_Berlin_Richtlinie_2017_1371.pdf&#34;&gt;
        &lt;p&gt;This browser does not support PDFs. Please download the PDF to view it: &lt;a href=&#34;https://pd3f.com/examples/00001_112018_FU_Berlin_Richtlinie_2017_1371.pdf&#34;&gt;Download PDF&lt;/a&gt;.&lt;/p&gt;
    &lt;/embed&gt;
&lt;/object&gt;
&lt;p&gt;
&lt;a href=&#34;https://pd3f.com/examples/00001_112018_FU_Berlin_Richtlinie_2017_1371.txt&#34;&gt;Raw extracted text here&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
