Commit a48d7e4

Merge pull request 'Release v24.07' (#12) from release_24.07 into master
2 parents b539366 + 20f2559 commit a48d7e4

8 files changed: +469 -20 lines changed

README.md

+17 -18
@@ -2,39 +2,38 @@

 <img src="https://goobi.io/wp-content/uploads/logo_goobi_plugin.png" align="right" style="margin:0 0 20px 20px;" alt="Plugin for Goobi workflow" width="175" height="109">

-Goobi workflow plugin to automatically read information from PDF files.
+This Step plugin for Goobi workflow automatically reads information from PDF files and extracts images and full text with coordinates to store these inside of the OCR and images folders.

 This is a plugin for Goobi workflow, the open source workflow tracking software for digitisation projects. More information about Goobi workflow is available under https://goobi.io. If you want to get in touch with the user community simply go to https://community.goobi.io.

 ## Plugin details

 More information about the functionality of this plugin and the complete documentation can be found in the central documentation area at https://docs.goobi.io

-Detail | Description
---- | ---
+Detail | Description
+--------------------------- | ----------------------
 **Plugin identifier** | intranda_step_pdf-extraction
-**Plugin type** | Step plugin
-**Licence** | GPL 2.0 or newer
-**Documentation (German)** | https://docs.goobi.io/goobi-workflow-plugins-de/step/intranda_step_pdf-extraction
-**Documentation (English)** | https://docs.goobi.io/goobi-workflow-plugins-en/step/intranda_step_pdf-extraction
+**Plugin type** | step
+**Licence** | GPL 2.0 or newer
+**Documentation (German)** | https://docs.goobi.io/workflow-plugins/v/eng/step/goobi-plugin-step-pdf-extraction
+**Documentation (English)** | https://docs.goobi.io/workflow-plugins/v/ger/step/goobi-plugin-step-pdf-extraction

 ## Goobi details

 Goobi workflow is an open source web application to manage small and large digitisation projects mostly in cultural heritage institutions all around the world. More information about Goobi can be found here:

-Detail | Description
---- | ---
-**Goobi web site** | https://www.goobi.io
-**Twitter** | https://twitter.com/goobi
-**Goobi community** | https://community.goobi.io
+Detail | Description
+--------------------------- | ---------------------------
+**Goobi web site** | https://www.goobi.io
+**Goobi community** | https://community.goobi.io
+**Goobi documentation** | https://docs.goobi.io

 ## Development

 This plugin was developed by intranda. If you have any issues, feedback, question or if you are looking for more information about Goobi workflow, Goobi viewer and all our other developments that are used in digitisation projects please get in touch with us.

-Contact | Details
---- | ---
-**Company name** | intranda GmbH
-**Address** | Bertha-von-Suttner-Str. 9, 37085 Göttingen, Germany
-**Web site** | https://www.intranda.com
-**Twitter** | https://twitter.com/intranda
+Contact | Details
+--------------------------- | ----------------------------------------------------
+**Company name** | intranda GmbH
+**Address** | Bertha-von-Suttner-Str. 9, 37085 Göttingen, Germany
+**Web site** | https://www.intranda.com

docs/index_de.md

+225
Large diffs are not rendered by default.

docs/index_en.md

+225
@@ -0,0 +1,225 @@
---
title: Split PDFs, extract full text and read table of contents
identifier: intranda_step_pdf-extraction
published: true
description: This is the technical documentation for the Goobi plugin for automatically reading information from PDF files.
---
## Introduction
This documentation describes how to install, configure and use a plugin that extracts images, full text and the table of contents from PDF files. The plugin extracts only what is present in the PDF and does not report an error if no full text or table of contents is found.


## Installation
To use the plugin, copy it to the following location:

```bash
/opt/digiverso/goobi/plugins/step/plugin_intranda_step_pdf-extraction-base.jar
```

The configuration of the plugin is expected under the following path:

```bash
/opt/digiverso/goobi/config/plugin_intranda_step_pdf-extraction.xml
```

The command line program `Ghostscript` and/or the tool `pdftoppm` from the package `poppler-utils` is also required, depending on the value of the `<generator>` element. Both can be installed from the system's package sources.


## Overview and functionality
Once the plugin has been installed, it can be configured in the user interface in a workflow step as shown in this screenshot.

![Integration within a work step in the workflow](screen1_en.png)

To use the plugin, a PDF file must be present in the master folder of the process at the time of execution. The plugin automatically splits it into individual pages. In addition, the full text is extracted if available, and the table of contents of the PDF file is read and entered as structural elements in the METS file.

It is therefore recommended that the workflow step with this plugin is preceded by another workflow step in which files are loaded into the master folder. This can be done by linking the process folder to the user's home folder or, for example, with the file upload plugin.


## Configuration
An example configuration could look like this:

```xml
<config>

    <project>*</project>
    <step>OCR-Extraktion</step>

    <validation>
        <!-- set to false to skip this step if no PDF files exist in the source folder -->
        <!-- DEFAULT true -->
        <failOnMissingPDF>true</failOnMissingPDF>
    </validation>

    <!-- if true then all old data from tifFolder, pdfFolder, textFolder and altoFolder will be deleted, and all file references will be removed if current book is not empty -->
    <!-- DEFAULT true -->
    <overwriteExistingData>true</overwriteExistingData>

    <mets>
        <!-- DEFAULT true -->
        <write>false</write>
        <!-- DEFAULT true -->
        <failOnError>false</failOnError>
        <!-- Settings for writing Mets-Structure -->
        <docType>
            <!-- If this element exists and is not empty then for each imported pdf
                 a new StructElement of the given type is created within the top StructElement. -->
            <parent>Chapter</parent>
            <!-- If this element exists and is not empty then the table-of-content structure of the pdf is
                 written into the Mets file. Each structure element of the PDF is written as a StructElement of the given type. -->
            <children>Chapter</children>
        </docType>
    </mets>

    <images>
        <!-- DEFAULT true -->
        <write>false</write>
        <!-- DEFAULT true -->
        <failOnError>true</failOnError>
        <!-- The resolution with which to scan the PDF file. This has a large impact on both image file size and quality. DEFAULT 300. -->
        <resolution>300</resolution>
        <!-- The image format for the image files written. DEFAULT tif. -->
        <!-- Allowed formats for the generator pdftoppm are png, jpg, jpeg, jpegcmyk, tif, tiff. -->
        <format>tif</format>
        <!-- Select the command line tool which should be used to create the images. Either 'ghostscript' or 'pdftoppm'. -->
        <generator>pdftoppm</generator>
        <!-- A parameter to add to the generator call. Repeatable -->
        <generatorParameter>-cropbox</generatorParameter>
        <!-- Hardcoded parameters for ghostscript are: -dUseCropBox, -SDEVICE, -r<res>, -sOutputFile, -dNOPAUSE, -dBATCH.
             Useful parameters for configuration are:
             ===================================================
             -q                    `quiet`, fewer messages
             ...................................................
             -g<width>x<height>    page size in pixels
             ===================================================
        -->
        <!-- Hardcoded parameters for pdftoppm are: -{format}, -r.
             Useful parameters for configuration are:
             ======================================================================================================
             -f <int>                            first page to print
             ......................................................................................................
             -l <int>                            last page to print
             ......................................................................................................
             -o                                  print only odd pages
             ......................................................................................................
             -e                                  print only even pages
             ......................................................................................................
             -singlefile                         write only the first page and do not add digits
             ......................................................................................................
             -scale-dimension-before-rotation    for rotated pdf, resize dimensions before the rotation
             ......................................................................................................
             -rx <fp>                            X resolution, in DPI
             ......................................................................................................
             -ry <fp>                            Y resolution, in DPI
             ......................................................................................................
             -scale-to <int>                     scales each page to fit within scale-to*scale-to pixel box
             ......................................................................................................
             -scale-to-x <int>                   scales each page horizontally to fit in scale-to-x pixels
             ......................................................................................................
             -scale-to-y <int>                   scales each page vertically to fit in scale-to-y pixels
             ......................................................................................................
             -x <int>                            x-coordinate of the crop area top left corner
             ......................................................................................................
             -y <int>                            y-coordinate of the crop area top left corner
             ......................................................................................................
             -W <int>                            width of crop area in pixels (DEFAULT 0)
             ......................................................................................................
             -H <int>                            height of crop area in pixels (DEFAULT 0)
             ......................................................................................................
             -sz <int>                           size of crop square in pixels (sets W and H)
             ......................................................................................................
             -cropbox                            use the crop box rather than media box
             ......................................................................................................
             -hide-annotations                   do not show annotations
             ......................................................................................................
             -mono                               generate a monochrome PBM file
             ......................................................................................................
             -gray                               generate a grayscale PGM file
             ......................................................................................................
             -sep <string>                       single character separator between name and page number (DEFAULT -)
             ......................................................................................................
             -forcenum                           force page number even if there is only one page
             ......................................................................................................
             -overprint                          enable overprint
             ......................................................................................................
             -freetype <string>                  enable FreeType font rasterizer: yes, no
             ......................................................................................................
             -thinlinemode <string>              set thin line mode: none, solid, shape. DEFAULT none.
             ......................................................................................................
             -aa <string>                        enable font anti-aliasing: yes, no
             ......................................................................................................
             -aaVector <string>                  enable vector anti-aliasing: yes, no
             ......................................................................................................
             -opw <string>                       owner password (for encrypted files)
             ......................................................................................................
             -upw <string>                       user password (for encrypted files)
             ......................................................................................................
             -q                                  don't print any messages or errors
             ......................................................................................................
             -progress                           print progress info
             ......................................................................................................
             -tiffcompression <string>           set TIFF compression: none, packbits, jpeg, lzw, deflate
             ======================================================================================================
        -->
    </images>

    <plaintext>
        <!-- DEFAULT true -->
        <write>true</write>
        <!-- DEFAULT true -->
        <failOnError>false</failOnError>
    </plaintext>

    <alto>
        <!-- DEFAULT true -->
        <write>true</write>
        <!-- DEFAULT true -->
        <failOnError>false</failOnError>
    </alto>

    <pagePdfs>
        <!-- DEFAULT true -->
        <write>true</write>
        <!-- DEFAULT true -->
        <failOnError>true</failOnError>
    </pagePdfs>

    <properties>
        <!-- Write this process property after extraction is done. The value depends on whether any ocr files with content have been written. -->
        <!-- If there exist some properties named so, then the first one will be picked up to accept the value.
             Otherwise a new process property will be created for this purpose. ONLY 1 fulltext tag is allowed. -->
        <fulltext>
            <!-- process property name. If blank, no property will be written -->
            <name>OCRDone</name>
            <!-- property value when there are alto contents or text contents created. DEFAULT TRUE. -->
            <value exists="true">YES</value>
            <!-- property value when there are neither alto contents nor text contents created. DEFAULT FALSE. -->
            <value exists="false">NO</value>
        </fulltext>
    </properties>

</config>
```

The configuration file can contain any number of `<config>` blocks, each targeting specific projects or workflow steps by name. Every block must specify the `<project>` and `<step>` elements. For a given step, the matching `<config>` block is selected in the following order of precedence:

1) `<project>` and `<step>` correspond to the current project and step
2) `<step>` corresponds to the current step and `<project>` is set to `*`
3) `<project>` corresponds to the current project and `<step>` is set to `*`
4) `<project>` and `<step>` are set to `*`

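The precedence order above can be illustrated with two `<config>` blocks. This is only a sketch: the project name `Archive_Project` is a hypothetical placeholder, the blocks are shortened to the elements relevant for selection, and the element that encloses several `<config>` blocks is not shown in this documentation, so it is left out here as well.

```xml
<!-- Matches rule 1: step "OCR-Extraktion" in the (hypothetical) project "Archive_Project" -->
<config>
    <project>Archive_Project</project>
    <step>OCR-Extraktion</step>
    <overwriteExistingData>false</overwriteExistingData>
</config>

<!-- Matches rule 4: fallback for every other project and step -->
<config>
    <project>*</project>
    <step>*</step>
    <overwriteExistingData>true</overwriteExistingData>
</config>
```

With these two blocks, the step `OCR-Extraktion` in `Archive_Project` would use the first block, while every other combination of project and step would fall back to the second one.
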
If the `<failOnMissingPDF>` element within the `<validation>` element is set to `true`, a warning is issued when no PDF files are found. Warnings are written to the journal and the server-internal log files. If the option is set to `false`, a missing PDF file is ignored.

The `<overwriteExistingData>` element sets globally for this plugin whether existing data may be overwritten. If it is `true`, the old data in the image, PDF, text and ALTO folders is deleted and the existing file references are removed.

The `<docType>` element controls which structure types are assigned in the METS file to the entries extracted from the PDF table of contents. The `<parent>` element defines the structure element in which all table of contents entries are stored. If it is omitted, all entries are added directly to the main element of the METS file. The `<children>` element specifies the structure type of the sub-elements created from the entries of the PDF table of contents.

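As a sketch of the case in which `<parent>` is omitted, the following fragment reuses only element names and values from the example configuration. It assumes that leaving the element out entirely behaves like the "omitted" case described here; the comments in the example configuration also mention an empty element.

```xml
<mets>
    <write>true</write>
    <failOnError>false</failOnError>
    <docType>
        <!-- no <parent>: table of contents entries go directly into the main element -->
        <children>Chapter</children>
    </docType>
</mets>
```
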
The `<pagePdfs>`, `<alto>`, `<plaintext>`, `<images>` and `<mets>` elements each have a `<write>` and a `<failOnError>` sub-element. For the corresponding file type (single-page PDF files, ALTO files, TXT files, image files and the METS file), these set whether the files should be written or overwritten, and whether a failure to write them should raise an error and cancel further execution.

The `<images>` element allows some further settings for image files. `<resolution>` specifies the image resolution (in DPI) and `<format>` the output file format of the extracted images.

The sub-element `<generator>` within `<images>` specifies which executable program is used on the server to extract the images. Valid values are `pdftoppm` and `ghostscript`. The element `<generatorParameter>` can be repeated; each occurrence contains one command line parameter for the program specified in `<generator>`.

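A sketch of an `<images>` block that switches the generator to Ghostscript could look as follows. The parameters `-q` and `-g<width>x<height>` are taken from the comment list in the example configuration. The pixel size `2480x3508` is an arbitrary illustration value, and whether `tif` is accepted by the Ghostscript generator is an assumption, since the example only lists the allowed formats for `pdftoppm`.

```xml
<images>
    <write>true</write>
    <failOnError>true</failOnError>
    <resolution>300</resolution>
    <format>tif</format>
    <generator>ghostscript</generator>
    <!-- one command line parameter per element -->
    <generatorParameter>-q</generatorParameter>
    <generatorParameter>-g2480x3508</generatorParameter>
</images>
```
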
The `<mets>` element controls the generation of the METS file. For example, `<docType>` controls which structure types are generated for the entries extracted from the PDF table of contents. The `<parent>` element is the main element in which all other table of contents entries are stored. If it is omitted, all entries are added directly to the main element of the METS file.

The elements `<plaintext>`, `<alto>` and `<pagePdfs>` control the generation of the text files, the ALTO files and the PDF files of the individual pages.

Process properties are written with `<properties>` depending on the result of the extraction. The example configuration writes the process property `OCRDone` with the value `YES` if full text was found in the PDF file and with the value `NO` if it was not. This is particularly helpful if the workflow is to be adapted afterwards, for example to skip an OCR step if full text already exists.

docs/screen1_de.png

935 KB

docs/screen1_en.png

913 KB

docs/screen2.png

124 KB

module-base/pom.xml

+1 -1
@@ -3,7 +3,7 @@
   <parent>
     <groupId>io.goobi.workflow.plugin</groupId>
     <artifactId>plugin-step-pdf-extraction</artifactId>
-    <version>24.06</version>
+    <version>24.07</version>
   </parent>
   <artifactId>plugin-step-pdf-extraction-base</artifactId>
   <packaging>jar</packaging>

pom.xml

+1 -1
@@ -3,7 +3,7 @@
   <parent>
     <groupId>io.goobi.workflow</groupId>
     <artifactId>workflow-base</artifactId>
-    <version>24.06</version>
+    <version>24.07</version>
     <relativePath />
   </parent>
   <groupId>io.goobi.workflow.plugin</groupId>
