
workflow-configuration

a makefilization for OCR-D workflows, with configuration examples

OCR-D workflow configurations based on makefiles

This is an attempt at running OCR-D workflows that are configured and controlled via makefiles, using GNU bash, GNU make and GNU parallel.

Makefilization offers the following advantages:

Nevertheless, there are also some disadvantages:


Dependencies

To install system dependencies for this package, run…

make deps-ubuntu

…in a privileged context for Ubuntu (like a Docker container).

Or equivalently, install the following packages:

Additionally, you must of course install ocrd itself along with its dependencies in the current shell environment. Moreover, depending on the specific configurations you want to use (i.e. the processors they contain), additional modules must be installed. See the OCR-D setup guide for instructions.

(Yes, workflow-configuration is already part of ocrd_all, which is also available on Dockerhub.)

Installation

Run:

make install

… if you are in a (Python) virtual environment. Otherwise specify the installation prefix directory via environment variable VIRTUAL_ENV.

Assuming $VIRTUAL_ENV/bin is in your PATH, you can then call:

cd WORKSPACE && make [OPTIONS] -f WORKFLOW-CONFIG.mk
make -C WORKSPACE [OPTIONS] -f WORKFLOW-CONFIG.mk

… for processing a single workspace directory, or …

ocrd-make [OPTIONS] -f WORKFLOW-CONFIG.mk WORKSPACE...

… for processing multiple workspaces at once (with the same interface as above).

Where:

Calling workflows is possible from anywhere in your filesystem, but for the WORKFLOW-CONFIG.mk you may need to:

(The previous version of ocrd-make tried to copy or symlink all makefiles to the runtime directory. You can still use those, but should remove the old Makefile.)

Usage

Workflows are processed like software builds: File groups (depending on one another) are the targets to be built in each workspace, and all workspaces are built recursively. A build is finished when all targets exist and none are older than their respective prerequisites (e.g. image files).
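The build rule described above can be illustrated with a small shell sketch. It is not part of workflow-configuration: the helper `needs_rebuild` is hypothetical, and the sketch assumes that each file group is represented by a directory of the same name inside the workspace. It only mimics the timestamp comparison that make performs between a target and one prerequisite.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): succeed if the target is missing
# or older than its prerequisite, i.e. if make would consider it out of date.
needs_rebuild() {
    local target=$1 prerequisite=$2
    # A missing target always has to be built.
    [ -e "$target" ] || return 0
    # A prerequisite newer than the target makes the target stale.
    [ "$prerequisite" -nt "$target" ] && return 0
    return 1
}

# Demo in a throwaway directory standing in for a workspace.
demo=$(mktemp -d)
mkdir "$demo/OCR-D-IMG"
touch -t 202001010000 "$demo/OCR-D-IMG"
needs_rebuild "$demo/OCR-D-BIN" "$demo/OCR-D-IMG" && echo "OCR-D-BIN is missing: build"

mkdir "$demo/OCR-D-BIN"
touch -t 202101010000 "$demo/OCR-D-BIN"
needs_rebuild "$demo/OCR-D-BIN" "$demo/OCR-D-IMG" || echo "OCR-D-BIN is newer than OCR-D-IMG: up to date"

touch -t 202201010000 "$demo/OCR-D-IMG"
needs_rebuild "$demo/OCR-D-BIN" "$demo/OCR-D-IMG" && echo "OCR-D-IMG has changed: rebuild"
rm -rf "$demo"
```

The explicit timestamps set via `touch -t` keep the demo deterministic. In a real workspace the modification times come from the processors writing their output.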

To run a configuration…

  1. Activate working environment (virtualenv) and change to the target directory.
  2. Choose (or create) a workflow configuration makefile.
    (Yes, you can look inside and browse its rules!)
  3. Execute:

     cd WORKSPACE && make [OPTIONS] -f WORKFLOW-CONFIG.mk # or
     make -C WORKSPACE [OPTIONS] -f WORKFLOW-CONFIG.mk
    

    … for processing a single workspace directory, or …

     ocrd-make [OPTIONS] -f WORKFLOW-CONFIG.mk all
    

    (The special target all (which is also the default goal) will search for all workspaces in the current directory recursively.) You can also run on a subset of workspaces by passing these as goals on the command line…

     ocrd-make -f WORKFLOW-CONFIG.mk PATH/TO/WORKSPACE1 PATH/TO/WORKSPACE2 ...
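To preview which directories a recursive search might pick up, a sketch like the following can help. It assumes that a workspace is any directory containing a `mets.xml` file (the usual OCR-D convention); the exact search criteria of the `all` target are not stated here, and `find_workspaces` is a hypothetical helper, not a command shipped with this package.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): list all directories below a root
# (default: current directory) that contain a mets.xml file.
find_workspaces() {
    find "${1:-.}" -type f -name mets.xml -exec dirname {} \; | sort
}

# Demo: two workspaces and one unrelated directory in a throwaway tree.
demo=$(mktemp -d)
mkdir -p "$demo/books/vol1" "$demo/books/vol2" "$demo/empty"
touch "$demo/books/vol1/mets.xml" "$demo/books/vol2/mets.xml"
# Lists ./books/vol1 and ./books/vol2, but not ./empty.
(cd "$demo" && find_workspaces .)
rm -rf "$demo"
```

The resulting paths have the same form as the workspace goals passed on the command line in the last step above.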
    

To get help:

[ocrd-]make help

To get a short description of the chosen configuration:

[ocrd-]make -f CONFIGURATION.mk info

To see the command sequence that would be executed for the chosen configuration (in the format of ocrd process):

[ocrd-]make -f CONFIGURATION.mk show

To run a workflow server for the command sequence that would be executed for the chosen configuration (to be controlled via ocrd workflow client or HTTP):

[ocrd-]make -f CONFIGURATION.mk server

To create workspaces from directories which contain image files:

ocrd-import DIRECTORY

To get help for the import tool:

ocrd-import --help

To perform various tasks via XSLT on PAGE-XML files (these all share the same options, including --help):

page-add-nsprefix-pc # adds namespace prefix 'pc:'
page-remove-metadataitem # remove all MetadataItem entries
page-remove-dead-regionrefs # remove non-existing regionRefs
page-remove-empty-readingorder # remove empty ReadingOrder or groups
page-remove-words # remove all Word (and Glyph) entries
page-remove-glyphs # remove all Glyph entries
page-fix-coords # replace negative values in coordinates by zero
page-move-alternativeimage-below-page # try to push page-level AlternativeImage back to subsegments
page-textequiv-lines-to-regions # project text from TextLines to TextRegions (concat with LF in between)
page-textequiv-words-to-lines # project text from Words to TextLines (concat with spaces in between)
page-extract-lines # extract TextLine/TextEquiv/Unicode consecutively
page-extract-words # extract Word/TextEquiv/Unicode consecutively
page-extract-glyphs # extract Glyph/TextEquiv/Unicode consecutively

To perform the same transformations, but as a workspace processor:

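# apply one of the transformations listed above:
ocrd-page-transform -P xsl page-remove-words.xsl
# or write a custom stylesheet and apply that instead:
cat <<'EOF' > my-transform.xsl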
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <xsl:output method="xml" standalone="yes" encoding="UTF-8" omit-xml-declaration="no"/>
  <xsl:template match="//pc:Word"/>
  <xsl:template match="node()|text()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|text()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
EOF
ocrd-page-transform -P xsl my-transform.xsl

To spawn a new configuration file, in the directory of the source repository, do:

make NEW-CONFIGURATION.mk

Furthermore, you can add any options that make understands (see make --help or info make 'Options Summary').

For example, to rebuild anything after the fileGrp OCR-D-BIN, do:

ocrd-make -f CONFIGURATION.mk -W OCR-D-BIN all

You can also use that pattern to specify any fileGrp other than the .DEFAULT_GOAL of your configuration as the overall target. For example, to build anything up to the fileGrp OCR-D-SEG-LINE, do:

ocrd-make -f CONFIGURATION.mk .DEFAULT_GOAL=OCR-D-SEG-LINE all

If you run make in the workspace directly instead of having ocrd-make do it recursively, then no all target exists and you can directly set the target fileGrp to replace .DEFAULT_GOAL:

make -C WORKSPACE -f CONFIGURATION.mk -W OCR-D-BIN
make -C WORKSPACE -f CONFIGURATION.mk OCR-D-SEG-LINE

There are 2 special variables. To process only a subset of pages in all fileGrps, use PAGES. For example, to only consider pages PHYS_0005 through PHYS_0007, do:

ocrd-make -f CONFIGURATION.mk all PAGES=PHYS_0005..PHYS_0007
make -C WORKSPACE -f CONFIGURATION.mk PAGES=PHYS_0005..PHYS_0007

And to override the default (or configured) log levels for all processors and libraries, use LOGLEVEL. For example, to get debugging everywhere, do:

ocrd-make -f CONFIGURATION.mk all LOGLEVEL=DEBUG
make -C WORKSPACE -f CONFIGURATION.mk LOGLEVEL=DEBUG

Customisation

To write new configurations, first choose a (sufficiently descriptive) makefile name, and spawn a new file for that: make -C workflow-configuration NEW-CONFIGURATION.mk (or copy from an existing configuration).

Next, edit the file to your needs: Write rules using file groups as prerequisites/targets in the normal GNU make syntax. The first target defined must be the default goal that builds the very last file group for that configuration, or else a variable .DEFAULT_GOAL pointing to that target must be set anywhere in the makefile.

Recommendations

Example

INPUT = OCR-D-GT-SEG-LINE

$(INPUT):
	ocrd workspace find -G $@ --download
	ocrd workspace find -G OCR-D-IMG --download # just in case

# You can use variables for file group names to keep the rules brief:
BIN = $(INPUT)-BINPAGE

# This is how you use the pattern rule from Makefile (included below):
# The prerequisite will become the input file group,
# the target will become the output file group,
# the recipe will call the executable given by TOOL,
# also generating a JSON parameter file from PARAMS:
$(BIN): $(INPUT)
$(BIN): TOOL = ocrd-olena-binarize
$(BIN): PARAMS = "impl": "sauvola-ms-split"
# or equivalently:
$(BIN): OPTIONS = -P impl sauvola-ms-split

# You can also use the file group names directly:
OCR-D-OCR-TESS: $(BIN)
OCR-D-OCR-TESS: TOOL = ocrd-tesserocr-recognize
OCR-D-OCR-TESS: PARAMS = "textequiv_level": "glyph", "model": "frk+deu"
# or equivalently:
OCR-D-OCR-TESS: OPTIONS = -P textequiv_level glyph -P model frk+deu

# This uses more than 1 input file group and no output file group,
# which works with the standard recipe as well (but mind the ordering):
EVAL: $(INPUT) OCR-D-OCR-TESS
EVAL: TOOL = ocrd-cor-asv-ann-evaluate

# Because the first target in this file is not EVAL,
# we must override the default goal to be our desired overall target:
.DEFAULT_GOAL = EVAL

# ALWAYS necessary:
include Makefile

Results

OCR-D ground truth

Note: these results are no longer meaningful and should be updated!

For the data_structure_text/dta repository, which includes both layout and text annotation down to the textline level, but very coarse segmentation, the following character error rate (CER) was measured:

pipeline configuration CER
OCR-D-OCR-OCRO-fraktur-BINPAGE-sauvola-CLIP-RESEG-DEWARP .243
OCR-D-OCR-OCRO-fraktur-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .241
OCR-D-OCR-OCRO-fraktur-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .255
OCR-D-OCR-OCRO-fraktur-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .252
OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .263
OCR-D-OCR-OCRO-fraktur-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .248
OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .262
OCR-D-OCR-OCRO-fraktur-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .273
OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .266
   
OCR-D-OCR-OCRO-frakturjze-BINPAGE-sauvola-CLIP-RESEG-DEWARP .290
OCR-D-OCR-OCRO-frakturjze-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .287
OCR-D-OCR-OCRO-frakturjze-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .301
OCR-D-OCR-OCRO-frakturjze-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .296
OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .317
OCR-D-OCR-OCRO-frakturjze-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .292
OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .314
OCR-D-OCR-OCRO-frakturjze-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .325
OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .318
   
OCR-D-OCR-TESS-Fraktur-BINPAGE-sauvola-CLIP-RESEG-DEWARP .114
OCR-D-OCR-TESS-Fraktur-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .113
OCR-D-OCR-TESS-Fraktur-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .127
OCR-D-OCR-TESS-Fraktur-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .121
OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .122
OCR-D-OCR-TESS-Fraktur-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .118
OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .122
OCR-D-OCR-TESS-Fraktur-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .124
OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .123
   
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-CLIP-RESEG-DEWARP .117
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .116
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .131
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .121
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .126
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .122
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .124
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .128
OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .126
   
OCR-D-OCR-TESS-frk-BINPAGE-sauvola-CLIP-RESEG-DEWARP .110
OCR-D-OCR-TESS-frk-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .109
OCR-D-OCR-TESS-frk-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .126
OCR-D-OCR-TESS-frk-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .119
OCR-D-OCR-TESS-frk-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .118
OCR-D-OCR-TESS-frk-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .115
OCR-D-OCR-TESS-frk-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .116
OCR-D-OCR-TESS-frk-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .120
OCR-D-OCR-TESS-frk-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .119
   
OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-CLIP-RESEG-DEWARP .106
OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .106
OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .122
OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .114
OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .113
OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .111
OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .112
OCR-D-OCR-TESS-frk+deu-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .117
OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .115
   
OCR-D-OCR-TESS-gt4histocr-BINPAGE-sauvola-CLIP-RESEG-DEWARP .078
OCR-D-OCR-TESS-gt4histocr-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .081
OCR-D-OCR-TESS-gt4histocr-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .094
OCR-D-OCR-TESS-gt4histocr-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .085
OCR-D-OCR-TESS-gt4histocr-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .089
OCR-D-OCR-TESS-gt4histocr-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .084
OCR-D-OCR-TESS-gt4histocr-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .090
OCR-D-OCR-TESS-gt4histocr-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .091
OCR-D-OCR-TESS-gt4histocr-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .094
   
OCR-D-OCR-CALA-gt4histocr-BINPAGE-sauvola-CLIP-RESEG-DEWARP .081
OCR-D-OCR-CALA-gt4histocr-BINPAGE-sauvola-DESKEW-ocropy-CLIP-RESEG-DEWARP .074
OCR-D-OCR-CALA-gt4histocr-BINPAGE-sauvola-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .087
OCR-D-OCR-CALA-gt4histocr-BINPAGE-sauvola-DENOISE-ocropy-CLIP-RESEG-DEWARP .084
OCR-D-OCR-CALA-gt4histocr-BINPAGE-wolf-DENOISE-ocropy-CLIP-RESEG-DEWARP .085
OCR-D-OCR-CALA-gt4histocr-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .086
OCR-D-OCR-CALA-gt4histocr-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-RESEG-DEWARP .109
OCR-D-OCR-CALA-gt4histocr-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .090
OCR-D-OCR-CALA-gt4histocr-BINPAGE-wolf-DENOISE-ocropy-DESKEW-ocropy-CLIP-DESKEW-tesseract-RESEG-DEWARP .110

Hence, it appears that consistently (across different OCRs) …

However, this result is still preliminary. The processor implementations keep evolving, and the GT annotations get fixed over time.

Implementation

To make writing (and reading) configurations as simple as possible, they are expressed as rules operating on METS file groups (i.e. workspace-local). For convenience, the most common recipe pattern, which involves only 1 input and 1 output file group via some OCR-D CLI, is available via a static pattern rule. It merely takes the target-specific variables TOOL (the CLI executable) and optionally PARAMS (a JSON-formatted list of parameter assignments) or OPTIONS (a white-space separated list of parameter assignments). Custom rules are possible as well. If the makefile does not start with the overall target, it must specify its .DEFAULT_GOAL, so callers can run without knowledge of the target names.
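As a rough illustration of how TOOL, PARAMS and the file groups of a rule relate to a processor call, consider the sketch below. It is not the recipe from the shared Makefile (which generates a JSON parameter file); the helpers `params_to_json` and `build_command` are hypothetical and only print the command line. The sketch assumes the standard OCR-D processor options `-I`, `-O` and `-p`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): wrap a PARAMS-style list of
# assignments in braces so that it becomes a JSON object.
params_to_json() {
    printf '{ %s }\n' "$1"
}

# Hypothetical helper: print (do not run) the processor call for one rule,
# given tool, input file group, output file group and PARAMS.
build_command() {
    local tool=$1 input=$2 output=$3 params=$4
    printf "%s -I %s -O %s -p '%s'\n" \
        "$tool" "$input" "$output" "$(params_to_json "$params")"
}

# The binarization rule from the example configuration:
build_command ocrd-olena-binarize OCR-D-GT-SEG-LINE OCR-D-GT-SEG-LINE-BINPAGE \
    '"impl": "sauvola-ms-split"'
```

With OPTIONS instead of PARAMS, the same assignment would be written as `-P impl sauvola-ms-split` on the processor's command line.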

Rules that are not configuration-specific (like the static pattern rule) are all shared by including a common Makefile at the end of configuration makefiles (which gets copied from workflow.mk at install time).

make always operates on the level of the workspace directory (i.e. only one at a time), where targets are fileGrps and the default goal is the maximum fileGrp.

For running entire collections of workspaces (possibly in parallel), recursive make has been abandoned in favour of the parallel-based bash script ocrd-make. Its command-line interface looks like make, but the targets are workspaces and the default goal is all (which recursively finds all workspaces).
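Conceptually, a run over several workspaces boils down to one make call per workspace. The following dry-run sketch only prints those calls; `plan_builds` is a hypothetical helper and does not reproduce the option handling, workspace discovery or job control of the real ocrd-make script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): print one make call per workspace
# for a given configuration makefile, without executing anything.
plan_builds() {
    local config=$1
    shift
    local workspace
    for workspace in "$@"; do
        printf 'make -C %s -f %s\n' "$workspace" "$config"
    done
}

plan_builds CONFIGURATION.mk PATH/TO/WORKSPACE1 PATH/TO/WORKSPACE2
```

Since GNU parallel runs each line read from standard input as a command when no command is given, such a list could in principle be piped into `parallel --jobs 2` to build two workspaces at a time.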

GPU vs CPU parallelism

When executing workflows in parallel across workspaces (with --jobs) on multiple CPUs, it must be ensured that not too many OCR-D processors which use GPU resources are running concurrently (to prevent over-allocation of GPU memory). Thus, make needs to know:

  1. which processors (have/want to) use GPU resources, and
  2. how many such processors can run in parallel.

It can then synchronize these processors with a semaphore. This is achieved by expanding the static pattern rule with a synchronisation mechanism (based on GNU parallel). Workflow configurations can use that by setting the target-specific variable GPU to a non-empty value for the respective rules. (Custom recipes will have to use sem --id OCR-D-GPUSEM.)

That way, races are prevented, but also GPUs cannot become the bottleneck: When all GPUs are busy, processors will fall back to CPU.
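For custom recipes, a wrapper along the following lines is one way to go through the same semaphore. This is a minimal sketch, not the mechanism of the static pattern rule: `run_on_gpu` and the variable `NGPUS` are hypothetical names, the CPU fallback for busy GPUs described above is not modelled, and when `sem` (from GNU parallel) is not installed the command simply runs unguarded.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (illustration only): run a command while holding a
# slot of the semaphore OCR-D-GPUSEM. NGPUS (an assumed variable name) gives
# the number of commands that may hold a slot at the same time.
run_on_gpu() {
    if command -v sem >/dev/null 2>&1; then
        # --fg keeps the command in the foreground, so the caller waits for
        # it and sees its output and exit status.
        sem --fg --id OCR-D-GPUSEM -j "${NGPUS:-1}" "$@"
    else
        # No GNU parallel available: run without synchronisation.
        "$@"
    fi
}

run_on_gpu echo "holding one GPU slot"
```

In a makefile recipe, the call to the GPU-enabled processor would take the place of the `echo` command.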

Workspace vs page parallelism

When executing workflows in parallel across workspaces (with --jobs) on multiple CPUs, it must be ensured that OCR-D processors do not use local multiprocessing facilities themselves (to prevent over-allocation of CPUs).

In the current state of affairs, OCR-D processors cannot be run in parallel across pages via multiprocessing. (At least, they are never implemented that way.) That may change in the future with a new OCR-D API. But still, many processors do already use libraries like OpenMP or OpenBLAS which use multiprocessing locally within pages. This can be controlled via environment variables like OMP_THREAD_LIMIT.

To prevent such over-allocation, these variables are exported to all recipes with a value of 1 when -j is used, and otherwise with half the number of physical CPUs (unless NTHREADS is explicitly given).
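The rule for the thread limit can be sketched in shell as follows. This is an illustration under assumptions, not the code from the shared Makefile: `thread_limit` and `JOBS` are hypothetical names, "when -j is used" is modelled as more than one job, and `nproc` reports logical rather than physical CPUs, so it only approximates the count meant above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): choose the per-processor thread
# limit from the number of parallel jobs and the number of CPUs.
thread_limit() {
    local jobs=$1 ncpus=$2 half
    if [ "$jobs" -gt 1 ]; then
        # Workspaces already run in parallel: one thread per processor.
        echo 1
    else
        # Single job: allow half of the CPUs, but at least one thread.
        half=$(( ncpus / 2 ))
        [ "$half" -ge 1 ] || half=1
        echo "$half"
    fi
}

JOBS=${JOBS:-1}                        # assumed name for the job count
NCPUS=$(nproc 2>/dev/null || echo 2)   # logical CPUs; falls back to 2
# An explicitly given NTHREADS takes precedence over the computed value.
NTHREADS=${NTHREADS:-$(thread_limit "$JOBS" "$NCPUS")}
export OMP_THREAD_LIMIT=$NTHREADS
echo "OMP_THREAD_LIMIT=$OMP_THREAD_LIMIT"
```

Other libraries may need their own variables set in the same way; only OMP_THREAD_LIMIT is named in the text above.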