Scope: image processing, training and evaluation of a ViT model, input file/directory processing, output of class (category) results for the top N predictions, summarizing predictions into tabular format, Hugging Face (HF) hub support for the model, and multiplatform (Windows/Linux) data preparation scripts for PDF-to-PNG conversion.
- Model description
- How to install
- How to run
- Results
- For developers
- Data preparation
- Contacts
- Acknowledgements
Fine-tuned model repository: UFAL's vit-historical-page [1]
Base model repository: Google's vit-base-patch16-224 [2]
The model was trained on a manually annotated dataset of historical documents, specifically images of pages from archival documents scanned into digital form from paper sources. The images contain various combinations of text, tables, drawings, and photos; the categories described below were formed based on those archival documents.
The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned paper PDF source, into one of the categories, each of which feeds a content-specific downstream processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten, plain printed text, or text structured in tabular format, and to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
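For orientation, the following minimal Python sketch shows how such a classification call could look with the transformers library. This is an illustration, not the project's run.py; the repo id "ufal/vit-historical-page" is an assumption, so check the model card for the exact id.

```python
# Minimal inference sketch -- not the project's run.py, just an illustration.
# The repo id "ufal/vit-historical-page" is an assumption; check the model card.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("ufal/vit-historical-page")
model = AutoModelForImageClassification.from_pretrained("ufal/vit-historical-page")

image = Image.open("page.png").convert("RGB")          # one PNG page
inputs = processor(images=image, return_tensors="pt")  # resize + normalize to 224x224

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
top = probs.topk(3)  # TOP-3 guesses
for score, idx in zip(top.values, top.indices):
    print(f"{model.config.id2label[idx.item()]}: {score:.3f}")
```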
Training set of the model: 8950 images
Evaluation set (10% of the whole dataset, with the same category proportions as below), model_EVAL.csv: 995 images
Manual annotation was performed beforehand and took some time; the categories were formed from different sources of archival documents dating from 1920 to 2020. The disproportion between categories is NOT intentional, but rather a result of the nature of the source data.
In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were one page long and some were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant considering the model's vision resolution; however, all of the data samples came from archaeological reports, which may somewhat affect drawing detection, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.
| Label | Ratio | Description |
|---|---|---|
| DRAW | 11.89% | drawings, maps, paintings with text |
| DRAW_L | 8.17% | drawings, etc. with a table legend or inside a tabular layout / form |
| LINE_HW | 5.99% | handwritten text lines inside a tabular layout / form |
| LINE_P | 6.06% | printed text lines inside a tabular layout / form |
| LINE_T | 13.39% | machine-typed text lines inside a tabular layout / form |
| PHOTO | 10.21% | photos with text |
| PHOTO_L | 7.86% | photos inside a tabular layout / form or with a tabular annotation |
| TEXT | 8.58% | mixed types of printed and handwritten text |
| TEXT_HW | 7.36% | only handwritten text |
| TEXT_P | 6.95% | only printed text |
| TEXT_T | 13.53% | only machine-typed text |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings OR photos)
- type of text (handwritten OR printed OR typed OR mixed)
- presence of a tabular layout / form
The reason for such a distinction is that different types of pages go through different processing pipelines after classification.
The easiest way to obtain the model is via the HF hub repository [1], which can be easily accessed through this project. Step-by-step installation instructions are provided below.
Warning
Make sure you have Python version 3.10+ installed on your machine. Then create a separate virtual environment for this project.
How to
Clone this project to your local machine via:
cd /local/folder/for/this/project
git clone https://github.com/ufal/atrium-page-classification.git
Follow the Unix / Windows-specific instructions in the venv docs [3] if you don't know how to create one.
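A typical creation step (a minimal sketch using the standard venv module; pick any directory name you like) is:

```
python3 -m venv <your_venv_dir>
```

After creating the venv folder, activate the environment via: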
source <your_venv_dir>/bin/activate
and then, inside your virtual environment, install the Python libraries (this takes time).
Note
Up to 1 GB of space is needed for the model files and checkpoints, and up to 7 GB for the Python libraries (PyTorch and its dependencies, etc.).
This can be done via:
pip install -r requirements.txt
To test that everything works and to see the flag descriptions, call --help:
python3 run.py -h
To pull the model directly from the HF hub repository, load the model via:
python3 run.py --hf
You should see a message about the model being loaded from the hub and then saved locally. Only after you have obtained the trained model files (which takes less time than installing the dependencies) can you use the commands described below.
Important
Unless you already have the model files in the model/model_version directory next to this file, you must use the --hf flag to download the model files from the HF repo [1].
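If you prefer to fetch the files yourself instead of using the --hf flag, a minimal sketch with the huggingface_hub library follows; the repo id and target directory are assumptions based on the project tree below.

```python
# Sketch: manual download of the fine-tuned model files from the HF hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ufal/vit-historical-page",  # assumed repo id -- check the model card
    local_dir="model/model_version",     # layout assumed from the project tree below
)
```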
After the model is downloaded, you should see a similar file structure:
Full project tree file structure:

```
/local/folder/for/this/project
├── model
│   └── model_version
│       ├── config.json
│       ├── model.safetensors
│       └── preprocessor_config.json
├── checkpoint
│   ├── models--google--vit-base-patch16-224
│   │   ├── blobs
│   │   ├── snapshots
│   │   └── refs
│   └── .locks
│       └── models--google--vit-base-patch16-224
├── model_output
│   └── checkpoint-version
│       ├── config.json
│       ├── model.safetensors
│       ├── trainer_state.json
│       ├── optimizer.pt
│       ├── scheduler.pt
│       ├── rng_state.pth
│       ├── training_args.bin
│       └── ...
├── data_scripts
│   ├── windows
│   │   ├── move_single.bat
│   │   ├── pdf2png.bat
│   │   └── sort.bat
│   └── unix
│       ├── move_single.sh
│       ├── pdf2png.sh
│       └── sort.sh
├── result
│   ├── plots
│   │   ├── date-time_conf_mat.png
│   │   └── ...
│   └── tables
│       ├── date-time_TOP-N.csv
│       ├── date-time_TOP-N_EVAL.csv
│       ├── date-time_EVAL_RAW.csv
│       └── ...
├── run.py
├── classifier.py
├── utils.py
├── requirements.txt
├── config.txt
├── README.md
└── ...
```
Some of the folders listed above may be missing, e.g. model_output, which is created only after training the model.
There are two main ways to run the program:
- Single PNG file classification
- Classification of a directory of PNG files
To begin with, open config.txt and change the folder path in the [INPUT] section, then optionally change top_N and batch in the [SETUP] section.
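For illustration, the relevant part of config.txt might look like the following. Only top_N and batch are named in this README; the other key names here are assumptions, so keep whatever keys your copy already contains:

```ini
[INPUT]
; path to the folder with PNG files to classify (key name is an assumption)
directory = /full/path/to/directory/with/png/files

[SETUP]
top_N = 3
batch = 8
```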
Note
Top-3 is enough to cover most of the images; setting Top-5 will help with a small number of difficult-to-classify samples.
Caution
Do not change base_model and other section contents unless you know what you are doing.
Single-file prediction should be run using the -f or --file flag with a path argument. Optionally, you can use the -tn or --topn flag with the number of guesses you want to get, and the -m or --model flag with the path to the model folder.
How to
Run the program from its starting point, run.py, with optional flags:
python3 run.py -tn 3 -f '/full/path/to/file.png' -m '/full/path/to/model/folder'
for exactly TOP-3 guesses
OR, if you are sure about the default variables set in config.txt:
python3 run.py -f '/full/path/to/file.png'
to run single PNG file classification; the output will be printed to the console.
Note
The console output and all result tables contain normalized scores for the top N classes.
Directory prediction does not require explicitly setting the directory path with -d or --directory, since its default value is set in the config.txt file and is used when the --dir flag is given. The same flags for the number of guesses and the model folder path as for single page processing can be used. In addition, two directory-specific flags, --inner and --raw, are available.
Caution
You must either explicitly set the -d flag's argument or use the --dir flag (which calls the preset default input directory) to process PNG files at the directory level; otherwise nothing will happen.
How to
python3 run.py -tn 3 -d '/full/path/to/directory' -m '/full/path/to/model/folder'
for exactly TOP-3 guesses from all images found in the given directory.
OR, if you are really sure about the default variables set in config.txt:
python3 run.py --dir
The classification results for the PNG pages collected from the directory will be saved to the results folders defined in the [OUTPUT] section of the config.txt file.
Tip
To additionally get raw class probabilities from the model along with the TOP-N results, use the --raw flag when processing a directory.
Tip
To process all PNG files in the directory AND its subdirectories, use the --inner flag when processing a directory.
Accuracy measurements, confusion matrix plots for the evaluation dataset, and result tables can be found in the results folder.
Evaluation set's accuracy (Top-3): 99.6%
Evaluation set's accuracy (Top-1): 97.3%
By running tests on the evaluation dataset after training, you can generate the following output files:
- date-time_model_TOP-N_EVAL.csv - results on the evaluation dataset with TOP-N guesses
- date-time_conf_mat_TOP-N.png - confusion matrix plot for the evaluation dataset, also with TOP-N guesses
- date-time_model_EVAL_RAW.csv - raw probabilities for all classes on the evaluation dataset
Note
Generated tables are sorted by the FILE and PAGE number columns in ascending order.
General result tables

Demo files:
- Manually checked (small): model_TOP-5.csv
- Manually checked evaluation dataset (TOP-3): model_TOP-3_EVAL.csv
- Manually checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv
- Unchecked with TRUE values: model_TOP-3.csv
With the following columns:
- FILE - name of the file
- PAGE - page number
- CLASS-N - label of the category, TOP-N guess
- SCORE-N - score of the category, TOP-N guess
and optionally:
- TRUE - actual label of the category
Raw result tables

Demo files:
- Manually checked evaluation dataset RAW: model_RAW_EVAL.csv
- Unchecked with TRUE values RAW: model_RAW.csv
With the following columns:
- FILE - name of the file
- PAGE - page number
- <CATEGORY_LABEL> - separate columns for each of the defined classes
- TRUE - actual label of the category
The reason to use the --raw flag is convenience of results review: when the table is sorted in descending order by all <CATEGORY_LABEL> columns, the most ambiguous cases are expected to end up at the bottom, while the most obvious (for the model) cases are expected to be at the top.
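A minimal pandas sketch of such a review (the date-time filename below is a placeholder, as in the tree above):

```python
# Sort a --raw output table so the most ambiguous pages end up at the bottom.
import pandas as pd

df = pd.read_csv("result/tables/date-time_EVAL_RAW.csv")  # placeholder filename
category_columns = [c for c in df.columns if c not in ("FILE", "PAGE", "TRUE")]
df = df.sort_values(by=category_columns, ascending=False)
print(df.tail(10))  # the most ambiguous cases, worth manual review
```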
Use this project's code as a base for your own image classification tasks. Instructions for the key phases of the process are provided below.
Project files description

| File Name | Description |
|---|---|
| classifier.py | Model-specific classes and related functions, including predefined values for training arguments |
| utils.py | Task-related algorithms |
| run.py | Starting point of the program with its main function; can be edited to extend flags and function arguments |
| config.txt | Changeable variables for the program; should be edited |
Most of the changeable variables are in the config.txt file, specifically in the [TRAIN], [HF], and [SETUP] sections.
For more detailed training process adjustments, refer to the related functions in the classifier.py file, where you will find some predefined values not used in the run.py file.
To train the model run:
python3 run.py --train
To evaluate the model and create a confusion matrix plot, run:
python3 run.py --eval
Important
In both cases, you must make sure that the training data directory is set correctly in config.txt and that it contains category subdirectories with images inside. The names of the category subdirectories become the actual label names and replace the default category list.
During training, image transformations were applied sequentially, each with a 50% chance.
Image preprocessing steps
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
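Assembled into a pipeline, these steps could look like the following sketch, with each transform wrapped in RandomApply with p=0.5 to reproduce the 50% chance described above (see classifier.py for the exact implementation):

```python
# Sketch of the augmentation pipeline; each step fires independently with p=0.5.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```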
Note
No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons behind this are pages containing specific form types, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 resolution images.
Training hyperparameters
- eval_strategy = "epoch"
- save_strategy = "epoch"
- learning_rate = 5e-5
- per_device_train_batch_size = 8
- per_device_eval_batch_size = 8
- num_train_epochs = 3
- warmup_ratio = 0.1
- logging_steps = 10
- load_best_model_at_end = True
- metric_for_best_model = "accuracy"
Above are the default hyperparameters used in the training process; they can be changed in the classifier.py file, where the model is defined and trained.
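For reference, these defaults map onto a standard transformers setup roughly as follows (a sketch; depending on your transformers version the first argument may be called evaluation_strategy, and see classifier.py for the values actually used):

```python
# Sketch of the training arguments listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_output",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```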
There are useful multiplatform scripts in the data_scripts folder for the whole data preparation process.
Note
The .sh scripts are written for Unix-like systems and the .bat scripts for Windows.
On Windows, you must also install the following software before converting PDF documents to PNG images:
- ImageMagick [4] - download and install the latest version
- Ghostscript [5] - download and install the latest AGPL version (32- or 64-bit)
The source set of PDF documents must be converted to page-specific PNG images before processing. The following steps describe the procedure of converting PDF documents to PNG images suitable for training, evaluation, and prediction.
Firstly, copy the PDF-to-PNG converter script to the directory containing the PDF documents.
How to
Windows:
move \local\folder\for\this\project\data_scripts\pdf2png.bat \full\path\to\your\folder\with\pdf\files
Unix:
cp /local/folder/for/this/project/data_scripts/pdf2png.sh /full/path/to/your/folder/with/pdf/files
Now check the content and the comments in the pdf2png.sh or pdf2png.bat script, and run it. You can optionally comment out the removal of the processed PDF files in the script, but this is not recommended if you are going to launch the script several times from the same location.
How to
Windows:
cd \full\path\to\your\folder\with\pdf\files
pdf2png.bat
Unix:
cd /full/path/to/your/folder/with/pdf/files
pdf2png.sh
After the script is done, you will have a directory full of document-specific subdirectories containing page-specific images, with a structure similar to:
Unix folder tree structure:

```
/full/path/to/your/folder/with/pdf/files
├── PdfFile1Name
│   ├── PdfFile1Name-001.png
│   ├── PdfFile1Name-002.png
│   └── ...
├── PdfFile2Name
│   ├── PdfFile2Name-01.png
│   ├── PdfFile2Name-02.png
│   └── ...
├── PdfFile3Name
│   └── PdfFile3Name-1.png
├── PdfFile4Name
└── ...
```
Note
The page numbers are padded with zeros (on the left) to match the length of the last page number in each PDF file; this is done automatically by the pdftoppm command used on Unix, while ImageMagick's convert command used on Windows does not pad the page numbers.
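For reference, the core Unix conversion step inside the script presumably resembles the following pdftoppm call (the exact options may differ; check pdf2png.sh):

```
pdftoppm -png PdfFile1Name.pdf PdfFile1Name/PdfFile1Name
```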
Windows folder tree structure:

```
\full\path\to\your\folder\with\pdf\files
├── PdfFile1Name
│   ├── PdfFile1Name-1.png
│   ├── PdfFile1Name-2.png
│   └── ...
├── PdfFile2Name
│   ├── PdfFile2Name-1.png
│   ├── PdfFile2Name-2.png
│   └── ...
├── PdfFile3Name
│   └── PdfFile3Name-1.png
├── PdfFile4Name
└── ...
```
Optionally, you can use the move_single.sh or move_single.bat script to move all PNG files from directories containing a single PNG file into a common directory of one-pagers.
How to
Windows:
move \local\folder\for\this\project\data_scripts\move_single.bat \full\path\to\your\folder\with\pdf\files
cd \full\path\to\your\folder\with\pdf\files
move_single.bat
Unix:
cp /local/folder/for/this/project/data_scripts/move_single.sh /full/path/to/your/folder/with/pdf/files
cd /full/path/to/your/folder/with/pdf/files
move_single.sh
The reason for moving these files is simply convenience in the annotation process that follows. These changes are accounted for in the sort.sh and sort.bat scripts as well.
Prepare a CSV table with the following columns:
- FILE - name of the PDF document that was the source of the page
- PAGE - page number (NOT padded with zeros)
- CLASS - label of the category
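For illustration, a few hypothetical rows of such a table:

```csv
FILE,PAGE,CLASS
PdfFile1Name,1,DRAW
PdfFile1Name,2,TEXT_T
PdfFile2Name,1,PHOTO_L
```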
Tip
Prepare categories of roughly equal size, if possible, so that the model will not be biased towards the over-represented labels.
It takes time to collect at least several hundred examples per category.
Cluster the annotated data into separate folders using the sort.sh or sort.bat script, which copies data from the source folder to the training folder, where each category has its own subdirectory.
How to
Windows:
sort.bat
Unix:
sort.sh
Warning
It does NOT matter from which directory you launch the sorting script, but you must check the top of the script for (1) the path to the CSV table with annotations, (2) the path to the directory containing document-specific subdirectories of page-specific PNG files, and (3) the path to the directory where you want to store the training data as label-specific directories of annotated page images.
After the script is done, you will have a directory full of label-specific subdirectories containing document-specific pages, with a structure similar to:
Unix folder tree structure:

```
/full/path/to/your/folder/with/train/pages
├── Label1
│   ├── PdfFileAName-00N.png
│   ├── PdfFileBName-0M.png
│   └── ...
├── Label2
├── Label3
├── Label4
└── ...
```
Windows folder tree structure:

```
\full\path\to\your\folder\with\train\pages
├── Label1
│   ├── PdfFileAName-N.png
│   ├── PdfFileBName-M.png
│   └── ...
├── Label2
├── Label3
├── Label4
└── ...
```
Before running the training, make sure to check the config.txt file for the [TRAIN] section variables, where you should set the path to the data folder.
Tip
In the config.txt file, tweak the max_categ parameter, the maximum number of samples per category, in case you have over-represented labels significantly dominating in size. Set max_categ higher than the number of samples in the largest category to use all data samples.
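For illustration, the [TRAIN] section might look like this (max_categ is the documented parameter; the other key name and the values are assumptions, so keep the keys your copy already has):

```ini
[TRAIN]
; path to the label-specific training folders (key name is an assumption)
data_dir = /full/path/to/your/folder/with/train/pages
; cap on samples per category; set above the largest category size to use all data
max_categ = 1500
```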
For support, write to lutsai.k@gmail.com, the maintainer responsible for this repository [6].
- Developed by UFAL [7]
- Funded by ATRIUM [8]
- Shared by ATRIUM [8] & UFAL [7]
- Model type: fine-tuned ViT with 224x224 input resolution [2]
© 2022 UFAL & ATRIUM