
Image classification using fine-tuned ViT - for historical documents sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: processing of images, training and evaluation of the ViT model, input file/directory processing, output of the top-N class 🏷️ (category) predictions, summarizing of predictions into a tabular format, HF 😊 hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF-to-PNG conversion

Table of contents πŸ“‘

  • Model description πŸ“‡
  • Data πŸ“œ
  • Categories 🏷️
  • How to install πŸ”§
  • How to run ▢️
  • Results πŸ“Š
  • For developers πŸ› οΈ
  • Data preparation πŸ“¦
  • Contacts πŸ“§
  • Acknowledgements πŸ™
  • Appendix πŸ€“


Model description πŸ“‡

πŸ”² Fine-tuned model repository: UFAL's vit-historical-page 1 πŸ”—

πŸ”³ Base model repository: Google's vit-base-patch16-224 2 πŸ”—

The model was trained on a manually annotated dataset of historical documents, specifically images of pages from archival documents that were scanned from paper sources into digital form. The images contain various combinations of text πŸ“„, tables πŸ“, drawings πŸ“ˆ, and photos πŸŒ„ - the categories 🏷️ described below were formed based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned paper source via PDF, into one of the categories - each corresponding to a subsequent content-specific data processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed πŸ“„ text or text structured in tabular πŸ“ format, as well as to mark the presence of photos πŸŒ„ or drawings πŸ“ˆ yet to be extracted from the page images.
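For orientation, below is a minimal sketch of what this classification step amounts to with the plain transformers API. It assumes the published checkpoint 1 πŸ”— loads through the standard auto classes; the actual project logic (config handling, batching, tabular output) lives in run.py πŸ“Ž and classifier.py πŸ“Ž and should be preferred in practice.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Sketch only: the supported interface is run.py, this just illustrates the idea
model_name = "ufal/vit-historical-page"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(model_name)

image = Image.open("/full/path/to/file.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Normalized scores for the top-3 category guesses
probs = logits.softmax(dim=-1)[0]
top = probs.topk(3)
for score, idx in zip(top.values, top.indices):
    print(model.config.id2label[idx.item()], round(score.item(), 3))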

Data πŸ“œ

Training set of the model: 8950 images

Evaluation set (10% of the whole dataset, with the same class proportions as below) model_EVAL.csv πŸ“Ž: 995 images

Manual ✍ annotation was performed beforehand and took some time βŒ›; the categories 🏷️ were formed from different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is NOT intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were one page long, while others were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant considering the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect drawing detection, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.

Categories 🏷️

Label Ratio Description
DRAW 11.89% πŸ“ˆ - drawings, maps, paintings with text
DRAW_L 8.17% πŸ“ˆπŸ“ - drawings, etc with a table legend or inside tabular layout / forms
LINE_HW 5.99% βœοΈπŸ“ - handwritten text lines inside tabular layout / forms
LINE_P 6.06% πŸ“ - printed text lines inside tabular layout / forms
LINE_T 13.39% πŸ“ - machine typed text lines inside tabular layout / forms
PHOTO 10.21% πŸŒ„ - photos with text
PHOTO_L 7.86% πŸŒ„πŸ“ - photos inside tabular layout / forms or with a tabular annotation
TEXT 8.58% πŸ“° - mixed types of printed and handwritten texts
TEXT_HW 7.36% βœοΈπŸ“„ - only handwritten text
TEXT_P 6.95% πŸ“„ - only printed text
TEXT_T 13.53% πŸ“„ - only machine typed text

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings πŸ“ˆ OR photos πŸŒ„)
  • type of text πŸ“„ (handwritten ✏️️ OR printed OR typed OR mixed πŸ“°)
  • presence of tabular layout / forms πŸ“

The reason for such a distinction is that different types of pages require different processing pipelines, which would be applied after the classification.


How to install πŸ”§

The easiest way to obtain the model is to use the HF 😊 hub repository 1 πŸ”—, which can be easily accessed via this project. Step-by-step instructions on installing this program are provided below.

Warning

Make sure you have Python version 3.10+ installed on your machine πŸ’». Then create a separate virtual environment for this project

How to πŸ‘€

Clone this project to your local machine πŸ–₯️ via:

cd /local/folder/for/this/project
git init
git clone https://github.com/ufal/atrium-page-classification.git

Follow the Unix- / Windows-specific instructions at the venv docs 3 πŸ‘€πŸ”— if you don't know how to create a virtual environment (e.g. python3 -m venv <your_venv_dir>). After creating the venv folder, activate the environment via:

source <your_venv_dir>/bin/activate

and then, inside your virtual environment, install the Python libraries (takes time βŒ›)

Note

Up to 1 GB of space is needed for the model files and checkpoints, and up to 7 GB for the Python libraries (PyTorch and its dependencies, etc.)

This can be done via:

pip install -r requirements.txt

To test that everything works and to see the flag descriptions, call the --help ❓ option:

python3 run.py -h

To pull the model from the HF 😊 hub repository directly, load the model via:

python3 run.py --hf

You should see a message about loading the model from the hub and then saving it locally. Only after you have obtained the trained model files (this takes less time βŒ› than installing the dependencies) can you run the commands provided below.

Important

Unless you already have the model files in the 'model/model_version' directory next to this file, you must use the --hf flag to download the model files from the HF 😊 repo 1 πŸ”—
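Roughly speaking, the --hf step boils down to fetching the checkpoint from the hub and saving it next to the project files. A hedged sketch of that idea is shown below; the exact implementation is in classifier.py πŸ“Ž and may differ in detail.

from transformers import AutoImageProcessor, AutoModelForImageClassification

# Sketch of the idea behind --hf: pull from the hub, then store locally so that
# later runs can load the model from 'model/model_version' without --hf
processor = AutoImageProcessor.from_pretrained("ufal/vit-historical-page")
model = AutoModelForImageClassification.from_pretrained("ufal/vit-historical-page")

processor.save_pretrained("model/model_version")
model.save_pretrained("model/model_version")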

After the model is downloaded, you should see a similar file structure:

Full project tree 🌳 files structure πŸ‘€
/local/folder/for/this/project
β”œβ”€β”€ model
    └── model_version 
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        └── preprocessor_config.json
β”œβ”€β”€ checkpoint
        β”œβ”€β”€ models--google--vit-base-patch16-224
            β”œβ”€β”€ blobs
            β”œβ”€β”€ snapshots
            └── refs
        └── .locs
            └── models--google--vit-base-patch16-224
β”œβ”€β”€ model_output
    β”œβ”€β”€ checkpoint-version
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        β”œβ”€β”€ trainer_state.json
        β”œβ”€β”€ optimizer.pt
        β”œβ”€β”€ scheduler.pt
        β”œβ”€β”€ rng_state.pth
        └── training_args.bin
    └── ...
β”œβ”€β”€ data_scripts
    β”œβ”€β”€ windows
        β”œβ”€β”€ move_single.bat
        β”œβ”€β”€ pdf2png.bat
        └── sort.bat
    └── unix
        β”œβ”€β”€ move_single.sh
        β”œβ”€β”€ pdf2png.sh
        └── sort.sh
β”œβ”€β”€ result
    β”œβ”€β”€ plots
        β”œβ”€β”€ date-time_conf_mat.png
        └── ...
    └── tables
        β”œβ”€β”€ date-time_TOP-N.csv
        β”œβ”€β”€ date-time_TOP-N_EVAL.csv
        β”œβ”€β”€ date-time_EVAL_RAW.csv
        └── ...
β”œβ”€β”€ run.py
β”œβ”€β”€ classifier.py
β”œβ”€β”€ utils.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ config.txt
β”œβ”€β”€ README.md
└── ...

Some of the folders listed above may be missing, such as model_output, which is created only after training the model.


How to run ▢️

There are two main ways to run the program:

  • Single PNG file classification πŸ“„
  • Directory with PNG files classification πŸ“

To begin with, open config.txt βš™ and change the folder path in the [INPUT] section, then optionally change top_N and batch in the [SETUP] section.
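If you prefer to inspect these settings programmatically, a small sketch using Python's standard configparser is shown below. Only the section names [INPUT] and [SETUP] and the top_N / batch variables are taken from the description above; the other key names are illustrative assumptions, so check config.txt itself for the exact names.

import configparser

config = configparser.ConfigParser()
config.read("config.txt")

# 'directory' is an assumed key name - check config.txt for the real one
input_dir = config["INPUT"].get("directory")
top_n = config["SETUP"].getint("top_N", fallback=3)
batch = config["SETUP"].getint("batch", fallback=8)
print(input_dir, top_n, batch)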

Note

Top-3 is enough to cover most of the images; setting Top-5 will help with a small number of difficult-to-classify samples.

Caution

Do not try to change base_model and other section contents unless you know what you are doing

Page processing πŸ“„

The following prediction should be run using the -f or --file flag with a path argument. Optionally, you can use the -tn or --topn flag with the number of guesses you want to get, and also the -m or --model flag with the path to the model folder as its argument.

How to πŸ‘€

Run the program from its starting point run.py πŸ“Ž with optional flags:

python3 run.py -tn 3 -f '/full/path/to/file.png' -m '/full/path/to/model/folder'

for exactly TOP-3 guesses

OR if you are sure about default variables set in the config.txt βš™:

python3 run.py -f '/full/path/to/file.png'

to run single PNG file classification - the output will be in the console.

Note

Console output and all result tables contain normalized scores for the top N highest-scoring classes 🏷️

Directory processing πŸ“

The following prediction type does not require explicitly setting the directory path with -d or --directory, since its default value is set in the config.txt βš™ file and is used when the --dir flag is given. The same flags for the number of guesses and the model folder path as for single page processing can be used. In addition, 2 directory-specific flags --inner and --raw are available.

Caution

You must either explicitly set -d flag's argument or use --dir flag (calling for the preset default value of the input directory) to process PNG files on the directory level, otherwise nothing will happen

How to πŸ‘€
python3 run.py -tn 3 -d '/full/path/to/directory' -m '/full/path/to/model/folder'

for exactly TOP-3 guesses from all images found in the given directory.

OR if you are really sure about default variables set in the config.txt βš™:

python3 run.py --dir 

The classification results of PNG pages collected from the directory will be saved πŸ’Ύ to related results πŸ“ folders defined in [OUTPUT] section of config.txt βš™ file.

Tip

To additionally get raw class 🏷️ probabilities from the model along with the TOP-N results, use --raw flag when processing the directory

Tip

To process all PNG files in the directory AND its subdirectories use the --inner flag when processing the directory
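The effect of the --inner flag can be pictured as the difference between a flat and a recursive file search. A small illustrative sketch follows; the actual lookup is implemented in the project code and may differ.

from pathlib import Path

directory = Path("/full/path/to/directory")

flat = sorted(directory.glob("*.png"))        # default: PNG files in the directory itself
recursive = sorted(directory.rglob("*.png"))  # with --inner: PNG files in all subdirectories too

print(len(flat), "top-level PNG files,", len(recursive), "including subdirectories")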


Results πŸ“Š

Accuracy performance measurements, plots of confusion matrices for the evaluation dataset, and tables with results are available in the result πŸ“ folder.

Evaluation set's accuracy (Top-3): 99.6% πŸ†

Confusion matrix πŸ“Š TOP-3 πŸ‘€

TOP-3 confusion matrix

Evaluation set's accuracy (Top-1): 97.3% πŸ†

Confusion matrix πŸ“Š TOP-1 πŸ‘€

TOP-1 confusion matrix

By running tests on the evaluation dataset after training you can generate the following output files:

  • date-time_model_TOP-N_EVAL.csv - results of the evaluation dataset with TOP-N guesses
  • date-time_conf_mat_TOP-N.png - confusion matrix plot for the evaluation dataset, also with TOP-N guesses
  • date-time_model_EVAL_RAW.csv - raw probabilities for all classes of the evaluation dataset

Note

Generated tables will be sorted by FILE and PAGE number columns in ascending order.

Result tables and their columns πŸ“πŸ“‹

General result tables πŸ‘€

Demo files:

With the following columns πŸ“‹:

  • FILE - name of the file
  • PAGE - number of the page
  • CLASS-N - label of the category 🏷️, guess TOP-N
  • SCORE-N - score of the category 🏷️, guess TOP-N

and optionally

  • TRUE - actual label of the category 🏷️
Raw result tables πŸ‘€

Demo files:

With the following columns πŸ“‹:

  • FILE - name of the file
  • PAGE - number of the page
  • <CATEGORY_LABEL> - separate columns for each of the defined classes 🏷️
  • TRUE - actual label of the category 🏷️

The reason to use the --raw flag is convenience of results review: since the table is sorted in descending order by all <CATEGORY_LABEL> columns, the most ambiguous cases are expected to be at the bottom of the table, while the cases most obvious to the model are expected to be at the top.
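A sketch of that review workflow with pandas, assuming a raw table such as result/tables/date-time_EVAL_RAW.csv from the tree above (pandas is used here purely for illustration and is not necessarily how the project builds these tables):

import pandas as pd

raw = pd.read_csv("result/tables/date-time_EVAL_RAW.csv")  # example path from the tree above

# All columns except FILE, PAGE and (optionally) TRUE hold per-category probabilities
label_columns = [c for c in raw.columns if c not in ("FILE", "PAGE", "TRUE")]

# Descending sort by all category columns: confident pages end up at the top,
# ambiguous ones sink to the bottom for manual review
review = raw.sort_values(by=label_columns, ascending=False)
print(review.tail(10))  # the 10 most ambiguous pages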


For developers πŸ› οΈ

Use this project code as a base for your own image classification tasks. Instructions on the key phases of the process are provided below.

Project files description πŸ“‹πŸ‘€
File Name Description
classifier.py Model-specific classes and related functions including predefined values for training arguments
utils.py Task-related algorithms
run.py Starting point of the program with its main function - can be edited for flags and function argument extensions
config.txt Changeable variables for the program - should be edited

Most of the changeable variables are in the config.txt βš™ file, specifically, in the [TRAIN], [HF], and [SETUP] sections.

For more detailed training process adjustments refer to the related functions in classifier.py πŸ“Ž file, where you will find some predefined values not used in the run.py πŸ“Ž file.

To train the model run:

python3 run.py --train  

To evaluate the model and create a confusion matrix plot πŸ“Š run:

python3 run.py --eval  

Important

In both cases, you must make sure that the training data directory is set correctly in the config.txt βš™ and that it contains category 🏷️ subdirectories with images inside. The names of the category 🏷️ subdirectories become the actual label names and replace the default categories 🏷️ list.
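In other words, the label set is read off the training folder layout. A minimal sketch of that idea (the path is an example, and the project's own loading code is in classifier.py πŸ“Ž):

from pathlib import Path

train_dir = Path("/full/path/to/your/folder/with/train/pages")

# Each category subdirectory name becomes a label name
labels = sorted(p.name for p in train_dir.iterdir() if p.is_dir())
print(labels)  # e.g. ['DRAW', 'DRAW_L', 'LINE_HW', ...]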

During training, the image transformations listed below were applied sequentially, each with a 50% chance (see the sketch after the list).

Image preprocessing steps πŸ‘€
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
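A sketch of how the listed transformations can be composed so that each one fires independently with a 50% chance; the exact wiring in classifier.py πŸ“Ž may differ.

import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])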

Note

No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons behind this are the specific form types present on some pages, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 resolution images.

Training hyperparameters πŸ‘€
  • eval_strategy="epoch"
  • save_strategy="epoch"
  • learning_rate=5e-5
  • per_device_train_batch_size=8
  • per_device_eval_batch_size=8
  • num_train_epochs=3
  • warmup_ratio=0.1
  • logging_steps=10
  • load_best_model_at_end=True
  • metric_for_best_model="accuracy"

Above are the default hyperparameters used in the training process that can be changed in the classifier.py πŸ“Ž file, where the model is defined and trained.
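For reference, those defaults map one-to-one onto transformers TrainingArguments. A sketch is shown below, assuming a recent transformers version (where the keyword is eval_strategy); output_dir is an example name taken from the folder tree above.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_output",  # example; checkpoints land here
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)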


Data preparation πŸ“¦

There are useful multiplatform scripts in the data_scripts πŸ“ folder for the whole process of data preparation.

Note

The .sh scripts are adapted for Unix OS and .bat scripts are adapted for Windows OS

On Windows you must also install the following software before converting PDF documents to PNG images:

  • ImageMagick 4 πŸ”— - download and install the latest version
  • Ghostscript 5 πŸ”— - download and install the latest version (32 or 64 bit) released under the AGPL license

PDF to PNG πŸ“š

The source set of PDF documents must be converted to page-specific PNG images before processing. The following steps describe the procedure for converting PDF documents to PNG images suitable for training, evaluation, and prediction alike.

Firstly, copy the PDF-to-PNG converter script to the directory with PDF documents.

How to πŸ‘€

Windows:

move \local\folder\for\this\project\data_scripts\pdf2png.bat \full\path\to\your\folder\with\pdf\files

Unix:

cp /local/folder/for/this/project/data_scripts/pdf2png.sh /full/path/to/your/folder/with/pdf/files

Now check the content and the comments in the pdf2png.sh πŸ“Ž or pdf2png.bat πŸ“Ž script, and run it. You can optionally comment out the removal of processed PDF files in the script, yet this is not recommended if you are going to launch the program several times from the same location.

How to πŸ‘€

Windows:

cd \full\path\to\your\folder\with\pdf\files
pdf2png.bat

Unix:

cd /full/path/to/your/folder/with/pdf/files
pdf2png.sh

After the program is done, you will have a directory full of document-specific subdirectories containing page-specific images with a similar structure:

Unix folder tree 🌳 structure πŸ‘€
/full/path/to/your/folder/with/pdf/files
β”œβ”€β”€ PdfFile1Name
    β”œβ”€β”€ PdfFile1Name-001.png
    β”œβ”€β”€ PdfFile1Name-002.png
    └── ...
β”œβ”€β”€ PdfFile2Name
    β”œβ”€β”€ PdfFile2Name-01.png
    β”œβ”€β”€ PDFFile2Name-02.png
    └── ...
β”œβ”€β”€ PdfFile3Name
    └── PdfFile3Name-1.png 
β”œβ”€β”€ PdfFile4Name
└── ...

Note

The page numbers are padded with zeros (on the left) to match the length of the last page number in each PDF file; this is done automatically by the pdftoppm command used on Unix. In contrast, ImageMagick's convert command used on Windows does not pad the page numbers.

Windows folder tree 🌳 structure πŸ‘€
\full\path\to\your\folder\with\pdf\files
β”œβ”€β”€ PdfFile1Name
    β”œβ”€β”€ PdfFile1Name-1.png
    β”œβ”€β”€ PdfFile1Name-2.png
    └── ...
β”œβ”€β”€ PdfFile2Name
    β”œβ”€β”€ PdfFile2Name-1.png
    β”œβ”€β”€ PDFFile2Name-2.png
    └── ...
β”œβ”€β”€ PdfFile3Name
    └── PdfFile3Name-1.png 
β”œβ”€β”€ PdfFile4Name
└── ...

Optionally, you can use the move_single.sh πŸ“Ž or move_single.bat πŸ“Ž script to move all PNG files from directories that contain only a single PNG file to a common directory of one-pagers.

How to πŸ‘€

Windows:

move \local\folder\for\this\project\data_scripts\move_single.bat \full\path\to\your\folder\with\pdf\files
cd \full\path\to\your\folder\with\pdf\files
move_single.bat

Unix:

cp /local/folder/for/this/project/data_scripts/move_single.sh /full/path/to/your/folder/with/pdf/files
cd /full/path/to/your/folder/with/pdf/files 
move_single.sh 

The reason for this movement is simply convenience in the subsequent annotation process. These changes are also accounted for in the sort.sh πŸ“Ž and sort.bat πŸ“Ž scripts used next.

PNG pages annotation πŸ”Ž

Prepare a CSV table with the following columns:

  • FILE - name of the PDF document which was the source of this page
  • PAGE - number of the page (NOT padded with 0s)
  • CLASS - label of the category 🏷️
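A minimal sketch of drafting such a table from the folder structure produced in the previous step; the document-per-folder layout and the trailing page number in the file names are assumed from the trees above, and annotation.csv is just a placeholder name. The CLASS column is left empty for manual annotation.

import csv
import re
from pathlib import Path

pages_dir = Path("/full/path/to/your/folder/with/pdf/files")

with open("annotation.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["FILE", "PAGE", "CLASS"])
    for png in sorted(pages_dir.rglob("*.png")):
        match = re.search(r"-(\d+)\.png$", png.name)  # trailing page number, padded or not
        if match:
            # int() drops any zero-padding; CLASS stays empty for manual annotation
            writer.writerow([png.parent.name, int(match.group(1)), ""])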

Tip

Prepare categories 🏷️ of equal size if possible, so that the model will not be biased towards the over-represented labels 🏷️

It takes time βŒ› to collect at least several hundred examples per category.

PNG pages sorting for training πŸ“¬

Cluster the annotated data into separate folders using the sort.sh πŸ“Ž or sort.bat πŸ“Ž script, which copies data from the source folder to the training folder, where each category 🏷️ has its own subdirectory.

How to πŸ‘€

Windows:

sort.bat

Unix:

sort.sh

Warning

It does NOT matter from which directory you launch the sorting script, but you must check the top of the script for (1) the path to the CSV table with annotations, (2) the path to the directory containing document-specific subdirectories of page-specific PNG pages, and (3) the path to the directory where you want to store the training data as label-specific directories with annotated page images.

After the program is done, you will have a directory full of label-specific subdirectories containing document-specific pages with a similar structure:

Unix folder tree 🌳 structure πŸ‘€
/full/path/to/your/folder/with/train/pages
β”œβ”€β”€ Label1
    β”œβ”€β”€ PdfFileAName-00N.png
    β”œβ”€β”€ PdfFileBName-0M.png
    └── ...
β”œβ”€β”€ Label2
β”œβ”€β”€ Label3
β”œβ”€β”€ Label4
└── ...
Windows folder tree 🌳 structure πŸ‘€
\full\path\to\your\folder\with\train\pages
β”œβ”€β”€ Label1
    β”œβ”€β”€ PdfFileAName-N.png
    β”œβ”€β”€ PdfFileBName-M.png
    └── ...
β”œβ”€β”€ Label2
β”œβ”€β”€ Label3
β”œβ”€β”€ Label4
└── ...

Before running the training, make sure to check the config.txt βš™οΈ file for the [TRAIN] section variables, where you should set a path to the data folder.

Tip

In the config.txt βš™οΈ file, tweak the max_categ parameter, which sets the maximum number of samples per category 🏷️, in case you have over-represented labels significantly dominating in size. Set max_categ higher than the number of samples in the largest category 🏷️ to use all data samples.
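For intuition only, the capping behavior that max_categ describes can be pictured as in the sketch below; whether the project keeps a random subset or simply the first N samples is an assumption here, so see classifier.py πŸ“Ž for the actual handling.

import random

def cap_per_category(samples_by_label, max_categ):
    # samples_by_label: dict mapping a label to its list of image paths
    capped = {}
    for label, paths in samples_by_label.items():
        if len(paths) > max_categ:
            capped[label] = random.sample(paths, max_categ)  # downsample dominant labels (assumed strategy)
        else:
            capped[label] = list(paths)  # smaller categories are kept whole
    return capped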


Contacts πŸ“§

For support write to: lutsai.k@gmail.com, responsible for this repository 6

Acknowledgements πŸ™

  • Developed by UFAL 7 πŸ‘₯
  • Funded by ATRIUM 8 πŸ’°
  • Shared by ATRIUM 8 & UFAL 7
  • Model type: fine-tuned ViT with 224x224 input resolution 2

©️ 2022 UFAL & ATRIUM


Appendix πŸ€“

README emoji codes πŸ‘€
  • πŸ–₯ - your computer
  • 🏷️ - label/category/class
  • πŸ“„ - page/file
  • πŸ“ - folder/directory
  • πŸ“Š - generated diagrams or plots
  • 🌳 - tree of file structure
  • βŒ› - time-consuming process
  • ✍ - manual action
  • πŸ† - performance measurement
  • 😊 - Hugging Face (HF)
  • πŸ“§ - contacts
  • πŸ‘€ - click to see
  • βš™οΈ - configuration/settings
  • πŸ“Ž - link to the internal file
  • πŸ”— - link to the external website
Content specific emoji codes πŸ‘€
  • πŸ“ - table content
  • πŸ“ˆ - drawings/paintings/diagrams
  • πŸŒ„ - photos
  • ✏️ - hand-written content
  • πŸ“„ - text content
  • πŸ“° - mixed types of text content, maybe with graphics
Decorative emojis πŸ‘€
  • πŸ“‡πŸ“œπŸ”§β–ΆπŸ› οΈπŸ“¦πŸ”ŽπŸ“šπŸ™πŸ‘₯πŸ“¬πŸ€“ - decorative purpose only

Footnotes

  1. https://huggingface.co/ufal/vit-historical-page

  2. https://huggingface.co/google/vit-base-patch16-224

  3. https://docs.python.org/3/library/venv.html

  4. https://imagemagick.org/script/download.php#windows

  5. https://www.ghostscript.com/releases/gsdnld.html

  6. https://github.com/ufal/atrium-page-classification

  7. https://ufal.mff.cuni.cz/home-page

  8. https://atrium-research.eu/
