
Image classification using fine-tuned ViT - for historical documents sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: processing of images, training and evaluation of the ViT model, input file/directory processing, output of the top-N class 🏷️ (category) predictions, summarizing of predictions into a tabular format, HF 😊 hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF-to-PNG conversion

Table of contents πŸ“‘

  • Model description πŸ“‡
  • Data πŸ“œ
  • Categories 🏷️
  • How to install πŸ”§
  • How to run ▢️
  • Results πŸ“Š
  • For developers πŸ› οΈ
  • Data preparation πŸ“¦
  • Contacts πŸ“§
  • Acknowledgements πŸ™
  • Appendix πŸ€“


Model description πŸ“‡

πŸ”² Fine-tuned model repository: UFAL's vit-historical-page 1 πŸ”—

πŸ”³ Base model repository: Google's vit-base-patch16-224 2 πŸ”—

The model was trained on a manually annotated dataset of historical documents, specifically images of pages from archival documents that were scanned from paper sources into digital form. The images contain various combinations of text πŸ“„, tables πŸ“, drawings πŸ“ˆ, and photos πŸŒ„ - the categories 🏷️ described below were formed based on those archival documents.

The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned paper source via PDF, into one of the categories - each corresponding to a subsequent content-specific data processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts) / handwritten ✏️ / plain printed πŸ“„ text or text structured in tabular πŸ“ format, as well as to mark the presence of photos πŸŒ„ or drawings πŸ“ˆ yet to be extracted from the page images.
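For orientation, below is a minimal sketch of what this classification step amounts to with the plain transformers API. It assumes the published checkpoint 1 πŸ”— loads through the standard auto classes; the actual project logic (config handling, batching, tabular output) lives in run.py πŸ“Ž and classifier.py πŸ“Ž and should be preferred in practice.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Sketch only: the supported interface is run.py, this just illustrates the idea
model_name = "ufal/vit-historical-page"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(model_name)

image = Image.open("/full/path/to/file.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Normalized scores for the top-3 category guesses
probs = logits.softmax(dim=-1)[0]
top = probs.topk(3)
for score, idx in zip(top.values, top.indices):
    print(model.config.id2label[idx.item()], round(score.item(), 3))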

Data πŸ“œ

Training set of the model: 8950 images

Evaluation set (10% of the whole dataset, with the same class proportions as below) model_EVAL.csv πŸ“Ž: 995 images

Manual ✍ annotation was performed beforehand and took some time βŒ›; the categories 🏷️ were formed from different sources of archival documents dating from 1920 to 2020. The disproportion of the categories 🏷️ is NOT intentional, but rather a result of the nature of the source data.

In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were one page long, while others were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant considering the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect drawing detection, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later illustrated with digital tools.

Categories 🏷️

Label Ratio Description
DRAW 11.89% πŸ“ˆ - drawings, maps, paintings with text
DRAW_L 8.17% πŸ“ˆπŸ“ - drawings, etc with a table legend or inside tabular layout / forms
LINE_HW 5.99% βœοΈπŸ“ - handwritten text lines inside tabular layout / forms
LINE_P 6.06% πŸ“ - printed text lines inside tabular layout / forms
LINE_T 13.39% πŸ“ - machine typed text lines inside tabular layout / forms
PHOTO 10.21% πŸŒ„ - photos with text
PHOTO_L 7.86% πŸŒ„πŸ“ - photos inside tabular layout / forms or with a tabular annotation
TEXT 8.58% πŸ“° - mixed types of printed and handwritten texts
TEXT_HW 7.36% βœοΈπŸ“„ - only handwritten text
TEXT_P 6.95% πŸ“„ - only printed text
TEXT_T 13.53% πŸ“„ - only machine typed text

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings πŸ“ˆ OR photos πŸŒ„)
  • type of text πŸ“„ (handwritten ✏️️ OR printed OR typed OR mixed πŸ“°)
  • presence of tabular layout / forms πŸ“

The reason for such a distinction is that different types of pages require different processing pipelines, which would be applied after the classification.


How to install πŸ”§

The easiest way to obtain the model is to use the HF 😊 hub repository 1 πŸ”—, which can be easily accessed via this project. Step-by-step instructions on installing this program are provided below.

Warning

Make sure you have Python version 3.10+ installed on your machine πŸ’». Then create a separate virtual environment for this project

How to πŸ‘€

Clone this project to your local machine πŸ–₯️ via:

cd /local/folder/for/this/project
git init
git clone https://github.com/ufal/atrium-page-classification.git

Follow the Unix- / Windows-specific instructions at the venv docs 3 πŸ‘€πŸ”— if you don't know how to create a virtual environment (e.g. python3 -m venv <your_venv_dir>). After creating the venv folder, activate the environment via:

source <your_venv_dir>/bin/activate

and then, inside your virtual environment, install the Python libraries (takes time βŒ›)

Note

Up to 1 GB of space is needed for the model files and checkpoints, and up to 7 GB for the Python libraries (PyTorch and its dependencies, etc.)

This can be done via:

pip install -r requirements.txt

To test that everything works and to see the flag descriptions, call the --help ❓ option:

python3 run.py -h

To pull the model from the HF 😊 hub repository directly, load the model via:

python3 run.py --hf

You should see a message about loading the model from the hub and then saving it locally. Only after you have obtained the trained model files (this takes less time βŒ› than installing the dependencies) can you run the commands provided below.

Important

Unless you already have the model files in the 'model/model_version' directory next to this file, you must use the --hf flag to download the model files from the HF 😊 repo 1 πŸ”—
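Roughly speaking, the --hf step boils down to fetching the checkpoint from the hub and saving it next to the project files. A hedged sketch of that idea is shown below; the exact implementation is in classifier.py πŸ“Ž and may differ in detail.

from transformers import AutoImageProcessor, AutoModelForImageClassification

# Sketch of the idea behind --hf: pull from the hub, then store locally so that
# later runs can load the model from 'model/model_version' without --hf
processor = AutoImageProcessor.from_pretrained("ufal/vit-historical-page")
model = AutoModelForImageClassification.from_pretrained("ufal/vit-historical-page")

processor.save_pretrained("model/model_version")
model.save_pretrained("model/model_version")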

After the model is downloaded, you should see a similar file structure:

Full project tree 🌳 files structure πŸ‘€
/local/folder/for/this/project
β”œβ”€β”€ model
    └── model_version 
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        └── preprocessor_config.json
β”œβ”€β”€ checkpoint
        β”œβ”€β”€ models--google--vit-base-patch16-224
            β”œβ”€β”€ blobs
            β”œβ”€β”€ snapshots
            └── refs
        └── .locs
            └── models--google--vit-base-patch16-224
β”œβ”€β”€ model_output
    β”œβ”€β”€ checkpoint-version
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        β”œβ”€β”€ trainer_state.json
        β”œβ”€β”€ optimizer.pt
        β”œβ”€β”€ scheduler.pt
        β”œβ”€β”€ rng_state.pth
        └── training_args.bin
    └── ...
β”œβ”€β”€ data_scripts
    β”œβ”€β”€ windows
        β”œβ”€β”€ move_single.bat
        β”œβ”€β”€ pdf2png.bat
        └── sort.bat
    └── unix
        β”œβ”€β”€ move_single.sh
        β”œβ”€β”€ pdf2png.sh
        └── sort.sh
β”œβ”€β”€ result
    β”œβ”€β”€ plots
        β”œβ”€β”€ date-time_conf_mat.png
        └── ...
    └── tables
        β”œβ”€β”€ date-time_TOP-N.csv
        β”œβ”€β”€ date-time_TOP-N_EVAL.csv
        β”œβ”€β”€ date-time_EVAL_RAW.csv
        └── ...
β”œβ”€β”€ run.py
β”œβ”€β”€ classifier.py
β”œβ”€β”€ utils.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ config.txt
β”œβ”€β”€ README.md
└── ...

Some of the folders listed above may be missing, such as model_output, which is created only after training the model.


How to run ▢️

There are two main ways to run the program:

  • Single PNG file classification πŸ“„
  • Directory with PNG files classification πŸ“

To begin with, open config.txt βš™ and change the folder path in the [INPUT] section, then optionally change top_N and batch in the [SETUP] section.
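If you prefer to inspect these settings programmatically, a small sketch using Python's standard configparser is shown below. Only the section names [INPUT] and [SETUP] and the top_N / batch variables are taken from the description above; the other key names are illustrative assumptions, so check config.txt itself for the exact names.

import configparser

config = configparser.ConfigParser()
config.read("config.txt")

# 'directory' is an assumed key name - check config.txt for the real one
input_dir = config["INPUT"].get("directory")
top_n = config["SETUP"].getint("top_N", fallback=3)
batch = config["SETUP"].getint("batch", fallback=8)
print(input_dir, top_n, batch)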

Note

Top-3 is enough to cover most of the images; setting Top-5 will help with a small number of difficult-to-classify samples.

Caution

Do not try to change base_model and other section contents unless you know what you are doing

Page processing πŸ“„

The following prediction should be run using the -f or --file flag with a path argument. Optionally, you can use the -tn or --topn flag with the number of guesses you want to get, and also the -m or --model flag with the path to the model folder as its argument.

How to πŸ‘€

Run the program from its starting point run.py πŸ“Ž with optional flags:

python3 run.py -tn 3 -f '/full/path/to/file.png' -m '/full/path/to/model/folder'

for exactly TOP-3 guesses

OR if you are sure about default variables set in the config.txt βš™:

python3 run.py -f '/full/path/to/file.png'

to run single PNG file classification - the output will be in the console.

Note

Console output and all result tables contain normalized scores for the top N highest-scoring classes 🏷️

Directory processing πŸ“

The following prediction type does not require explicitly setting the directory path with -d or --directory, since its default value is set in the config.txt βš™ file and is used when the --dir flag is given. The same flags for the number of guesses and the model folder path as for single page processing can be used. In addition, 2 directory-specific flags --inner and --raw are available.

Caution

You must either explicitly set -d flag's argument or use --dir flag (calling for the preset default value of the input directory) to process PNG files on the directory level, otherwise nothing will happen

How to πŸ‘€
python3 run.py -tn 3 -d '/full/path/to/directory' -m '/full/path/to/model/folder'

for exactly TOP-3 guesses from all images found in the given directory.

OR if you are really sure about default variables set in the config.txt βš™:

python3 run.py --dir 

The classification results of PNG pages collected from the directory will be saved πŸ’Ύ to related results πŸ“ folders defined in [OUTPUT] section of config.txt βš™ file.

Tip

To additionally get raw class 🏷️ probabilities from the model along with the TOP-N results, use --raw flag when processing the directory

Tip

To process all PNG files in the directory AND its subdirectories use the --inner flag when processing the directory
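The effect of the --inner flag can be pictured as the difference between a flat and a recursive file search. A small illustrative sketch follows; the actual lookup is implemented in the project code and may differ.

from pathlib import Path

directory = Path("/full/path/to/directory")

flat = sorted(directory.glob("*.png"))        # default: PNG files in the directory itself
recursive = sorted(directory.rglob("*.png"))  # with --inner: PNG files in all subdirectories too

print(len(flat), "top-level PNG files,", len(recursive), "including subdirectories")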


Results πŸ“Š

Accuracy performance measurements, plots of confusion matrices for the evaluation dataset, and tables with results are available in the result πŸ“ folder.

Evaluation set's accuracy (Top-3): 99.6% πŸ†

Confusion matrix πŸ“Š TOP-3 πŸ‘€

TOP-3 confusion matrix

Evaluation set's accuracy (Top-1): 97.3% πŸ†

Confusion matrix πŸ“Š TOP-1 πŸ‘€

TOP-1 confusion matrix

By running tests on the evaluation dataset after training you can generate the following output files:

  • date-time_model_TOP-N_EVAL.csv - results of the evaluation dataset with TOP-N guesses
  • date-time_conf_mat_TOP-N.png - confusion matrix plot for the evaluation dataset, also with TOP-N guesses
  • date-time_model_EVAL_RAW.csv - raw probabilities for all classes of the evaluation dataset

Note

Generated tables will be sorted by FILE and PAGE number columns in ascending order.

Result tables and their columns πŸ“πŸ“‹

General result tables πŸ‘€

Demo files:

With the following columns πŸ“‹:

  • FILE - name of the file
  • PAGE - number of the page
  • CLASS-N - label of the category 🏷️, guess TOP-N
  • SCORE-N - score of the category 🏷️, guess TOP-N

and optionally

  • TRUE - actual label of the category 🏷️
Raw result tables πŸ‘€

Demo files:

With the following columns πŸ“‹:

  • FILE - name of the file
  • PAGE - number of the page
  • <CATEGORY_LABEL> - separate columns for each of the defined classes 🏷️
  • TRUE - actual label of the category 🏷️

The reason to use the --raw flag is convenience of results review: since the table is sorted in descending order by all <CATEGORY_LABEL> columns, the most ambiguous cases are expected to be at the bottom of the table, while the cases most obvious to the model are expected to be at the top.
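A sketch of that review workflow with pandas, assuming a raw table such as result/tables/date-time_EVAL_RAW.csv from the tree above (pandas is used here purely for illustration and is not necessarily how the project builds these tables):

import pandas as pd

raw = pd.read_csv("result/tables/date-time_EVAL_RAW.csv")  # example path from the tree above

# All columns except FILE, PAGE and (optionally) TRUE hold per-category probabilities
label_columns = [c for c in raw.columns if c not in ("FILE", "PAGE", "TRUE")]

# Descending sort by all category columns: confident pages end up at the top,
# ambiguous ones sink to the bottom for manual review
review = raw.sort_values(by=label_columns, ascending=False)
print(review.tail(10))  # the 10 most ambiguous pages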


For developers πŸ› οΈ

Use this project code as a base for your own image classification tasks. Instructions on the key phases of the process are provided below.

Project files description πŸ“‹πŸ‘€
File Name Description
classifier.py Model-specific classes and related functions including predefined values for training arguments
utils.py Task-related algorithms
run.py Starting point of the program with its main function - can be edited for flags and function argument extensions
config.txt Changeable variables for the program - should be edited

Most of the changeable variables are in the config.txt βš™ file, specifically, in the [TRAIN], [HF], and [SETUP] sections.

For more detailed training process adjustments refer to the related functions in classifier.py πŸ“Ž file, where you will find some predefined values not used in the run.py πŸ“Ž file.

To train the model run:

python3 run.py --train  

To evaluate the model and create a confusion matrix plot πŸ“Š run:

python3 run.py --eval  

Important

In both cases, you must make sure that the training data directory is set correctly in the config.txt βš™ and that it contains category 🏷️ subdirectories with images inside. The names of the category 🏷️ subdirectories become the actual label names and replace the default categories 🏷️ list.
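In other words, the label set is read off the training folder layout. A minimal sketch of that idea (the path is an example, and the project's own loading code is in classifier.py πŸ“Ž):

from pathlib import Path

train_dir = Path("/full/path/to/your/folder/with/train/pages")

# Each category subdirectory name becomes a label name
labels = sorted(p.name for p in train_dir.iterdir() if p.is_dir())
print(labels)  # e.g. ['DRAW', 'DRAW_L', 'LINE_HW', ...]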

During training, the image transformations listed below were applied sequentially, each with a 50% chance (see the sketch after the list).

Image preprocessing steps πŸ‘€
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
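A sketch of how the listed transformations can be composed so that each one fires independently with a 50% chance; the exact wiring in classifier.py πŸ“Ž may differ.

import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])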

Note

No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons behind this are the specific form types present on some pages, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 resolution images.

Training hyperparameters πŸ‘€
  • eval_strategy="epoch"
  • save_strategy="epoch"
  • learning_rate=5e-5
  • per_device_train_batch_size=8
  • per_device_eval_batch_size=8
  • num_train_epochs=3
  • warmup_ratio=0.1
  • logging_steps=10
  • load_best_model_at_end=True
  • metric_for_best_model="accuracy"

Above are the default hyperparameters used in the training process that can be changed in the classifier.py πŸ“Ž file, where the model is defined and trained.
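For reference, those defaults map one-to-one onto transformers TrainingArguments. A sketch is shown below, assuming a recent transformers version (where the keyword is eval_strategy); output_dir is an example name taken from the folder tree above.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_output",  # example; checkpoints land here
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)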


Data preparation πŸ“¦

There are useful multiplatform scripts in the data_scripts πŸ“ folder for the whole process of data preparation.

Note

The .sh scripts are adapted for Unix OS and .bat scripts are adapted for Windows OS

On Windows you must also install the following software before converting PDF documents to PNG images:

  • ImageMagick 4 πŸ”— - download and install the latest version
  • Ghostscript 5 πŸ”— - download and install the latest version (32 or 64 bit) released under the AGPL license

PDF to PNG πŸ“š

The source set of PDF documents must be converted to page-specific PNG images before processing. The following steps describe the procedure for converting PDF documents to PNG images suitable for training, evaluation, and prediction alike.

Firstly, copy the PDF-to-PNG converter script to the directory with PDF documents.

How to πŸ‘€

Windows:

move \local\folder\for\this\project\data_scripts\pdf2png.bat \full\path\to\your\folder\with\pdf\files

Unix:

cp /local/folder/for/this/project/data_scripts/pdf2png.sh /full/path/to/your/folder/with/pdf/files

Now check the content and the comments in the pdf2png.sh πŸ“Ž or pdf2png.bat πŸ“Ž script, and run it. You can optionally comment out the removal of processed PDF files in the script, yet this is not recommended if you are going to launch the program several times from the same location.

How to πŸ‘€

Windows:

cd \full\path\to\your\folder\with\pdf\files
pdf2png.bat

Unix:

cd /full/path/to/your/folder/with/pdf/files
pdf2png.sh

After the program is done, you will have a directory full of document-specific subdirectories containing page-specific images with a similar structure:

Unix folder tree 🌳 structure πŸ‘€
/full/path/to/your/folder/with/pdf/files
β”œβ”€β”€ PdfFile1Name
    β”œβ”€β”€ PdfFile1Name-001.png
    β”œβ”€β”€ PdfFile1Name-002.png
    └── ...
β”œβ”€β”€ PdfFile2Name
    β”œβ”€β”€ PdfFile2Name-01.png
    β”œβ”€β”€ PDFFile2Name-02.png
    └── ...
β”œβ”€β”€ PdfFile3Name
    └── PdfFile3Name-1.png 
β”œβ”€β”€ PdfFile4Name
└── ...

Note

The page numbers are padded with zeros (on the left) to match the length of the last page number in each PDF file; this is done automatically by the pdftoppm command used on Unix. In contrast, ImageMagick's convert command used on Windows does not pad the page numbers.

Windows folder tree 🌳 structure πŸ‘€
\full\path\to\your\folder\with\pdf\files
β”œβ”€β”€ PdfFile1Name
    β”œβ”€β”€ PdfFile1Name-1.png
    β”œβ”€β”€ PdfFile1Name-2.png
    └── ...
β”œβ”€β”€ PdfFile2Name
    β”œβ”€β”€ PdfFile2Name-1.png
    β”œβ”€β”€ PDFFile2Name-2.png
    └── ...
β”œβ”€β”€ PdfFile3Name
    └── PdfFile3Name-1.png 
β”œβ”€β”€ PdfFile4Name
└── ...

Optionally, you can use the move_single.sh πŸ“Ž or move_single.bat πŸ“Ž script to move all PNG files from directories that contain only a single PNG file to a common directory of one-pagers.

How to πŸ‘€

Windows:

move \local\folder\for\this\project\data_scripts\move_single.bat \full\path\to\your\folder\with\pdf\files
cd \full\path\to\your\folder\with\pdf\files
move_single.bat

Unix:

cp /local/folder/for/this/project/data_scripts/move_single.sh /full/path/to/your/folder/with/pdf/files
cd /full/path/to/your/folder/with/pdf/files 
move_single.sh 

The reason for this movement is simply convenience in the subsequent annotation process. These changes are also accounted for in the sort.sh πŸ“Ž and sort.bat πŸ“Ž scripts used next.

PNG pages annotation πŸ”Ž

Prepare a CSV table with the following columns:

  • FILE - name of the PDF document which was the source of this page
  • PAGE - number of the page (NOT padded with 0s)
  • CLASS - label of the category 🏷️
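A minimal sketch of drafting such a table from the folder structure produced in the previous step; the document-per-folder layout and the trailing page number in the file names are assumed from the trees above, and annotation.csv is just a placeholder name. The CLASS column is left empty for manual annotation.

import csv
import re
from pathlib import Path

pages_dir = Path("/full/path/to/your/folder/with/pdf/files")

with open("annotation.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["FILE", "PAGE", "CLASS"])
    for png in sorted(pages_dir.rglob("*.png")):
        match = re.search(r"-(\d+)\.png$", png.name)  # trailing page number, padded or not
        if match:
            # int() drops any zero-padding; CLASS stays empty for manual annotation
            writer.writerow([png.parent.name, int(match.group(1)), ""])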

Tip

Prepare categories 🏷️ of equal size if possible, so that the model will not be biased towards the over-represented labels 🏷️

It takes time βŒ› to collect at least several hundred examples per category.

PNG pages sorting for training πŸ“¬

Cluster the annotated data into separate folders using the sort.sh πŸ“Ž or sort.bat πŸ“Ž script, which copies data from the source folder to the training folder, where each category 🏷️ has its own subdirectory.

How to πŸ‘€

Windows:

sort.bat

Unix:

sort.sh

Warning

It does NOT matter from which directory you launch the sorting script, but you must check the top of the script for (1) the path to the CSV table with annotations, (2) the path to the directory containing document-specific subdirectories of page-specific PNG pages, and (3) the path to the directory where you want to store the training data as label-specific directories with annotated page images.

After the program is done, you will have a directory full of label-specific subdirectories containing document-specific pages with a similar structure:

Unix folder tree 🌳 structure πŸ‘€
/full/path/to/your/folder/with/train/pages
β”œβ”€β”€ Label1
    β”œβ”€β”€ PdfFileAName-00N.png
    β”œβ”€β”€ PdfFileBName-0M.png
    └── ...
β”œβ”€β”€ Label2
β”œβ”€β”€ Label3
β”œβ”€β”€ Label4
└── ...
Windows folder tree 🌳 structure πŸ‘€
\full\path\to\your\folder\with\train\pages
β”œβ”€β”€ Label1
    β”œβ”€β”€ PdfFileAName-N.png
    β”œβ”€β”€ PdfFileBName-M.png
    └── ...
β”œβ”€β”€ Label2
β”œβ”€β”€ Label3
β”œβ”€β”€ Label4
└── ...

Before running the training, make sure to check the config.txt βš™οΈ file for the [TRAIN] section variables, where you should set a path to the data folder.

Tip

In the config.txt βš™οΈ file, tweak the max_categ parameter, which sets the maximum number of samples per category 🏷️, in case you have over-represented labels significantly dominating in size. Set max_categ higher than the number of samples in the largest category 🏷️ to use all data samples.
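For intuition only, the capping behavior that max_categ describes can be pictured as in the sketch below; whether the project keeps a random subset or simply the first N samples is an assumption here, so see classifier.py πŸ“Ž for the actual handling.

import random

def cap_per_category(samples_by_label, max_categ):
    # samples_by_label: dict mapping a label to its list of image paths
    capped = {}
    for label, paths in samples_by_label.items():
        if len(paths) > max_categ:
            capped[label] = random.sample(paths, max_categ)  # downsample dominant labels (assumed strategy)
        else:
            capped[label] = list(paths)  # smaller categories are kept whole
    return capped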


Contacts πŸ“§

For support write to: lutsai.k@gmail.com, responsible for this repository 6

Acknowledgements πŸ™

  • Developed by UFAL 7 πŸ‘₯
  • Funded by ATRIUM 8 πŸ’°
  • Shared by ATRIUM 8 & UFAL 7
  • Model type: fine-tuned ViT with 224x224 input resolution 2

©️ 2022 UFAL & ATRIUM


Appendix πŸ€“

README emoji codes πŸ‘€
  • πŸ–₯ - your computer
  • 🏷️ - label/category/class
  • πŸ“„ - page/file
  • πŸ“ - folder/directory
  • πŸ“Š - generated diagrams or plots
  • 🌳 - tree of file structure
  • βŒ› - time-consuming process
  • ✍ - manual action
  • πŸ† - performance measurement
  • 😊 - Hugging Face (HF)
  • πŸ“§ - contacts
  • πŸ‘€ - click to see
  • βš™οΈ - configuration/settings
  • πŸ“Ž - link to the internal file
  • πŸ”— - link to the external website
Content specific emoji codes πŸ‘€
  • πŸ“ - table content
  • πŸ“ˆ - drawings/paintings/diagrams
  • πŸŒ„ - photos
  • ✏️ - hand-written content
  • πŸ“„ - text content
  • πŸ“° - mixed types of text content, maybe with graphics
Decorative emojis πŸ‘€
  • πŸ“‡πŸ“œπŸ”§β–ΆπŸ› οΈπŸ“¦πŸ”ŽπŸ“šπŸ™πŸ‘₯πŸ“¬πŸ€“ - decorative purpose only

Footnotes

  1. https://huggingface.co/ufal/vit-historical-page

  2. https://huggingface.co/google/vit-base-patch16-224

  3. https://docs.python.org/3/library/venv.html

  4. https://imagemagick.org/script/download.php#windows

  5. https://www.ghostscript.com/releases/gsdnld.html

  6. https://github.com/ufal/atrium-page-classification

  7. https://ufal.mff.cuni.cz/home-page

  8. https://atrium-research.eu/
