Home
Scope: processing of images, training and evaluation of the ViT model, input file/directory processing, category (class) results output for the top-N predictions, summarizing of predictions into a tabular format, Hugging Face (HF) hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF to PNG conversion.
Fine-tuned model repository: UFAL's vit-historical-page ^1
Base model repository: Google's vit-base-patch16-224 ^2
The model was trained on a manually annotated dataset of historical documents, specifically images of pages from archival documents scanned from paper sources into digital form. The images contain various combinations of text, tables, drawings, and photos; the categories described below were formed based on those archival documents.
The key use case of the provided model and data processing pipeline is to classify an input PNG image (obtained from a scanned PDF page) into one of the categories, each of which corresponds to a content-specific downstream processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten, plain printed, or tabular-structured text, and to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
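Since the fine-tuned model is published on the Hugging Face hub, the classification step can be sketched with the transformers image-classification pipeline. This is a minimal illustration rather than the repository's own inference script; the model id `ufal/vit-historical-page` and the input file name are assumptions.

```python
from transformers import pipeline

# Assumed hub id based on the repository name; adjust to the actual model id.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")

# Top-3 category guesses for a single PNG page image (file name is a placeholder).
predictions = classifier("page_0001.png", top_k=3)
for pred in predictions:
    print(f"{pred['label']}: {pred['score']:.3f}")
```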
Training set of the model: 8950 images
Evaluation set (10% of the whole dataset, with the same category proportions as shown below) model_EVAL.csv: 995 images
Manual annotation was performed beforehand and took considerable time; the categories were formed from different sources of archival documents dating from 1920 to 2020. The disproportion of the categories is NOT intentional, but rather a result of the nature of the source data.
In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were a single page long, while others were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant given the model's vision resolution; however, all of the data samples come from archaeological reports, which may somewhat affect drawing detection, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later redrawn with digital tools.
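For illustration, the PDF-to-PNG preparation step can be sketched with the pdf2image library; the library choice, DPI, and file paths are assumptions and not the repository's own Win/Lin preparation scripts.

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires poppler to be installed

# Assumed input/output paths; adjust to your data layout.
pdf_path = Path("report.pdf")
out_dir = Path("pages")
out_dir.mkdir(exist_ok=True)

# Render each PDF page and save it as a separate PNG file.
for i, page in enumerate(convert_from_path(pdf_path, dpi=300), start=1):
    page.save(out_dir / f"{pdf_path.stem}_{i:04d}.png", "PNG")
```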
Label | Ratio | Description |
---|---|---|
DRAW | 11.89% | drawings, maps, paintings with text |
DRAW_L | 8.17% | drawings, etc., with a table legend or inside tabular layout / forms |
LINE_HW | 5.99% | handwritten text lines inside tabular layout / forms |
LINE_P | 6.06% | printed text lines inside tabular layout / forms |
LINE_T | 13.39% | machine-typed text lines inside tabular layout / forms |
PHOTO | 10.21% | photos with text |
PHOTO_L | 7.86% | photos inside tabular layout / forms or with a tabular annotation |
TEXT | 8.58% | mixed types of printed and handwritten texts |
TEXT_HW | 7.36% | only handwritten text |
TEXT_P | 6.95% | only printed text |
TEXT_T | 13.53% | only machine-typed text |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings OR photos)
- type of text (handwritten OR printed OR typed OR mixed)
- presence of tabular layout / forms
The reason for this distinction is that different processing pipelines are applied to different types of pages after classification.
During training, the image transformations listed below were applied sequentially, each with a 50% chance (a sketch of the resulting pipeline follows the list).
Image preprocessing steps
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
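A minimal sketch of how this augmentation chain could be assembled with torchvision; wrapping each step in RandomApply with p=0.5 is an assumption based on the 50%-chance description above, not the exact training code.

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation is wrapped in RandomApply so it fires with a 50% chance,
# matching the "applied sequentially with a 50% chance" description above.
train_transforms = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```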
Note
No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons are the specific form types present on some pages, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 resolution images.
Training hyperparameters
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
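For reference, the hyperparameters listed above map onto the Hugging Face TrainingArguments roughly as follows; the output directory is a placeholder and the surrounding Trainer setup is omitted.

```python
from transformers import TrainingArguments

# Hyperparameters from the list above expressed as TrainingArguments.
# Note: "eval_strategy" is the newer argument name; older transformers
# releases call it "evaluation_strategy".
training_args = TrainingArguments(
    output_dir="./vit-historical-page",  # placeholder output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```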
Evaluation set's accuracy (Top-3): 99.6%
Evaluation set's accuracy (Top-1): 97.3%
- Manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- Manually checked evaluation dataset results (TOP-1): model_TOP-1_EVAL.csv
Each results table contains the following columns (a sketch of recomputing the accuracies from such a table follows the list):
- FILE - name of the file
- PAGE - page number
- CLASS-N - predicted category label of the TOP-N guess
- SCORE-N - confidence score of the TOP-N guess
- TRUE - actual category label
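As an illustration, the Top-1 and Top-3 accuracies can be recomputed from such a table with a few lines of pandas; the file name and the exact column set (CLASS-1 to CLASS-3) are assumptions based on the column descriptions above.

```python
import pandas as pd

# Load the manually checked TOP-3 results table (file name is an assumption).
df = pd.read_csv("model_TOP-3_EVAL.csv")

# Top-1: the best guess matches the true label.
top1 = (df["TRUE"] == df["CLASS-1"]).mean()

# Top-3: the true label appears among the three best guesses.
top3 = df.apply(
    lambda row: row["TRUE"] in (row["CLASS-1"], row["CLASS-2"], row["CLASS-3"]),
    axis=1,
).mean()

print(f"Top-1 accuracy: {top1:.3f}")
print(f"Top-3 accuracy: {top3:.3f}")
```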
For support, write to lutsai.k@gmail.com
Official repository: UFAL ^3
- Developed by UFAL ^5
- Funded by ATRIUM ^4
- Shared by ATRIUM ^4 & UFAL ^5
- Model type: fine-tuned ViT ^2 with a 224x224 input resolution
© 2022 UFAL & ATRIUM