Home
Scope: processing of images, training and evaluation of the ViT model, input file/directory processing, category (class) results output for the top-N predictions, summarizing of predictions into a tabular format, Hugging Face (HF) hub support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF to PNG conversion.
Fine-tuned model repository: UFAL's vit-historical-page ^1
Base model repository: Google's vit-base-patch16-224 ^2
The model was trained on a manually annotated dataset of historical documents, specifically images of pages from archival documents scanned from paper sources into digital form. The images contain various combinations of text, tables, drawings, and photos; the categories described below were formed based on those archival documents.
The key use case of the provided model and data processing pipeline is to classify an input PNG image (obtained from a scanned PDF page) into one of the categories, each of which corresponds to a content-specific downstream processing pipeline. In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten, plain printed, or tabular-structured text, and to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
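Since the fine-tuned model is published on the Hugging Face hub, the classification step can be sketched with the transformers image-classification pipeline. This is a minimal illustration rather than the repository's own inference script; the model id `ufal/vit-historical-page` and the input file name are assumptions.

```python
from transformers import pipeline

# Assumed hub id based on the repository name; adjust to the actual model id.
classifier = pipeline("image-classification", model="ufal/vit-historical-page")

# Top-3 category guesses for a single PNG page image (file name is a placeholder).
predictions = classifier("page_0001.png", top_k=3)
for pred in predictions:
    print(f"{pred['label']}: {pred['score']:.3f}")
```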
Training set of the model: 8950 images
Evaluation set (10% of the whole dataset, with the same category proportions as shown below) model_EVAL.csv: 995 images
Manual annotation was performed beforehand and took considerable time; the categories were formed from different sources of archival documents dating from 1920 to 2020. The disproportion of the categories is NOT intentional, but rather a result of the nature of the source data.
In total, several hundred separate PDF files were selected and split into PNG pages; some scanned documents were a single page long, while others were much longer (dozens or hundreds of pages). The specific content and language of the source data are irrelevant given the model's vision resolution; however, all of the data samples come from archaeological reports, which may somewhat affect drawing detection, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks, first drawn by hand and later redrawn with digital tools.
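For illustration, the PDF-to-PNG preparation step can be sketched with the pdf2image library; the library choice, DPI, and file paths are assumptions and not the repository's own Win/Lin preparation scripts.

```python
from pathlib import Path

from pdf2image import convert_from_path  # requires poppler to be installed

# Assumed input/output paths; adjust to your data layout.
pdf_path = Path("report.pdf")
out_dir = Path("pages")
out_dir.mkdir(exist_ok=True)

# Render each PDF page and save it as a separate PNG file.
for i, page in enumerate(convert_from_path(pdf_path, dpi=300), start=1):
    page.save(out_dir / f"{pdf_path.stem}_{i:04d}.png", "PNG")
```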
Label | Ratio | Description |
---|---|---|
DRAW | 11.89% | drawings, maps, paintings with text |
DRAW_L | 8.17% | drawings, etc., with a table legend or inside tabular layout / forms |
LINE_HW | 5.99% | handwritten text lines inside tabular layout / forms |
LINE_P | 6.06% | printed text lines inside tabular layout / forms |
LINE_T | 13.39% | machine-typed text lines inside tabular layout / forms |
PHOTO | 10.21% | photos with text |
PHOTO_L | 7.86% | photos inside tabular layout / forms or with a tabular annotation |
TEXT | 8.58% | mixed types of printed and handwritten texts |
TEXT_HW | 7.36% | only handwritten text |
TEXT_P | 6.95% | only printed text |
TEXT_T | 13.53% | only machine-typed text |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings OR photos)
- type of text (handwritten OR printed OR typed OR mixed)
- presence of tabular layout / forms
The reason for this distinction is that different processing pipelines are applied to different types of pages after classification.
During training, the image transformations listed below were applied sequentially, each with a 50% chance (a sketch of the resulting pipeline follows the list).
Image preprocessing steps
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
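A minimal sketch of how this augmentation chain could be assembled with torchvision; wrapping each step in RandomApply with p=0.5 is an assumption based on the 50%-chance description above, not the exact training code.

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Each augmentation is wrapped in RandomApply so it fires with a 50% chance,
# matching the "applied sequentially with a 50% chance" description above.
train_transforms = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```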
Note
No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons are the specific form types present on some pages, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 resolution images.
Training hyperparameters
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
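For reference, the hyperparameters listed above map onto the Hugging Face TrainingArguments roughly as follows; the output directory is a placeholder and the surrounding Trainer setup is omitted.

```python
from transformers import TrainingArguments

# Hyperparameters from the list above expressed as TrainingArguments.
# Note: "eval_strategy" is the newer argument name; older transformers
# releases call it "evaluation_strategy".
training_args = TrainingArguments(
    output_dir="./vit-historical-page",  # placeholder output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```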
Evaluation set's accuracy (Top-3): 99.6%
Evaluation set's accuracy (Top-1): 97.3%
- Manually checked evaluation dataset results (TOP-3): model_TOP-3_EVAL.csv
- Manually checked evaluation dataset results (TOP-1): model_TOP-1_EVAL.csv
Each results table contains the following columns (a sketch of recomputing the accuracies from such a table follows the list):
- FILE - name of the file
- PAGE - page number
- CLASS-N - predicted category label of the TOP-N guess
- SCORE-N - confidence score of the TOP-N guess
- TRUE - actual category label
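As an illustration, the Top-1 and Top-3 accuracies can be recomputed from such a table with a few lines of pandas; the file name and the exact column set (CLASS-1 to CLASS-3) are assumptions based on the column descriptions above.

```python
import pandas as pd

# Load the manually checked TOP-3 results table (file name is an assumption).
df = pd.read_csv("model_TOP-3_EVAL.csv")

# Top-1: the best guess matches the true label.
top1 = (df["TRUE"] == df["CLASS-1"]).mean()

# Top-3: the true label appears among the three best guesses.
top3 = df.apply(
    lambda row: row["TRUE"] in (row["CLASS-1"], row["CLASS-2"], row["CLASS-3"]),
    axis=1,
).mean()

print(f"Top-1 accuracy: {top1:.3f}")
print(f"Top-3 accuracy: {top3:.3f}")
```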
For support, write to lutsai.k@gmail.com
Official repository: UFAL ^3
- Developed by UFAL ^5
- Funded by ATRIUM ^4
- Shared by ATRIUM ^4 & UFAL ^5
- Model type: fine-tuned ViT ^2 with a 224x224 input resolution
© 2022 UFAL & ATRIUM