| Name | Image | Text Detection | Layout | Reading Order |
|------|-------|----------------|--------|---------------|
| Scientific Paper |![Image](static/images/paper.jpg)|![Image](static/images/paper_text.jpg)|![Image](static/images/paper_layout.jpg)|![Image](static/images/paper_reading.jpg)|
| New York Times |![Image](static/images/nyt.jpg)|![Image](static/images/nyt_text.jpg)|![Image](static/images/nyt_layout.jpg)|![Image](static/images/nyt_order.jpg)|
| Scanned Form |![Image](static/images/funsd.png)|![Image](static/images/funsd_text.jpg)|![Image](static/images/funsd_layout.jpg)|![Image](static/images/funsd_reading.jpg)|
Pass the `--math` command line argument to use the math text detection model instead of the default model. This will detect math better, but will be worse at everything else.
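For example, a sketch of one invocation, assuming the flag is passed to the detection command (`surya_detect` here mirrors the detection section of this README):

```shell
# Use the math-specialized detection model instead of the default one
surya_detect DATA_PATH --images --math
```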
## OCR (text recognition)
You can OCR text in an image, pdf, or folder of images/pdfs with the following command. This command will write out a json file with the detected text and bboxes:
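A sketch of the invocation (the `surya_ocr` entry point and the `--langs`/`--images` flags here are assumptions based on surya's CLI conventions):

```shell
# DATA_PATH can be an image, pdf, or folder of images/pdfs
surya_ocr DATA_PATH --images --langs en
```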
## Text line detection

You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This command will write out a json file with the detected bboxes:
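A sketch of the invocation (assuming `surya_detect` is the detection entry point):

```shell
surya_detect DATA_PATH --images
```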
## Layout analysis

You can detect the layout of an image, pdf, or folder of images/pdfs with the following command. This command will write out a json file with the detected layout:
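A sketch of the invocation (assuming `surya_layout` is the entry point, mirroring the ordering command below):

```shell
surya_layout DATA_PATH --images
```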
## Reading order

You can detect the reading order of an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected reading order and layout.
```shell
surya_order DATA_PATH --images
```

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
- `bboxes` - detected bounding boxes for text
  - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - `position` - the position in the reading order of the bbox, starting from 0.
  - `label` - the label for the bbox. See the layout section of the documentation for a list of potential labels.
- `page` - the page number in the file
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
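A hypothetical snippet for consuming that file, using the fields described above (the results path is illustrative):

```python
import json

# Path is illustrative - use --results_dir to control where results are saved
with open("path/to/results.json") as f:
    results = json.load(f)

# Keys are input filenames without extensions; values are lists of page dicts
pages = next(iter(results.values()))
for box in sorted(pages[0]["bboxes"], key=lambda b: b["position"]):
    print(box["position"], box["label"], box["bbox"])
```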
**Performance tips**
Setting the `ORDER_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `360MB` of VRAM, so very high batch sizes are possible. The default batch size is `32`, which will use about 11GB of VRAM. Depending on your CPU core count, a larger batch size might help on CPU, too - the default CPU batch size is `4`.
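For example, a sketch of overriding the batch size for a single run (env var usage as described above, applied to the ordering command from this README):

```shell
# Roughly 64 * 360MB ≈ 23GB of VRAM
ORDER_BATCH_SIZE=64 surya_order DATA_PATH --images
```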
### From python
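A minimal sketch, assuming the ordering module mirrors surya's other python APIs (the import paths and the `batch_ordering` signature here are assumptions, not a confirmed API):

```python
from PIL import Image

# Assumed module layout, mirroring the other surya components
from surya.ordering import batch_ordering
from surya.model.ordering.model import load_model
from surya.model.ordering.processor import load_processor

image = Image.open("page.png")
# Layout boxes in [x1, y1, x2, y2] format, e.g. taken from the layout model
bboxes = [[0, 0, 500, 100], [0, 100, 500, 200]]

model = load_model()
processor = load_processor()

# Returns one ordering result per input image
order_predictions = batch_ordering([image], [bboxes], model, processor)
```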
## Layout analysis

I benchmarked the layout analysis on [Publaynet](https://github.com/ibm-aur-nlp/PubLayNet), which was not in the training data.
- Precision - how well the predicted bboxes cover ground truth bboxes
- Recall - how well ground truth bboxes cover predicted bboxes
## Reading Order
75% mean accuracy, and 0.14 seconds per image on an A6000 GPU. See the methodology notes below - this benchmark is not a perfect measure of accuracy, and is more useful as a sanity check.
**Methodology**
I benchmarked the reading order on the layout dataset from [here](https://www.icst.pku.edu.cn/cpdp/sjzy/), which was not in the training data. Unfortunately, this dataset is fairly noisy, and not all the labels are correct. It was very hard to find a dataset annotated with both reading order and layout information. I wanted to avoid using a cloud service for the ground truth.
The accuracy is computed by checking whether each pair of layout boxes is in the correct relative order, then taking the % of pairs that are correct.
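A hypothetical sketch of that metric (not the repo's benchmark code), where each list holds the predicted or true reading-order position of each box:

```python
from itertools import combinations

def pairwise_order_accuracy(pred_positions, true_positions):
    """Fraction of box pairs whose relative order matches the ground truth."""
    pairs = list(combinations(range(len(pred_positions)), 2))
    correct = sum(
        (pred_positions[i] < pred_positions[j]) == (true_positions[i] < true_positions[j])
        for i, j in pairs
    )
    return correct / len(pairs)

# Three boxes where one pair is swapped -> 2 of 3 pairs correct
print(pairwise_order_accuracy([0, 2, 1], [0, 1, 2]))  # 0.666...
```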
## Running your own benchmarks
You can benchmark the performance of surya on your machine.
**Layout**

```shell
python benchmark/layout.py
```

- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one
**Reading Order**
```shell
python benchmark/ordering.py
```
- `--max` controls how many images to process for the benchmark
- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one
# Training
Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.
Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model.
# Commercial usage
All models were trained from scratch, so they're okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.
If you want to remove the GPL license requirements for inference or use the weights commercially over the revenue limit, please contact me at surya@vikas.sh for dual licensing.
# Thanks

This work would not have been possible without amazing open source AI work:
- [transformers](https://github.com/huggingface/transformers) from huggingface
- [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model
Thank you to everyone who makes open source AI possible.