
Commit aade0fc

Update images

1 parent 272619a commit aade0fc

14 files changed: +23 -10 lines

README.md (+15 -8)

@@ -1,15 +1,13 @@
 # Surya
 
-Surya is for multilingual document OCR. It can do:
+Surya is a document OCR toolkit that does:
 
 - Accurate OCR in 90+ languages
 - Line-level text detection in any language
 - Table and chart detection (coming soon)
 
 It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
 
-Detection and OCR example:
-
 | Detection | OCR |
 |:----------------------------------------------------------------:|:-----------------------------------------------------------------------:|
 | ![New York Times Article Detection](static/images/excerpt.png) | ![New York Times Article Recognition](static/images/excerpt_text.png) |
@@ -29,10 +27,11 @@ Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who
 | Chinese | [Image](static/images/chinese.jpg) | [Image](static/images/chinese_text.jpg) |
 | Hindi | [Image](static/images/hindi.jpg) | [Image](static/images/hindi_text.jpg) |
 | Arabic | [Image](static/images/arabic.jpg) | [Image](static/images/arabic_text.jpg) |
+| Chinese + Hindi | [Image](static/images/chi_hind.jpg) | [Image](static/images/chi_hind_text.jpg) |
 | Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.jpg) |
 | Scientific Paper | [Image](static/images/paper.jpg) | [Image](static/images/paper_text.jpg) |
 | Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.jpg) |
-| New York Times | [Image](static/images/nyt.png) | [Image](static/images/nyt_text.png) |
+| New York Times | [Image](static/images/nyt.jpg) | [Image](static/images/nyt_text.jpg) |
 | Scanned Form | [Image](static/images/funsd.png) | [Image](static/images/funsd_text.jpg) |
 | Textbook | [Image](static/images/textbook.jpg) | [Image](static/images/textbook_text.jpg) |
 
@@ -64,7 +63,7 @@ surya_gui
 
 ## OCR (text recognition)
 
-You can detect text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
+You can OCR text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
 
 ```
 surya_ocr DATA_PATH --images --langs hi,en
@@ -89,7 +88,7 @@ The `results.json` file will contain these keys for each page of the input docum
 
 **Performance tips**
 
-Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default is a batch size `256`, which will use about 10GB of VRAM. Depending on your CPU core count, it may help, too - the default CPU batch size is `32`.
+Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `50MB` of VRAM, so very high batch sizes are possible. The default is a batch size `256`, which will use about 12.8GB of VRAM. Depending on your CPU core count, it may help, too - the default CPU batch size is `32`.
 
 ### From python
 
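As a side note on the batch-size tip in the hunk above, here is a minimal sketch of setting that env var from Python. The variable name comes from the README text; reading it from the environment before any surya import is an assumption.

```python
import os

# Assumption: surya picks up RECOGNITION_BATCH_SIZE when its settings load,
# so export it before importing anything from surya.
os.environ["RECOGNITION_BATCH_SIZE"] = "512"  # ~50MB of VRAM per item -> roughly 25GB total
```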
@@ -158,7 +157,8 @@ predictions = batch_detection([image], model, processor)
 
 If OCR isn't working properly:
 
-- If the lines aren't detected properly, try increasing resolution of the image if the width is below `896px`, and vice versa. Very high width images don't work well with the detector.
+- Try increasing resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than a `2048px` width.
+- Preprocessing the image (binarizing, deskewing, etc) can help with very old/blurry images.
 - You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines - any prediction below this number will be considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined - any number above this is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).
 
 
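A similar sketch for the detector thresholds in the troubleshooting list above. Treating them as environment variables, and the specific values, are assumptions; the constraint that the text threshold stays above the blank threshold (both in the 0-1 range) is from the README text.

```python
import os

# Illustrative values only, not recommendations.
os.environ["DETECTOR_BLANK_THRESHOLD"] = "0.35"  # predictions below this count as blank space
os.environ["DETECTOR_TEXT_THRESHOLD"] = "0.6"    # predictions above this count as text; must exceed the blank threshold
```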
@@ -175,7 +175,14 @@ If you want to develop surya, you can install it manually:
 
 ## OCR
 
-Coming soon.
+
+Tesseract is CPU-based, and surya is CPU or GPU. I tried to cost-match the resources used, so I used a 1xA6000 (48GB VRAM) for surya, and 28 CPU cores for Tesseract (same price on Lambda Labs/DigitalOcean).
+
+**Methodology**
+
+I measured normalized edit distance (0-1, lower is better) based on a set of real-world and synthetic pdfs. I sampled PDFs from common crawl, then filtered out the ones with bad OCR. I couldn't find PDFs for some languages, so I also generated simple synthetic PDFs for those.
+
+I used the reference line bboxes from the PDFs with both tesseract and surya, to just evaluate the OCR quality.
 
 ## Text line detection
 
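For readers of the new benchmark section, a self-contained sketch of the normalized edit distance metric it describes (plain Levenshtein distance scaled by the longer string's length); this is not the repo's benchmark code.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between two strings, scaled to 0-1 (lower is better)."""
    if not pred and not ref:
        return 0.0
    # Rolling-row dynamic programming over edit operations.
    prev = list(range(len(ref) + 1))
    for i, pc in enumerate(pred, start=1):
        curr = [i]
        for j, rc in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (pc != rc),  # substitution (free if the characters match)
            ))
        prev = curr
    return prev[-1] / max(len(pred), len(ref))


print(normalized_edit_distance("kitten", "kitten"))   # 0.0
print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ~= 0.43
```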
benchmark/recognition.py (+6)

@@ -24,6 +24,7 @@ def main():
     parser.add_argument("--max", type=int, help="Maximum number of pdf pages to OCR.", default=None)
     parser.add_argument("--debug", type=int, help="Debug level - 1 dumps bad detection info, 2 writes out images.", default=0)
     parser.add_argument("--tesseract", action="store_true", help="Run tesseract instead of surya.", default=False)
+    parser.add_argument("--langs", type=str, help="Specify certain languages to benchmark.", default=None)
     args = parser.parse_args()
 
     rec_model = load_recognition_model()
@@ -34,6 +35,11 @@ def main():
         split = f"train[:{args.max}]"
 
     dataset = datasets.load_dataset(settings.RECOGNITION_BENCH_DATASET_NAME, split=split)
+
+    if args.langs:
+        langs = args.langs.split(",")
+        dataset = dataset.filter(lambda x: x["language"] in langs)
+
     images = list(dataset["image"])
     images = [i.convert("RGB") for i in images]
     bboxes = list(dataset["bboxes"])
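With the new `--langs` flag, the recognition benchmark can presumably be limited to specific languages, e.g. `python benchmark/recognition.py --langs hi,en` (assuming the script takes no other required arguments).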

ocr_text.py (+1 -1)

@@ -25,7 +25,7 @@ def main():
     parser.add_argument("--lang_file", type=str, help="Path to file with languages to use for OCR. Should be a JSON dict with file names as keys, and the value being a list of language codes/names.", default=None)
     args = parser.parse_args()
 
-    assert args.langs or args.lang_file, "Must provide either --lang or --lang_file"
+    assert args.langs or args.lang_file, "Must provide either --langs or --lang_file"
 
     if os.path.isdir(args.input_path):
         images, names = load_from_folder(args.input_path, args.max, args.start_page)

pyproject.toml (+1 -1)

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "surya-ocr"
-version = "0.1.6"
+version = "0.2.0"
 description = "Document OCR models for multilingual text detection and recognition"
 authors = ["Vik Paruchuri <vik.paruchuri@gmail.com>"]
 readme = "README.md"

static/images/arabic.jpg (136 KB)
static/images/arabic_text.jpg (88.7 KB)
static/images/chi_hind.jpg (536 KB)
static/images/chi_hind_orig.jpg (445 KB)
static/images/chi_hind_text.jpg (224 KB)
static/images/funsd_text.jpg (-1.86 KB)
static/images/nyt.jpg (1.12 MB)
static/images/nyt.png (-2.2 MB)
static/images/nyt_text.jpg (267 KB)
static/images/nyt_text.png (-793 KB)
