Commit 5b91009

Enable passing in languages from file
1 parent 07a715c commit 5b91009

8 files changed (+127 -35 lines)

README.md (+70 -12)

@@ -2,11 +2,11 @@
 Surya is a multilingual document OCR toolkit. It can do:

-- Accurate line-level text detection
-- Text recognition (coming soon)
+- Accurate line-level text detection in any language
+- Text recognition in 90+ languages
 - Table and chart detection (coming soon)

-It works on a range of documents and languages (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
+It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).

 ![New York Times Article Example](static/images/excerpt.png)

@@ -46,6 +46,62 @@ Model weights will automatically download the first time you run surya.
 - Inspect the settings in `surya/settings.py`. You can override any settings with environment variables.
 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. Note that the `mps` device has a bug (on the [Apple side](https://github.com/pytorch/pytorch/issues/84936)) that may prevent it from working properly.

+## OCR (text recognition)
+
+You can OCR text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
+
+```
+surya_ocr DATA_PATH --images --langs hi,en
+```
+
+- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
+- `--langs` specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from [here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Surya supports the 90+ languages found in `surya/languages.py`.
+- `--lang_file` lets you use a different language for each PDF/image. The format is a JSON dict with filenames as keys and a list of languages as values, like `{"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}`.
+- `--images` will save images of the pages and detected text lines (optional)
+- `--results_dir` specifies the directory to save results to instead of the default
+- `--max` specifies the maximum number of pages to process if you don't want to process everything
+- `--start_page` specifies the page number to start processing from
+
+The `results.json` file will contain these keys for each page of the input document(s):
+
+- `text_lines` - the detected text in each line
+- `polys` - the polygons for each detected text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
+- `bboxes` - the axis-aligned rectangles for each detected text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
+- `language` - the languages specified for the page
+- `name` - the name of the file
+- `page_number` - the page number in the file
+
+**Performance tips**
+
+Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default batch size is `256`, which will use about 10GB of VRAM.
+
+Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.
+
+### From Python
+
+You can also do OCR from code with:
+
+```
+from PIL import Image
+from surya.ocr import run_ocr
+from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
+from surya.model.recognition.model import load_model as load_rec_model
+from surya.model.recognition.processor import load_processor as load_rec_processor
+
+image = Image.open(IMAGE_PATH)
+langs = ["en"]  # Replace with your languages
+
+det_processor = load_det_processor()
+det_model = load_det_model()
+
+rec_model = load_rec_model()
+rec_processor = load_rec_processor()
+
+predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
+```
+
 ## Text line detection

 You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes, and optionally save images of the pages with the bboxes.
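
As a quick illustration of the `results.json` schema documented in the hunk above, here is a minimal sketch (not part of the commit) that reads such a file and prints each detected line with its bbox. It assumes the top level is a list of per-page dicts with exactly the keys listed, and the path is hypothetical.

```
import json

# Hypothetical output path - surya_ocr writes results under --results_dir.
with open("results/surya/my_doc/results.json") as f:
    pages = json.load(f)

for page in pages:
    print(page["name"], "page", page["page_number"], "languages:", page["language"])
    for text, bbox in zip(page["text_lines"], page["bboxes"]):
        x1, y1, x2, y2 = bbox  # axis-aligned rectangle: top-left and bottom-right corners
        print(f"  ({x1}, {y1})-({x2}, {y2}): {text}")
```
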
@@ -75,6 +131,7 @@ Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference
 You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.

+
 ### From Python

 You can also do text detection from code with:
@@ -91,10 +148,6 @@ model, processor = load_model(), load_processor()
 predictions = batch_detection([image], model, processor)
 ```

-## Text recognition
-
-Coming soon.
-
 ## Table and chart detection

 Coming soon.
@@ -113,10 +166,14 @@ If you want to develop surya, you can install it manually:
 - This is specialized for document OCR. It will likely not work on photos or other images.
 - It is for printed text, not handwriting.
 - The model has trained itself to ignore advertisements.
-- This has worked for every language I've tried, but languages with very different character sets may not work well.
+- You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.

 # Benchmarks

+## OCR
+
+Coming soon.
+
 ## Text line detection

 ![Benchmark chart](static/images/benchmark_chart_small.png)
@@ -168,13 +225,13 @@ python benchmark/detection.py --max 256
 # Training

-This was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.
+The text detection model was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.

-# Commercial usage
+Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).

-**Text detection**
+# Commercial usage

-The text detection model was trained from scratch, so it's okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.
+The text detection and OCR models were trained from scratch, so they're okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.

 If you want to remove the GPL license requirements for inference or use the weights commercially over the revenue limit, please contact me at surya@vikas.sh for dual licensing.

@@ -183,6 +240,7 @@ If you want to remove the GPL license requirements for inference or use the weig
 This work would not have been possible without amazing open source AI work:

 - [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
+- [Donut](https://github.com/clovaai/donut) from Naver
 - [transformers](https://github.com/huggingface/transformers) from huggingface
 - [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model

detect_text.py (+1 -1)

@@ -15,7 +15,7 @@

 def main():
     parser = argparse.ArgumentParser(description="Detect bboxes in an input file or folder (PDFs or image).")
-    parser.add_argument("input_path", type=str, help="Path to pdf or image file to detect bboxes in.")
+    parser.add_argument("input_path", type=str, help="Path to pdf or image file or folder to detect bboxes in.")
     parser.add_argument("--results_dir", type=str, help="Path to JSON file with OCR results.", default=os.path.join(settings.RESULT_DIR, "surya"))
     parser.add_argument("--max", type=int, help="Maximum number of pages to process.", default=None)
     parser.add_argument("--images", action="store_true", help="Save images of detected bboxes.", default=False)

ocr_text.py (+25 -19)

@@ -2,42 +2,30 @@
 import json
 from collections import defaultdict

-from surya.input.load import load_from_folder, load_from_file
+from surya.input.langs import replace_lang_with_code, get_unique_langs
+from surya.input.load import load_from_folder, load_from_file, load_lang_file
 from surya.model.detection.segformer import load_model as load_detection_model, load_processor as load_detection_processor
 from surya.model.recognition.model import load_model as load_recognition_model
 from surya.model.recognition.processor import load_processor as load_recognition_processor
 from surya.model.recognition.tokenizer import _tokenize
 from surya.ocr import run_ocr
 from surya.postprocessing.text import draw_text_on_image
 from surya.settings import settings
-from surya.languages import LANGUAGE_TO_CODE, CODE_TO_LANGUAGE
 import os


 def main():
     parser = argparse.ArgumentParser(description="Detect bboxes in an input file or folder (PDFs or image).")
-    parser.add_argument("input_path", type=str, help="Path to pdf or image file to detect bboxes in.")
+    parser.add_argument("input_path", type=str, help="Path to pdf or image file or folder to detect bboxes in.")
     parser.add_argument("--results_dir", type=str, help="Path to JSON file with OCR results.", default=os.path.join(settings.RESULT_DIR, "surya"))
     parser.add_argument("--max", type=int, help="Maximum number of pages to process.", default=None)
     parser.add_argument("--start_page", type=int, help="Page to start processing at.", default=0)
     parser.add_argument("--images", action="store_true", help="Save images of detected bboxes.", default=False)
-    parser.add_argument("--lang", type=str, help="Language to use for OCR. Comma separate for multiple.", default="en")
+    parser.add_argument("--langs", type=str, help="Language(s) to use for OCR. Comma separate for multiple. Can be a capitalized language name, or a 2-letter ISO 639 code.", default=None)
+    parser.add_argument("--lang_file", type=str, help="Path to file with languages to use for OCR. Should be a JSON dict with file names as keys, and the value being a list of language codes/names.", default=None)
     args = parser.parse_args()

-    # Split and validate language codes
-    langs = args.lang.split(",")
-    for i in range(len(langs)):
-        if langs[i] in LANGUAGE_TO_CODE:
-            langs[i] = LANGUAGE_TO_CODE[langs[i]]
-        if langs[i] not in CODE_TO_LANGUAGE:
-            raise ValueError(f"Language code {langs[i]} not found.")
-
-    det_processor = load_detection_processor()
-    det_model = load_detection_model()
-
-    _, lang_tokens = _tokenize("", langs)
-    rec_model = load_recognition_model(langs=lang_tokens)  # Prune model moes to only include languages we need
-    rec_processor = load_recognition_processor()
+    assert args.langs or args.lang_file, "Must provide either --langs or --lang_file"

     if os.path.isdir(args.input_path):
         images, names = load_from_folder(args.input_path, args.max, args.start_page)
@@ -46,10 +34,28 @@ def main():
         images, names = load_from_file(args.input_path, args.max, args.start_page)
         folder_name = os.path.basename(args.input_path).split(".")[0]

+    if args.lang_file:
+        # We got all of our language settings from a file
+        langs = load_lang_file(args.lang_file, names)
+        for lang in langs:
+            replace_lang_with_code(lang)
+        image_langs = langs
+    else:
+        # We got our language settings from the input
+        langs = args.langs.split(",")
+        replace_lang_with_code(langs)
+        image_langs = [langs] * len(images)
+
+    det_processor = load_detection_processor()
+    det_model = load_detection_model()
+
+    _, lang_tokens = _tokenize("", get_unique_langs(image_langs))
+    rec_model = load_recognition_model(langs=lang_tokens)  # Prune model moe layer to only include languages we need
+    rec_processor = load_recognition_processor()
+
     result_path = os.path.join(args.results_dir, folder_name)
     os.makedirs(result_path, exist_ok=True)

-    image_langs = [langs] * len(images)
     predictions_by_image = run_ocr(images, image_langs, det_model, det_processor, rec_model, rec_processor)

     page_num = defaultdict(int)
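
To make the per-image language handling above concrete, here is a minimal sketch (not from the commit) of driving `run_ocr` directly with a different language list per image, mirroring the `--lang_file` path of `main()`; the image file names are hypothetical.

```
from PIL import Image
from surya.input.langs import replace_lang_with_code
from surya.ocr import run_ocr
from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

images = [Image.open("english_page.png"), Image.open("hindi_page.png")]
image_langs = [["en"], ["Hindi"]]  # one language list per image
for langs in image_langs:
    replace_lang_with_code(langs)  # normalizes names like "Hindi" to ISO codes in place

det_model, det_processor = load_det_model(), load_det_processor()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

predictions_by_image = run_ocr(images, image_langs, det_model, det_processor, rec_model, rec_processor)
```
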

pyproject.toml (+1)

@@ -38,6 +38,7 @@ arabic-reshaper = "^3.0.0"

 [tool.poetry.scripts]
 surya_detect = "detect_text:main"
+surya_ocr = "ocr_text:main"

 [build-system]
 requires = ["poetry-core"]

surya/benchmark/tesseract.py (+2 -1)

@@ -4,6 +4,7 @@
 from surya.settings import settings
 import os
 from concurrent.futures import ProcessPoolExecutor
+from surya.detection import get_batch_size


 def tesseract_bboxes(img):
@@ -23,7 +24,7 @@ def tesseract_bboxes(img):

 def tesseract_parallel(imgs):
     # Tesseract uses 4 threads per instance
-    tess_parallel_cores = min(len(imgs), settings.DETECTOR_BATCH_SIZE)
+    tess_parallel_cores = min(len(imgs), get_batch_size())
     cpus = os.cpu_count()
     tess_parallel_cores = min(tess_parallel_cores, cpus)

surya/input/langs.py (+19)

@@ -0,0 +1,19 @@
+from typing import List
+from surya.languages import LANGUAGE_TO_CODE, CODE_TO_LANGUAGE
+
+
+def replace_lang_with_code(langs: List[str]):
+    for i in range(len(langs)):
+        if langs[i] in LANGUAGE_TO_CODE:
+            langs[i] = LANGUAGE_TO_CODE[langs[i]]
+        if langs[i] not in CODE_TO_LANGUAGE:
+            raise ValueError(f"Language code {langs[i]} not found.")
+
+
+def get_unique_langs(langs: List[List[str]]):
+    uniques = []
+    for lang_list in langs:
+        for lang in lang_list:
+            if lang not in uniques:
+                uniques.append(lang)
+    return uniques
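
For reference, a small sketch (not in the commit) of how these two helpers behave, assuming "English" is a key in `LANGUAGE_TO_CODE` that maps to `"en"`:

```
from surya.input.langs import replace_lang_with_code, get_unique_langs

page_langs = [["English", "hi"], ["en"]]  # one language list per page/image
for langs in page_langs:
    replace_lang_with_code(langs)  # mutates the list in place, e.g. "English" -> "en"

print(page_langs)                    # [["en", "hi"], ["en"]]
print(get_unique_langs(page_langs))  # ["en", "hi"] - deduplicated, order preserved
```
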

surya/input/load.py (+8 -1)

@@ -2,6 +2,7 @@
 import os
 import filetype
 from PIL import Image
+import json


 def get_name_from_path(path):
@@ -57,4 +58,10 @@ def load_from_folder(folder_path, max_pages=None, start_page=None):
         image, name = load_image(path)
         images.extend(image)
         names.extend(name)
-    return images, names
+    return images, names
+
+
+def load_lang_file(lang_path, names):
+    with open(lang_path, "r") as f:
+        lang_dict = json.load(f)
+    return [lang_dict[name].copy() for name in names]
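
A short sketch (not part of the commit) of producing a language file that `load_lang_file` can consume; the file names are hypothetical and must match the names surya derives from your inputs (the README example uses plain filenames as keys).

```
import json

# Keys are input file names, values are lists of language names/codes.
lang_file = {
    "report.pdf": ["English", "hi"],
    "receipt.png": ["en"],
}
with open("languages.json", "w") as f:
    json.dump(lang_file, f, indent=2)

# surya_ocr can then be invoked with: surya_ocr DATA_PATH --lang_file languages.json
```
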

surya/settings.py (+1 -1)

@@ -51,7 +51,7 @@ def TORCH_DEVICE_DETECTION(self) -> str:
     DETECTOR_NMS_THRESHOLD: float = 0.35  # Threshold for non-maximum suppression

     # Text recognition
-    RECOGNITION_MODEL_CHECKPOINT: str = "vikp/rec_test_utf16m"
+    RECOGNITION_MODEL_CHECKPOINT: str = "vikp/text_recognizer_test"
     RECOGNITION_MAX_TOKENS: int = 160
     RECOGNITION_BATCH_SIZE: Optional[int] = None  # Defaults to 8 for CPU/MPS, 256 otherwise
     RECOGNITION_IMAGE_SIZE: Dict = {"height": 196, "width": 896}
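
Since settings can be overridden with environment variables (per the README note above), here is a sketch of overriding the recognition batch size and checkpoint from Python. It assumes the environment is read when `surya.settings` is first imported, so the overrides must be set before any surya import.

```
import os

# Assumed override mechanism: set env vars before surya reads its settings.
os.environ["RECOGNITION_BATCH_SIZE"] = "512"  # ~40MB of VRAM per batch item, per the README
os.environ["RECOGNITION_MODEL_CHECKPOINT"] = "vikp/text_recognizer_test"  # value set in this commit

from surya.settings import settings

print(settings.RECOGNITION_BATCH_SIZE, settings.RECOGNITION_MODEL_CHECKPOINT)
```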
