README.md (+70, -12)
@@ -2,11 +2,11 @@
 Surya is a multilingual document OCR toolkit. It can do:

-- Accurate line-level text detection
-- Text recognition (coming soon)
+- Accurate line-level text detection in any language
+- Text recognition in 90+ languages
 - Table and chart detection (coming soon)

-It works on a range of documents and languages (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
+It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
@@ -46,6 +46,62 @@ Model weights will automatically download the first time you run surya.
 - Inspect the settings in `surya/settings.py`. You can override any settings with environment variables.
 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. Note that the `mps` device has a bug (on the [Apple side](https://github.com/pytorch/pytorch/issues/84936)) that may prevent it from working properly.

+## OCR (text recognition)
+
+You can OCR text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
+
+```
+surya_ocr DATA_PATH --images --langs hi,en
+```
+
+- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
+- `--langs` specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from [here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Surya supports the 90+ languages found in `surya/languages.py`.
+- `--lang_file` lets you specify a different language for each PDF/image. The format is a JSON dict with the keys being filenames and the values as a list, like `{"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}`.
+- `--images` will save images of the pages and detected text lines (optional)
+- `--results_dir` specifies the directory to save results to instead of the default
+- `--max` specifies the maximum number of pages to process if you don't want to process everything
+- `--start_page` specifies the page number to start processing from
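For `--lang_file`, the JSON dict can be written with a couple of lines of Python; the file names and languages below are purely illustrative:

```python
import json

# Example mapping of input files to the languages they contain
# (file names and language codes here are made up).
langs_by_file = {
    "file1.pdf": ["en", "hi"],
    "file2.pdf": ["en"],
}

# Write the mapping in the format --lang_file expects
with open("langs.json", "w") as f:
    json.dump(langs_by_file, f)
```

You would then pass it as `surya_ocr DATA_PATH --lang_file langs.json`.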
+The `results.json` file will contain these keys for each page of the input document(s):
+
+- `text_lines` - the detected text in each line
+- `polys` - the polygons for each detected text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
+- `bboxes` - the axis-aligned rectangles for each detected text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
+- `language` - the languages specified for the page
+- `name` - the name of the file
+- `page_number` - the page number in the file
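The `bboxes` entries are the axis-aligned hulls of the corresponding `polys`. A minimal sketch of that relationship (the sample polygon is made up):

```python
def poly_to_bbox(poly):
    """Convert a 4-point polygon [(x1, y1), ..., (x4, y4)] to (x1, y1, x2, y2)."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    # Top-left corner is (min x, min y); bottom-right is (max x, max y)
    return min(xs), min(ys), max(xs), max(ys)

# Example polygon in clockwise order from the top left
poly = [(10, 20), (110, 22), (112, 45), (12, 43)]
print(poly_to_bbox(poly))  # (10, 20, 112, 45)
```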
+**Performance tips**
+
+Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default batch size is `256`, which will use about 10GB of VRAM.
+
+Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.
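As a sanity check on those numbers, 256 × 40MB ≈ 10GB. A back-of-the-envelope helper for sizing the batch to your card (this is illustrative arithmetic, not part of surya):

```python
def suggest_batch_size(free_vram_gb, mb_per_item=40, headroom=0.8):
    """Estimate a RECOGNITION_BATCH_SIZE from free VRAM, keeping 20% headroom."""
    return int(free_vram_gb * 1024 * headroom // mb_per_item)

print(suggest_batch_size(12))  # 12GB card -> 245
```

You would then export it, e.g. `RECOGNITION_BATCH_SIZE=245`.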
+### From Python
+
+You can also do OCR from code with:
+
+```
+from PIL import Image
+from surya.ocr import run_ocr
+from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
+from surya.model.recognition.model import load_model as load_rec_model
+from surya.model.recognition.processor import load_processor as load_rec_processor
+
+image = Image.open(IMAGE_PATH)
+langs = ["en"]  # Replace with your languages
+det_model, det_processor = load_det_model(), load_det_processor()
+rec_model, rec_processor = load_rec_model(), load_rec_processor()
+
+predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
+```
@@ -75,6 +131,7 @@ Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference
 You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes, and optionally save images of the pages with the bboxes.

 You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.
-This was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.
+The text detection model was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.
+
+Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).

 # Commercial usage

-**Text detection**
-
-The text detection model was trained from scratch, so it's okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.
+The text detection and OCR models were trained from scratch, so they're okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.

 If you want to remove the GPL license requirements for inference or use the weights commercially over the revenue limit, please contact me at surya@vikas.sh for dual licensing.
@@ -183,6 +240,7 @@ If you want to remove the GPL license requirements for inference or use the weig
 This work would not have been possible without amazing open source AI work:

 - [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
+- [Donut](https://github.com/clovaai/donut) from Naver
 - [transformers](https://github.com/huggingface/transformers) from huggingface
 - [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model
 parser = argparse.ArgumentParser(description="Detect bboxes in an input file or folder (PDFs or image).")
-parser.add_argument("input_path", type=str, help="Path to pdf or image file to detect bboxes in.")
+parser.add_argument("input_path", type=str, help="Path to pdf or image file or folder to detect bboxes in.")
 parser.add_argument("--results_dir", type=str, help="Path to JSON file with OCR results.", default=os.path.join(settings.RESULT_DIR, "surya"))
 parser.add_argument("--max", type=int, help="Maximum number of pages to process.", default=None)
 parser.add_argument("--start_page", type=int, help="Page to start processing at.", default=0)
 parser.add_argument("--images", action="store_true", help="Save images of detected bboxes.", default=False)
-parser.add_argument("--lang", type=str, help="Language to use for OCR. Comma separate for multiple.", default="en")
+parser.add_argument("--langs", type=str, help="Language(s) to use for OCR. Comma separate for multiple. Can be a capitalized language name, or a 2-letter ISO 639 code.", default=None)
+parser.add_argument("--lang_file", type=str, help="Path to file with languages to use for OCR. Should be a JSON dict with file names as keys, and the value being a list of language codes/names.", default=None)
 args = parser.parse_args()

-# Split and validate language codes
-langs = args.lang.split(",")
-for i in range(len(langs)):
-    if langs[i] in LANGUAGE_TO_CODE:
-        langs[i] = LANGUAGE_TO_CODE[langs[i]]
-    if langs[i] not in CODE_TO_LANGUAGE:
-        raise ValueError(f"Language code {langs[i]} not found.")
-
-det_processor = load_detection_processor()
-det_model = load_detection_model()
-
-_, lang_tokens = _tokenize("", langs)
-rec_model = load_recognition_model(langs=lang_tokens)  # Prune model moes to only include languages we need
-rec_processor = load_recognition_processor()
+assert args.langs or args.lang_file, "Must provide either --langs or --lang_file"
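The validation loop removed above normalizes language names to ISO codes via the `LANGUAGE_TO_CODE` / `CODE_TO_LANGUAGE` tables from `surya/languages.py`. A self-contained sketch of the same logic, with a toy two-entry mapping standing in for the real tables:

```python
# Toy stand-ins for surya.languages.LANGUAGE_TO_CODE / CODE_TO_LANGUAGE
LANGUAGE_TO_CODE = {"English": "en", "Hindi": "hi"}
CODE_TO_LANGUAGE = {v: k for k, v in LANGUAGE_TO_CODE.items()}

def normalize_langs(lang_arg):
    """Split a comma-separated --langs value and normalize names to ISO codes."""
    langs = lang_arg.split(",")
    for i in range(len(langs)):
        # Map a capitalized language name to its 2-letter code
        if langs[i] in LANGUAGE_TO_CODE:
            langs[i] = LANGUAGE_TO_CODE[langs[i]]
        # Anything left must already be a known code
        if langs[i] not in CODE_TO_LANGUAGE:
            raise ValueError(f"Language code {langs[i]} not found.")
    return langs

print(normalize_langs("English,hi"))  # ['en', 'hi']
```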