# Surya

Surya is for multilingual document OCR. It can do:

- Accurate OCR in 90+ languages
- Line-level text detection in any language
- Table and chart detection (coming soon)

It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).

Install with:

```
pip install surya-ocr
```

Model weights will automatically download the first time you run surya. Note that this does not work with the latest version of transformers (`4.37+`) [yet](https://github.com/huggingface/transformers/issues/28846#issuecomment-1926109135), so you will need to stay on `4.36.2`, which is installed with surya.

# Usage

- Inspect the settings in `surya/settings.py`. You can override any setting with an environment variable.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. For text detection, the `mps` device has a bug (on the [Apple side](https://github.com/pytorch/pytorch/issues/84936)) that may prevent it from working properly.
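
The settings-override behavior can be illustrated with a small sketch of the usual environment-variable pattern. This is illustrative only and is not surya's actual `settings.py` code:

```python
import os

# Sketch of the override pattern described above: each setting falls back
# to a default unless an environment variable of the same name is set.
# Illustrative only; surya/settings.py's real mechanism may differ.
def get_setting(name: str, default: str) -> str:
    return os.environ.get(name, default)

# Equivalent to running with TORCH_DEVICE=cuda in the environment.
os.environ["TORCH_DEVICE"] = "cuda"
print(get_setting("TORCH_DEVICE", "cpu"))
```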

## OCR (text recognition)

You can detect text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.

```
surya_ocr DATA_PATH --images --langs hi,en
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs.
- `--langs` specifies the language(s) to use for OCR. You can comma separate multiple languages (I don't recommend using more than `4`). Use the language name or two-letter ISO code from [here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Surya supports the 90+ languages found in `surya/languages.py`.
- `--lang_file` lets you use a different language for each PDF/image. The format is a JSON dict with filenames as keys and lists of languages as values, like `{"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}`.
- `--images` will save images of the pages and detected text lines (optional).
- `--results_dir` specifies the directory to save results to instead of the default.
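
For many input files, the `--lang_file` JSON can be generated with a few lines of Python. The filenames below are placeholders:

```python
import json

# Map each input file to its languages, matching the --lang_file format
# shown above. The filenames here are placeholders for your own documents.
langs = {
    "file1.pdf": ["en", "hi"],
    "file2.pdf": ["en"],
}

with open("langs.json", "w") as f:
    json.dump(langs, f, indent=2)
```

You would then point `--lang_file` at the generated `langs.json`.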
- `--pdf_path` will let you specify a pdf to benchmark instead of the default data.
- `--results_dir` will let you specify a directory to save results to instead of the default one.

**Text recognition**

This will evaluate surya and optionally tesseract on multilingual pdfs from common crawl.

```
python benchmark/recognition.py --max 256
```

- `--max` controls how many images to process for the benchmark.
- `--debug` will render images with detected text.
- `--results_dir` will let you specify a directory to save results to instead of the default one.
- `--tesseract` will run the benchmark with tesseract. You have to run `sudo apt-get install tesseract-ocr-all` to install all tesseract data, and set `TESSDATA_PREFIX` to the path to the tesseract data folder.

# Training

Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.

Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).