You can OCR text in an image, pdf, or folder of images/pdfs with the following command. This will write out a JSON file with the detected text and bboxes, and optionally save images of the reconstructed page.
```
surya_ocr DATA_PATH --images --langs hi,en
```
**Performance tips**
Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `50MB` of VRAM, so very high batch sizes are possible. The default batch size is `256`, which will use about 12.8GB of VRAM. Depending on your CPU core count, it may also help on CPU - the default CPU batch size is `32`.
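As a quick sanity check on those numbers, here is a small sketch (the helper function is my own, not part of surya; it just applies the ~`50MB`-per-item figure from above):

```python
def estimate_vram_gb(batch_size: int, mb_per_item: int = 50) -> float:
    """Rough VRAM estimate for recognition, assuming ~50MB per batch item."""
    return batch_size * mb_per_item / 1000

# Default GPU batch size of 256 -> about 12.8GB of VRAM.
print(estimate_vram_gb(256))  # 12.8
# Default CPU batch size of 32 would correspond to ~1.6GB by the same rule.
print(estimate_vram_gb(32))   # 1.6
```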
- Try increasing the resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than `2048px` in width.
- Preprocessing the image (binarizing, deskewing, etc.) can help with very old/blurry images.
- You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines - any prediction below this number will be considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined - any number above this is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).
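The thresholds above are read from environment variables, so one way to experiment is to set them programmatically before running detection. This is a sketch; the validation helper is my own, not part of surya - it only enforces the constraints described in the tip:

```python
import os

def set_detector_thresholds(blank: float, text: float) -> None:
    """Set detector thresholds, enforcing the constraints described above:
    both in the 0-1 range, and the text threshold strictly higher."""
    if not (0.0 <= blank <= 1.0 and 0.0 <= text <= 1.0):
        raise ValueError("Both thresholds must be in the 0-1 range")
    if text <= blank:
        raise ValueError(
            "DETECTOR_TEXT_THRESHOLD must be higher than DETECTOR_BLANK_THRESHOLD"
        )
    os.environ["DETECTOR_BLANK_THRESHOLD"] = str(blank)
    os.environ["DETECTOR_TEXT_THRESHOLD"] = str(text)

# Example: lower both thresholds if faint box-like shapes show up in the heatmap.
set_detector_thresholds(blank=0.25, text=0.5)
```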
## OCR
Tesseract is CPU-based, and surya is CPU or GPU. I tried to cost-match the resources used, so I used a 1xA6000 (48GB VRAM) for surya, and 28 CPU cores for Tesseract (same price on Lambda Labs/DigitalOcean).
**Methodology**
I measured normalized edit distance (0-1, lower is better) on a set of real-world and synthetic PDFs. I sampled PDFs from Common Crawl, then filtered out the ones with bad OCR. I couldn't find PDFs for some languages, so I also generated simple synthetic PDFs for those.
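For reference, normalized edit distance is commonly defined as the Levenshtein distance divided by the length of the longer string. This sketch implements that standard definition - it is not necessarily the exact implementation used for the benchmark:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (0-1, lower is better)."""
    m, n = len(pred), len(ref)
    if m == 0 and n == 0:
        return 0.0
    # Standard dynamic-programming Levenshtein, keeping only one previous row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / max(m, n)

print(normalized_edit_distance("kitten", "sitting"))  # distance 3 over length 7, ~0.43
```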
I used the reference line bboxes from the PDFs with both Tesseract and surya, so that only OCR quality is evaluated (not line detection).
In `ocr_text.py`, the assertion message now matches the actual flag name (`--langs`):

```
parser.add_argument("--lang_file", type=str, help="Path to file with languages to use for OCR. Should be a JSON dict with file names as keys, and the value being a list of language codes/names.", default=None)
args = parser.parse_args()

assert args.langs or args.lang_file, "Must provide either --langs or --lang_file"
```