| Scanned Form |[Image](static/images/funsd.png)||

 # Installation

@@ -78,9 +83,7 @@ Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference
 Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.


-### From Python
-
-You can also do OCR from code with:
+### From python

```
from PIL import Image
@@ -132,9 +135,7 @@ Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference
 You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.


-### From Python
-
-You can also do text detection from code with:
+### From python

```
from PIL import Image
@@ -164,11 +165,9 @@ If you want to develop surya, you can install it manually:
 # Limitations

 - This is specialized for document OCR. It will likely not work on photos or other images.
-- It is for printed text, not handwriting.
+- It is for printed text, not handwriting (though it may work on some handwriting).
 - The model has trained itself to ignore advertisements.
 - You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.
-- Math will not be detected well with the main detector model. Use `DETECTOR_MODEL_CHECKPOINT=vikp/line_detector_math` for better results.
-

 # Benchmarks

@@ -193,11 +192,11 @@ Tesseract is CPU-based, and surya is CPU or GPU. I ran the benchmarks on a syst

 **Methodology**

-Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's also hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.
+Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.

 I instead used coverage, which calculates:

-- Precision - how well predicted bboxes cover ground truth bboxes
+- Precision - how well the predicted bboxes cover ground truth bboxes
 - Recall - how well ground truth bboxes cover predicted bboxes

First calculate coverage for each bbox, then add a small penalty for double coverage, since we want the detection to have non-overlapping bboxes. Anything with a coverage of 0.5 or higher is considered a match.
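
The coverage idea described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark code from this repository: the function names `box_coverage` and `match_rate`, the integer `(x1, y1, x2, y2)` box format, and the penalty weight of `0.1` are all assumptions made for the example. Only the 0.5 match threshold comes from the text.

```python
from typing import List, Tuple

# Hypothetical box format: integer pixel coordinates, x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def box_coverage(target: Box, coverers: List[Box], double_penalty: float = 0.1) -> float:
    """Fraction of `target` covered by `coverers`, minus a penalty for
    pixels covered by more than one box. The penalty weight is an assumption."""
    x1, y1, x2, y2 = target
    area = (x2 - x1) * (y2 - y1)
    if area <= 0:
        return 0.0
    covered = 0  # pixels hit by at least one covering box
    double = 0   # pixels hit by two or more covering boxes
    for y in range(y1, y2):
        for x in range(x1, x2):
            hits = sum(
                1
                for (cx1, cy1, cx2, cy2) in coverers
                if cx1 <= x < cx2 and cy1 <= y < cy2
            )
            if hits >= 1:
                covered += 1
            if hits >= 2:
                double += 1
    return max(0.0, (covered - double_penalty * double) / area)


def match_rate(targets: List[Box], coverers: List[Box], threshold: float = 0.5) -> float:
    """Share of `targets` whose coverage reaches the match threshold."""
    if not targets:
        return 0.0
    matches = sum(1 for t in targets if box_coverage(t, coverers) >= threshold)
    return matches / len(targets)


# Toy data: two ground truth lines, one predicted line.
ground_truth = [(0, 0, 10, 10), (20, 0, 30, 10)]
predicted = [(0, 0, 10, 6)]

gt_covered_by_pred = match_rate(ground_truth, predicted)
pred_covered_by_gt = match_rate(predicted, ground_truth)
```

Calling `match_rate` with the two box sets swapped gives the two directions that the text names precision and recall. The per-pixel loop is slow and is only meant to make the double-coverage penalty easy to follow.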