
Commit 349f57a
Add demo images
1 parent e0785d1

16 files changed (+49, -44 lines)

README.md (+22, -23)
@@ -8,26 +8,31 @@ Surya is for multilingual document OCR. It can do:
 
 It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
 
-![New York Times Article Example](static/images/excerpt.png)
+Detection and OCR example:
 
-Surya is named after the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
+| Detection | OCR |
+|---------------------------------------------------------------:|:----------------------------------------------------------------------|
+| ![New York Times Article Detection](static/images/excerpt.png) | ![New York Times Article Recognition](static/images/excerpt_text.png) |
+
+
+Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
 
 ## Community
 
 [Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
 
 ## Examples
 
-| Name             | Text Detection                      |
-|------------------|-------------------------------------|
-| New York Times   | [Image](static/images/nyt.png)      |
-| Japanese         | [Image](static/images/japanese.png) |
-| Chinese          | [Image](static/images/chinese.png)  |
-| Hindi            | [Image](static/images/hindi.png)    |
-| Presentation     | [Image](static/images/pres.png)     |
-| Scientific Paper | [Image](static/images/paper.png)    |
-| Scanned Document | [Image](static/images/scanned.png)  |
-| Scanned Form     | [Image](static/images/funsd.png)    |
+| Name             | Text Detection                      | OCR                                       |
+|------------------|:-----------------------------------:|------------------------------------------:|
+| New York Times   | [Image](static/images/nyt.png)      | [Image](static/images/nyt_text.png)       |
+| Japanese         | [Image](static/images/japanese.png) | [Image](static/images/japanese_text.png)  |
+| Chinese          | [Image](static/images/chinese.png)  | [Image](static/images/chinese_text.png)   |
+| Hindi            | [Image](static/images/hindi.png)    | [Image](static/images/hindi_text.png)     |
+| Presentation     | [Image](static/images/pres.png)     | [Image](static/images/pres_text.png)      |
+| Scientific Paper | [Image](static/images/paper.png)    | [Image](static/images/paper_text.png)     |
+| Scanned Document | [Image](static/images/scanned.png)  | [Image](static/images/scanned_text.png)   |
+| Scanned Form     | [Image](static/images/funsd.png)    |                                           |
 
 # Installation
 

@@ -78,9 +83,7 @@ Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference
 Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.
 
 
-### From Python
-
-You can also do OCR from code with:
+### From python
 
 ```
 from PIL import Image
@@ -132,9 +135,7 @@ Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference
 You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.
 
 
-### From Python
-
-You can also do text detection from code with:
+### From python
 
 ```
 from PIL import Image
@@ -164,11 +165,9 @@ If you want to develop surya, you can install it manually:
 # Limitations
 
 - This is specialized for document OCR. It will likely not work on photos or other images.
-- It is for printed text, not handwriting.
+- It is for printed text, not handwriting (though it may work on some handwriting).
 - The model has trained itself to ignore advertisements.
 - You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.
-- Math will not be detected well with the main detector model. Use `DETECTOR_MODEL_CHECKPOINT=vikp/line_detector_math` for better results.
-
 
 # Benchmarks
 
@@ -193,11 +192,11 @@ Tesseract is CPU-based, and surya is CPU or GPU. I ran the benchmarks on a syst
 
 **Methodology**
 
-Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's also hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.
+Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.
 
 I instead used coverage, which calculates:
 
-- Precision - how well predicted bboxes cover ground truth bboxes
+- Precision - how well the predicted bboxes cover ground truth bboxes
 - Recall - how well ground truth bboxes cover predicted bboxes
 
 First calculate coverage for each bbox, then add a small penalty for double coverage, since we want the detection to have non-overlapping bboxes. Anything with a coverage of 0.5 or higher is considered a match.
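The coverage metric described in this hunk can be sketched as follows. This is a hypothetical simplification: `bbox_coverage`, `intersection_area`, and the sample boxes are illustrative, not the repo's benchmark code, and the double-coverage penalty mentioned above is omitted.

```python
# Illustrative coverage metric for axis-aligned line bboxes (x1, y1, x2, y2).
def intersection_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def bbox_coverage(target, covering):
    # Fraction of `target` covered by the other set of boxes
    # (the double-coverage penalty from the text is omitted for brevity).
    return min(sum(intersection_area(target, b) for b in covering) / area(target), 1.0)

preds = [(0, 0, 50, 10)]   # one predicted line bbox
truth = [(0, 0, 100, 10)]  # one ground truth line bbox

# Precision - how well the predicted bboxes cover ground truth bboxes.
precision = sum(bbox_coverage(t, preds) for t in truth) / len(truth)
# Recall - how well ground truth bboxes cover predicted bboxes.
recall = sum(bbox_coverage(p, truth) for p in preds) / len(preds)
print(precision, recall)  # 0.5 1.0
```

Per the methodology above, anything with a coverage of 0.5 or higher counts as a match, so this example box pair would still match on precision.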

benchmark/recognition.py (+13, -2)
@@ -22,7 +22,7 @@ def main():
     parser = argparse.ArgumentParser(description="Detect bboxes in a PDF.")
     parser.add_argument("--results_dir", type=str, help="Path to JSON file with OCR results.", default=os.path.join(settings.RESULT_DIR, "benchmark"))
     parser.add_argument("--max", type=int, help="Maximum number of pdf pages to OCR.", default=None)
-    parser.add_argument("--debug", action="store_true", help="Run in debug mode.", default=False)
+    parser.add_argument("--debug", type=int, help="Debug level - 1 dumps bad detection info, 2 writes out images.", default=0)
     parser.add_argument("--tesseract", action="store_true", help="Run tesseract instead of surya.", default=False)
     args = parser.parse_args()
 
@@ -54,8 +54,10 @@ def main():
     surya_time = time.time() - start
 
     surya_scores = defaultdict(list)
+    img_surya_scores = []
     for idx, (pred, ref_text, lang) in enumerate(zip(predictions_by_image, line_text, lang_list)):
         image_score = overlap_score(pred["text_lines"], ref_text)
+        img_surya_scores.append(image_score)
         for l in lang:
             surya_scores[CODE_TO_LANGUAGE[l]].append(image_score)
 
@@ -118,7 +120,16 @@ def main():
     print(tabulate(table_data, headers=table_headers, tablefmt="github"))
     print("Only a few major languages are displayed. See the result path for additional languages.")
 
-    if args.debug:
+    if args.debug >= 1:
+        bad_detections = []
+        for idx, (score, lang) in enumerate(zip(flat_surya_scores, lang_list)):
+            if score < .8:
+                bad_detections.append((idx, lang, score))
+        print(f"Found {len(bad_detections)} bad detections. Writing to file...")
+        with open(os.path.join(result_path, "bad_detections.json"), "w+") as f:
+            json.dump(bad_detections, f)
+
+    if args.debug == 2:
         for idx, (image, pred, ref_text, bbox, lang) in enumerate(zip(images, predictions_by_image, line_text, bboxes, lang_list)):
             pred_image_name = f"{'_'.join(lang)}_{idx}_pred.png"
             ref_image_name = f"{'_'.join(lang)}_{idx}_ref.png"
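The switch from a `store_true` flag to integer debug levels can be exercised in isolation. A minimal sketch, where the parser is a stand-in mirroring only the changed flag, not the benchmark's full argument list:

```python
import argparse

# Stand-in parser mirroring just the changed --debug flag.
parser = argparse.ArgumentParser()
parser.add_argument("--debug", type=int, default=0,
                    help="Debug level - 1 dumps bad detection info, 2 writes out images.")

args = parser.parse_args(["--debug", "2"])
# Level gates mirror the diff: >= 1 writes bad_detections.json, == 2 also writes images.
dump_bad_detections = args.debug >= 1
write_images = args.debug == 2
print(dump_bad_detections, write_images)  # True True
```

Note that level 2 implies level 1 here because the first branch tests `>= 1`, so both outputs are produced at `--debug 2`.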

Image files changed:

static/images/chinese_text.png (189 KB)
static/images/excerpt_text.png (115 KB)
static/images/hindi.png (27.5 KB)
static/images/hindi_text.png (182 KB)
static/images/japanese.png (248 KB)
static/images/japanese_text.png (403 KB)
static/images/nyt_text.png (793 KB)
static/images/paper_text.png (389 KB)
static/images/pres.png (-207 KB)
static/images/pres_text.png (148 KB)
static/images/scanned_text.png (300 KB)

surya/benchmark/tesseract.py (+6, -5)
@@ -3,6 +3,7 @@
 import numpy as np
 import pytesseract
 from pytesseract import Output
+from tqdm import tqdm
 
 from surya.input.processing import slice_bboxes_from_image
 from surya.settings import settings
@@ -37,11 +38,12 @@ def tesseract_ocr_parallel(imgs, bboxes, langs: List[str]):
     cpus = os.cpu_count()
     tess_parallel_cores = min(tess_parallel_cores, cpus)
 
-    # Tesseract uses 4 threads per instance
-    tess_parallel = max(tess_parallel_cores // 4, 1)
+    # Tesseract uses up to 4 processes per instance
+    # Divide by 2 because tesseract doesn't seem to saturate all 4 cores with these small images
+    tess_parallel = max(tess_parallel_cores // 2, 1)
 
     with ProcessPoolExecutor(max_workers=tess_parallel) as executor:
-        tess_text = executor.map(tesseract_ocr, imgs, bboxes, langs)
+        tess_text = tqdm(executor.map(tesseract_ocr, imgs, bboxes, langs), total=len(imgs), desc="Running tesseract OCR")
         tess_text = list(tess_text)
     return tess_text
 
@@ -71,7 +73,7 @@ def tesseract_parallel(imgs):
     tess_parallel = max(tess_parallel_cores // 4, 1)
 
     with ProcessPoolExecutor(max_workers=tess_parallel) as executor:
-        tess_bboxes = executor.map(tesseract_bboxes, imgs)
+        tess_bboxes = tqdm(executor.map(tesseract_bboxes, imgs), total=len(imgs), desc="Running tesseract bbox detection")
         tess_bboxes = list(tess_bboxes)
     return tess_bboxes
 
@@ -163,7 +165,6 @@ def tesseract_parallel(imgs):
     "tam": "Tamil",
     "tel": "Telugu",
     "tgk": "Tajik",
-    "tgl": "Tagalog",
     "tha": "Thai",
     "tir": "Tigrinya",
     "tur": "Turkish",
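Wrapping `executor.map` in `tqdm` works because `map` returns a lazy iterator, so the progress bar advances as each result completes rather than all at once at the end. A stdlib-only sketch of the same pattern, where a hand-rolled `progress` generator stands in for tqdm and a thread pool with a dummy `fake_ocr` stands in for the tesseract worker processes:

```python
from concurrent.futures import ThreadPoolExecutor

def progress(iterable, total, desc):
    # tqdm-like wrapper: report each item as the lazy map yields it.
    for i, item in enumerate(iterable, 1):
        print(f"{desc}: {i}/{total}")
        yield item

def fake_ocr(img):
    # Stand-in for the real tesseract_ocr worker.
    return f"text-{img}"

imgs = list(range(4))
with ThreadPoolExecutor(max_workers=2) as executor:
    results = progress(executor.map(fake_ocr, imgs), total=len(imgs), desc="Running OCR")
    # Consume inside the with-block, as the diff does with list(tess_text).
    results = list(results)
print(results)  # ['text-0', 'text-1', 'text-2', 'text-3']
```

`executor.map` preserves input order, which is why the diff can safely wrap it and then call `list` without reordering results.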

surya/input/processing.py (+7, -13)
@@ -1,3 +1,4 @@
+import os
 from typing import List
 
 import numpy as np
@@ -69,37 +70,30 @@ def slice_bboxes_from_image(image: Image.Image, bboxes):
 
 def slice_polys_from_image(image: Image.Image, polys):
     lines = []
-    for poly in polys:
-        lines.append(slice_and_pad_poly(image, poly))
+    for idx, poly in enumerate(polys):
+        lines.append(slice_and_pad_poly(image, poly, idx))
     return lines
 
 
-def slice_and_pad_poly(image: Image.Image, coordinates):
+def slice_and_pad_poly(image: Image.Image, coordinates, idx):
     # Create a mask for the polygon
     mask = Image.new('L', image.size, 0)
 
     # coordinates must be in tuple form for PIL
     coordinates = [(corner[0], corner[1]) for corner in coordinates]
     ImageDraw.Draw(mask).polygon(coordinates, outline=1, fill=1)
+    bbox = mask.getbbox()
     mask = np.array(mask)
 
     # Extract the polygonal area from the image
     polygon_image = np.array(image)
-    polygon_image[~mask] = 0
+    polygon_image[mask == 0] = 0
     polygon_image = Image.fromarray(polygon_image)
 
-    bbox_image = Image.new('L', image.size, 0)
-    ImageDraw.Draw(bbox_image).polygon(coordinates, outline=1, fill=1)
-    bbox = bbox_image.getbbox()
-
     rectangle = Image.new('RGB', (bbox[2] - bbox[0], bbox[3] - bbox[1]), 'white')
 
     # Paste the polygon into the rectangle
-    polygon_center = (bbox[2] + bbox[0]) // 2, (bbox[3] + bbox[1]) // 2
-    rectangle_center = rectangle.width // 2, rectangle.height // 2
-    paste_position = (rectangle_center[0] - polygon_center[0] + bbox[0],
-                      rectangle_center[1] - polygon_center[1] + bbox[1])
-    rectangle.paste(polygon_image.crop(bbox), paste_position)
+    rectangle.paste(polygon_image.crop(bbox), (0, 0))
 
     return rectangle
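The `polygon_image[~mask] = 0` to `polygon_image[mask == 0] = 0` change in this hunk is worth unpacking: `mask` comes from an `'L'` image, so it is uint8, and `~mask` is bitwise NOT (255/254), which numpy treats as integer fancy indices rather than a boolean mask; `mask == 0` produces a true boolean array. A small demonstration with made-up 1-D pixel values (not the repo's code):

```python
import numpy as np

mask = np.array([0, 1, 1, 0], dtype=np.uint8)
print((~mask).tolist())      # [255, 254, 254, 255] - bitwise NOT, not logical negation
print((mask == 0).tolist())  # [True, False, False, True] - a proper boolean mask

img = np.array([10, 20, 30, 40])
# img[~mask] would fancy-index with 255/254 and raise IndexError on this
# small array; the boolean comparison zeroes the masked-out pixels as intended.
img[mask == 0] = 0
print(img.tolist())  # [0, 20, 30, 0]
```

On a full-size image the integer indices may stay in bounds and silently select the wrong rows, which is why the boolean form is the safe fix.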

surya/ocr.py (+1, -1)
@@ -16,7 +16,7 @@ def run_recognition(images: List[Image.Image], langs: List[List[str]], rec_model
     slice_map = []
     all_slices = []
     all_langs = []
-    for idx, (image, lang) in tqdm(enumerate(zip(images, langs)), desc="Slicing images"):
+    for idx, (image, lang) in enumerate(zip(images, langs)):
         if polygons is not None:
             slices = slice_polys_from_image(image, polygons[idx])
         else:
