Commit 5b91009

Enable passing in languages from file
1 parent 07a715c commit 5b91009

8 files changed (+127 -35 lines)

README.md (+70 -12)

@@ -2,11 +2,11 @@
 Surya is a multilingual document OCR toolkit. It can do:

-- Accurate line-level text detection
-- Text recognition (coming soon)
+- Accurate line-level text detection in any language
+- Text recognition in 90+ languages
 - Table and chart detection (coming soon)

-It works on a range of documents and languages (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
+It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).

 ![New York Times Article Example](static/images/excerpt.png)

@@ -46,6 +46,62 @@ Model weights will automatically download the first time you run surya.
 - Inspect the settings in `surya/settings.py`. You can override any settings with environment variables.
 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. Note that the `mps` device has a bug (on the [Apple side](https://github.com/pytorch/pytorch/issues/84936)) that may prevent it from working properly.

+## OCR (text recognition)
+
+You can OCR text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
+
+```
+surya_ocr DATA_PATH --images --langs hi,en
+```
+
+- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
+- `--langs` specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from [here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Surya supports the 90+ languages found in `surya/languages.py`.
+- `--lang_file` lets you use a different language for each PDF/image. The format is a JSON dict with filenames as keys and a list of languages as values, like `{"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}`.
+- `--images` will save images of the pages and detected text lines (optional)
+- `--results_dir` specifies the directory to save results to instead of the default
+- `--max` specifies the maximum number of pages to process if you don't want to process everything
+- `--start_page` specifies the page number to start processing from
+
+The `results.json` file will contain these keys for each page of the input document(s):
+
+- `text_lines` - the detected text in each line
+- `polys` - the polygons for each detected text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
+- `bboxes` - the axis-aligned rectangles for each detected text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
+- `language` - the languages specified for the page
+- `name` - the name of the file
+- `page_number` - the page number in the file
+
+**Performance tips**
+
+Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default batch size is `256`, which will use about 10GB of VRAM.
+
+Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.
+
+### From Python
+
+You can also do OCR from code with:
+
+```
+from PIL import Image
+from surya.ocr import run_ocr
+from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
+from surya.model.recognition.model import load_model as load_rec_model
+from surya.model.recognition.processor import load_processor as load_rec_processor
+
+image = Image.open(IMAGE_PATH)
+langs = ["en"]  # Replace with your languages
+
+det_processor = load_det_processor()
+det_model = load_det_model()
+
+rec_model = load_rec_model()
+rec_processor = load_rec_processor()
+
+predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
+```
+
 ## Text line detection

 You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes, and optionally save images of the pages with the bboxes.
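
As a quick illustration of the `results.json` schema documented in the hunk above, here is a minimal sketch (not part of the commit) that reads such a file and prints each detected line with its bbox. It assumes the top level is a list of per-page dicts with exactly the keys listed, and the path is hypothetical.

```
import json

# Hypothetical output path - surya_ocr writes results under --results_dir.
with open("results/surya/my_doc/results.json") as f:
    pages = json.load(f)

for page in pages:
    print(page["name"], "page", page["page_number"], "languages:", page["language"])
    for text, bbox in zip(page["text_lines"], page["bboxes"]):
        x1, y1, x2, y2 = bbox  # axis-aligned rectangle: top-left and bottom-right corners
        print(f"  ({x1}, {y1})-({x2}, {y2}): {text}")
```
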
@@ -75,6 +131,7 @@ Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference
 You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.

+
 ### From Python

 You can also do text detection from code with:
@@ -91,10 +148,6 @@ model, processor = load_model(), load_processor()
 predictions = batch_detection([image], model, processor)
 ```

-## Text recognition
-
-Coming soon.
-
 ## Table and chart detection

 Coming soon.
@@ -113,10 +166,14 @@ If you want to develop surya, you can install it manually:
 - This is specialized for document OCR. It will likely not work on photos or other images.
 - It is for printed text, not handwriting.
 - The model has trained itself to ignore advertisements.
-- This has worked for every language I've tried, but languages with very different character sets may not work well.
+- You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.

 # Benchmarks

+## OCR
+
+Coming soon.
+
 ## Text line detection

 ![Benchmark chart](static/images/benchmark_chart_small.png)
@@ -168,13 +225,13 @@ python benchmark/detection.py --max 256
 # Training

-This was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.
+The text detection model was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.

-# Commercial usage
+Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).

-**Text detection**
+# Commercial usage

-The text detection model was trained from scratch, so it's okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.
+The text detection and OCR models were trained from scratch, so they're okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.

 If you want to remove the GPL license requirements for inference or use the weights commercially over the revenue limit, please contact me at surya@vikas.sh for dual licensing.

@@ -183,6 +240,7 @@ If you want to remove the GPL license requirements for inference or use the weig
 This work would not have been possible without amazing open source AI work:

 - [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
+- [Donut](https://github.com/clovaai/donut) from Naver
 - [transformers](https://github.com/huggingface/transformers) from huggingface
 - [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model

detect_text.py (+1 -1)

@@ -15,7 +15,7 @@

 def main():
     parser = argparse.ArgumentParser(description="Detect bboxes in an input file or folder (PDFs or image).")
-    parser.add_argument("input_path", type=str, help="Path to pdf or image file to detect bboxes in.")
+    parser.add_argument("input_path", type=str, help="Path to pdf or image file or folder to detect bboxes in.")
     parser.add_argument("--results_dir", type=str, help="Path to JSON file with OCR results.", default=os.path.join(settings.RESULT_DIR, "surya"))
     parser.add_argument("--max", type=int, help="Maximum number of pages to process.", default=None)
     parser.add_argument("--images", action="store_true", help="Save images of detected bboxes.", default=False)

ocr_text.py (+25 -19)

@@ -2,42 +2,30 @@
 import json
 from collections import defaultdict

-from surya.input.load import load_from_folder, load_from_file
+from surya.input.langs import replace_lang_with_code, get_unique_langs
+from surya.input.load import load_from_folder, load_from_file, load_lang_file
 from surya.model.detection.segformer import load_model as load_detection_model, load_processor as load_detection_processor
 from surya.model.recognition.model import load_model as load_recognition_model
 from surya.model.recognition.processor import load_processor as load_recognition_processor
 from surya.model.recognition.tokenizer import _tokenize
 from surya.ocr import run_ocr
 from surya.postprocessing.text import draw_text_on_image
 from surya.settings import settings
-from surya.languages import LANGUAGE_TO_CODE, CODE_TO_LANGUAGE
 import os


 def main():
     parser = argparse.ArgumentParser(description="Detect bboxes in an input file or folder (PDFs or image).")
-    parser.add_argument("input_path", type=str, help="Path to pdf or image file to detect bboxes in.")
+    parser.add_argument("input_path", type=str, help="Path to pdf or image file or folder to detect bboxes in.")
     parser.add_argument("--results_dir", type=str, help="Path to JSON file with OCR results.", default=os.path.join(settings.RESULT_DIR, "surya"))
     parser.add_argument("--max", type=int, help="Maximum number of pages to process.", default=None)
     parser.add_argument("--start_page", type=int, help="Page to start processing at.", default=0)
     parser.add_argument("--images", action="store_true", help="Save images of detected bboxes.", default=False)
-    parser.add_argument("--lang", type=str, help="Language to use for OCR. Comma separate for multiple.", default="en")
+    parser.add_argument("--langs", type=str, help="Language(s) to use for OCR. Comma separate for multiple. Can be a capitalized language name, or a 2-letter ISO 639 code.", default=None)
+    parser.add_argument("--lang_file", type=str, help="Path to file with languages to use for OCR. Should be a JSON dict with file names as keys, and the value being a list of language codes/names.", default=None)
     args = parser.parse_args()

-    # Split and validate language codes
-    langs = args.lang.split(",")
-    for i in range(len(langs)):
-        if langs[i] in LANGUAGE_TO_CODE:
-            langs[i] = LANGUAGE_TO_CODE[langs[i]]
-        if langs[i] not in CODE_TO_LANGUAGE:
-            raise ValueError(f"Language code {langs[i]} not found.")
-
-    det_processor = load_detection_processor()
-    det_model = load_detection_model()
-
-    _, lang_tokens = _tokenize("", langs)
-    rec_model = load_recognition_model(langs=lang_tokens)  # Prune model moes to only include languages we need
-    rec_processor = load_recognition_processor()
+    assert args.langs or args.lang_file, "Must provide either --langs or --lang_file"

     if os.path.isdir(args.input_path):
         images, names = load_from_folder(args.input_path, args.max, args.start_page)
@@ -46,10 +34,28 @@ def main():
         images, names = load_from_file(args.input_path, args.max, args.start_page)
         folder_name = os.path.basename(args.input_path).split(".")[0]

+    if args.lang_file:
+        # We got all of our language settings from a file
+        langs = load_lang_file(args.lang_file, names)
+        for lang in langs:
+            replace_lang_with_code(lang)
+        image_langs = langs
+    else:
+        # We got our language settings from the input
+        langs = args.langs.split(",")
+        replace_lang_with_code(langs)
+        image_langs = [langs] * len(images)
+
+    det_processor = load_detection_processor()
+    det_model = load_detection_model()
+
+    _, lang_tokens = _tokenize("", get_unique_langs(image_langs))
+    rec_model = load_recognition_model(langs=lang_tokens)  # Prune model moe layer to only include languages we need
+    rec_processor = load_recognition_processor()
+
     result_path = os.path.join(args.results_dir, folder_name)
     os.makedirs(result_path, exist_ok=True)

-    image_langs = [langs] * len(images)
     predictions_by_image = run_ocr(images, image_langs, det_model, det_processor, rec_model, rec_processor)

     page_num = defaultdict(int)
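
To make the per-image language handling above concrete, here is a minimal sketch (not from the commit) of driving `run_ocr` directly with a different language list per image, mirroring the `--lang_file` path of `main()`; the image file names are hypothetical.

```
from PIL import Image
from surya.input.langs import replace_lang_with_code
from surya.ocr import run_ocr
from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

images = [Image.open("english_page.png"), Image.open("hindi_page.png")]
image_langs = [["en"], ["Hindi"]]  # one language list per image
for langs in image_langs:
    replace_lang_with_code(langs)  # normalizes names like "Hindi" to ISO codes in place

det_model, det_processor = load_det_model(), load_det_processor()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

predictions_by_image = run_ocr(images, image_langs, det_model, det_processor, rec_model, rec_processor)
```
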

pyproject.toml (+1)

@@ -38,6 +38,7 @@ arabic-reshaper = "^3.0.0"

 [tool.poetry.scripts]
 surya_detect = "detect_text:main"
+surya_ocr = "ocr_text:main"

 [build-system]
 requires = ["poetry-core"]

surya/benchmark/tesseract.py (+2 -1)

@@ -4,6 +4,7 @@
 from surya.settings import settings
 import os
 from concurrent.futures import ProcessPoolExecutor
+from surya.detection import get_batch_size


 def tesseract_bboxes(img):
@@ -23,7 +24,7 @@ def tesseract_bboxes(img):

 def tesseract_parallel(imgs):
     # Tesseract uses 4 threads per instance
-    tess_parallel_cores = min(len(imgs), settings.DETECTOR_BATCH_SIZE)
+    tess_parallel_cores = min(len(imgs), get_batch_size())
     cpus = os.cpu_count()
     tess_parallel_cores = min(tess_parallel_cores, cpus)

surya/input/langs.py (+19)

@@ -0,0 +1,19 @@
+from typing import List
+from surya.languages import LANGUAGE_TO_CODE, CODE_TO_LANGUAGE
+
+
+def replace_lang_with_code(langs: List[str]):
+    for i in range(len(langs)):
+        if langs[i] in LANGUAGE_TO_CODE:
+            langs[i] = LANGUAGE_TO_CODE[langs[i]]
+        if langs[i] not in CODE_TO_LANGUAGE:
+            raise ValueError(f"Language code {langs[i]} not found.")
+
+
+def get_unique_langs(langs: List[List[str]]):
+    uniques = []
+    for lang_list in langs:
+        for lang in lang_list:
+            if lang not in uniques:
+                uniques.append(lang)
+    return uniques
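
For reference, a small sketch (not in the commit) of how these two helpers behave, assuming "English" is a key in `LANGUAGE_TO_CODE` that maps to `"en"`:

```
from surya.input.langs import replace_lang_with_code, get_unique_langs

page_langs = [["English", "hi"], ["en"]]  # one language list per page/image
for langs in page_langs:
    replace_lang_with_code(langs)  # mutates the list in place, e.g. "English" -> "en"

print(page_langs)                    # [["en", "hi"], ["en"]]
print(get_unique_langs(page_langs))  # ["en", "hi"] - deduplicated, order preserved
```
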

surya/input/load.py (+8 -1)

@@ -2,6 +2,7 @@
 import os
 import filetype
 from PIL import Image
+import json


 def get_name_from_path(path):
@@ -57,4 +58,10 @@ def load_from_folder(folder_path, max_pages=None, start_page=None):
         image, name = load_image(path)
         images.extend(image)
         names.extend(name)
-    return images, names
+    return images, names
+
+
+def load_lang_file(lang_path, names):
+    with open(lang_path, "r") as f:
+        lang_dict = json.load(f)
+    return [lang_dict[name].copy() for name in names]
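
A short sketch (not part of the commit) of producing a language file that `load_lang_file` can consume; the file names are hypothetical and must match the names surya derives from your inputs (the README example uses plain filenames as keys).

```
import json

# Keys are input file names, values are lists of language names/codes.
lang_file = {
    "report.pdf": ["English", "hi"],
    "receipt.png": ["en"],
}
with open("languages.json", "w") as f:
    json.dump(lang_file, f, indent=2)

# surya_ocr can then be invoked with: surya_ocr DATA_PATH --lang_file languages.json
```
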

surya/settings.py (+1 -1)

@@ -51,7 +51,7 @@ def TORCH_DEVICE_DETECTION(self) -> str:
     DETECTOR_NMS_THRESHOLD: float = 0.35  # Threshold for non-maximum suppression

     # Text recognition
-    RECOGNITION_MODEL_CHECKPOINT: str = "vikp/rec_test_utf16m"
+    RECOGNITION_MODEL_CHECKPOINT: str = "vikp/text_recognizer_test"
     RECOGNITION_MAX_TOKENS: int = 160
     RECOGNITION_BATCH_SIZE: Optional[int] = None  # Defaults to 8 for CPU/MPS, 256 otherwise
     RECOGNITION_IMAGE_SIZE: Dict = {"height": 196, "width": 896}
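
Since settings can be overridden with environment variables (per the README note above), here is a sketch of overriding the recognition batch size and checkpoint from Python. It assumes the environment is read when `surya.settings` is first imported, so the overrides must be set before any surya import.

```
import os

# Assumed override mechanism: set env vars before surya reads its settings.
os.environ["RECOGNITION_BATCH_SIZE"] = "512"  # ~40MB of VRAM per batch item, per the README
os.environ["RECOGNITION_MODEL_CHECKPOINT"] = "vikp/text_recognizer_test"  # value set in this commit

from surya.settings import settings

print(settings.RECOGNITION_BATCH_SIZE, settings.RECOGNITION_MODEL_CHECKPOINT)
```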
