Use any vision LLM to perform OCR with LangChain

a-klos/langchain-ocr

LangChain-OCR

LangChain-OCR converts PDFs and image files into Markdown using vision LLMs. The project has two main components: the OCR library (usable via CLI) and a FastAPI backend that provides an interface for file uploads and processing.

Table of Contents

  1. Overview
  2. Features
  3. Installation
    1. Prerequisites
    2. Cloning & Environment Setup
  4. Usage
    1. CLI
    2. FastAPI Server
    3. Docker Compose Deployment
  5. Contributing
  6. License
  7. Contact

1. Overview

LangChain-OCR uses vision LLMs to convert PDFs and images (JPEG, PNG) into Markdown. It can be run directly from the CLI or through an asynchronous FastAPI interface, which suits both developers and end users.

2. Features

  • File Conversion: Convert PDFs and images (JPEG, PNG) to Markdown.
  • Extensible Design: Customize converters and language models through dependency injection with Inject.
  • Modern API: Asynchronous processing built on FastAPI.
  • Observability: Integrated tracing via Langfuse.
  • Multilingual Support: Configurable language settings.
  • LLM Integration: Supports Ollama, vLLM, and OpenAI, with room for other providers.
  • Containerization: Ready-to-use Docker and Docker Compose configurations.
  • CLI Access: Quick OCR processing through the command line.

3. Installation

3.1 Prerequisites

  • Python: 3.11 or higher (refer to api/.python-version)
  • Dependency Manager: Poetry
  • Docker & Docker Compose: For containerized deployment

3.2 Cloning & Environment Setup

Clone the repository and configure your environment:

git clone https://github.com/a-klos/langchain-ocr.git
cd langchain-ocr
cp .env.template .env

Edit the .env file as necessary to adjust language settings, model configuration, and endpoints.
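
Because .env starts as a copy of .env.template, one quick sanity check after editing is to compare the keys in the two files. The sketch below is a minimal illustration that assumes plain KEY=VALUE lines; the helper names parse_env, missing_keys, and report are hypothetical and are not part of LangChain-OCR.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(template_text: str, env_text: str) -> list[str]:
    """Return keys defined in the template but absent from the env file."""
    env = parse_env(env_text)
    return [key for key in parse_env(template_text) if key not in env]


def report(template_path: str = ".env.template", env_path: str = ".env") -> None:
    """Print every template key that the env file does not define."""
    template = Path(template_path).read_text(encoding="utf-8")
    current = Path(env_path).read_text(encoding="utf-8")
    for key in missing_keys(template, current):
        print(f"missing in {env_path}: {key}")
```

Calling report() from the repository root lists any template keys that were deleted from .env by accident. It does not validate the values themselves.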

4. Usage

LangChain-OCR can be used in three ways:

4.1 CLI

For quick OCR tasks via the command line, see the CLI documentation.

4.2 FastAPI Server

Launch the FastAPI backend to access OCR functionality through a RESTful API. Detailed instructions are provided in the FastAPI README.
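
As a rough illustration of what a client-side upload might look like, the sketch below builds a multipart/form-data POST request with the Python standard library. It assumes the backend accepts a multipart file upload; the form field name "file" and the helper names are hypothetical, and the real endpoint path and parameters are documented in the FastAPI README, not here.

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def make_upload_request(url: str, filename: str, data: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the file."""
    # "file" is an assumed form field name; check the FastAPI README.
    body, header = build_multipart("file", filename, data)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
```

Passing the returned request to urllib.request.urlopen sends it. The URL must point at whatever upload route the running backend actually exposes.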

4.3 Docker Compose Deployment

Deploy the entire stack with Docker Compose:

  1. Install Docker Compose:
    Follow the installation guide.

  2. Build & Run Containers:
    In the repository root, execute:

    docker compose up --build
  3. Pull a Vision-Capable Model:
    Pull the model named in your configuration (e.g., gemma3:4b-it-q4_K_M) so that the two match:

    ollama pull <model_name>
  4. Access the Services:

  5. Stop Containers:
    When done, clean up with:

    docker compose down
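
Between starting the stack and sending files to it, it can help to wait until a service actually accepts connections, since containers may take a moment to become ready. The following is a minimal sketch of such a readiness check; the function name wait_for_port is hypothetical, and the host and port must be taken from your own Docker Compose configuration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful TCP connection only shows that the port is open. It does not confirm that the model has been pulled or that the API is ready to process files.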

5. Contributing

Contributions, bug reports, and feature suggestions are welcome. See CONTRIBUTING.md for details on how to get involved.

6. License

Licensed under the MIT License. Refer to the LICENSE file for more information.

7. Contact

For questions, issues, or suggestions, please open an issue on GitHub or contact the maintainer at aklos.ocr@gmail.com.
