This project provides a web application to compare responses from multiple Large Language Models (LLMs) side by side, with automated analysis. It generates a comprehensive HTML report containing the models' answers to a given query, along with similarity metrics and visualizations.
This project uses two different licenses for different parts. Please see Licenses below.
- Self-Consistency Improves Chain Of Thought Reasoning in Language Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Qwen Technical Report
- Multi-Model Query: Ask a question once and get responses from multiple LLMs.
- Automated Metrics: Computes cosine similarity between response embeddings to find the most consistent answer (see the sketch after this list), plus BLEU and ROUGE-L scores that use the most consistent answer from each model as the reference.
- Visualizations: Includes PCA plots of embeddings, similarity distribution histograms, heatmaps, and summary bar charts.
- Interactive Web UI: React frontend with real-time progress updates as models generate responses.
- Reports Archive: Each run produces an HTML report saved in a persistent volume (the reports/ directory) for later viewing.
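The consistency metric is implemented in backend/model_comparison.py; as a rough, standalone sketch of the idea (illustrative names, not the project's actual code), the most consistent response can be taken as the one whose embedding has the highest average cosine similarity to the others:

```python
import numpy as np

def most_consistent_index(embeddings: np.ndarray) -> int:
    """Pick the response whose embedding is, on average, closest to all others.

    embeddings: array of shape (n_responses, embedding_dim).
    Returns the index of the most "central" (most consistent) response.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # pairwise cosine similarity matrix
    np.fill_diagonal(sims, 0.0)       # ignore self-similarity
    mean_sims = sims.sum(axis=1) / (len(sims) - 1)
    return int(np.argmax(mean_sims))
```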
- Frontend: React app (SPA) that connects via WebSocket to the backend for live updates. Users input the query and model names here.
- Backend: FastAPI server that orchestrates calls to the LLM engine (Ollama) and computes metrics. Exposes a WebSocket endpoint for progress and serves the final reports.
- LLM Engine: Ollama is used to run LLMs and embedding models locally (supports CPU, Apple MPS, and NVIDIA GPUs); a sketch of the HTTP calls the backend makes to it follows this list.
- Containerization: Docker Compose is used to containerize the frontend, backend, and Ollama services for easy deployment across different environments.
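The calls to Ollama boil down to a generation request and an embedding request against its local HTTP API. The snippet below is an illustrative standalone example, not the backend's actual code; in particular, the embedding model name is a placeholder and may differ from what the project uses:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # "http://ollama:11434" inside Docker Compose

# Generate a single (non-streaming) response from a model.
gen = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "qwen:1.8b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
answer = gen.json()["response"]

# Embed the answer so it can be compared with other models' answers.
# The embedding model name below is only an example.
emb = requests.post(
    f"{OLLAMA_URL}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": answer},
    timeout=300,
)
embedding = emb.json()["embedding"]
print(len(embedding), "dimensions")
```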
After cloning, your repository should have the following structure:
llm-comparative-analysis/
├── .gitignore               # Ignore node_modules, build artifacts, etc.
├── README.md
├── docker-compose.yml
├── reports/                 # Persisted HTML reports (empty initially)
├── backend/
│   ├── Dockerfile.backend
│   ├── app.py
│   ├── model_comparison.py
│   └── requirements.txt
├── docker/
│   ├── Dockerfile.ollama
│   └── entrypoint.sh
└── frontend/
    ├── Dockerfile.frontend
    ├── package.json
    ├── package-lock.json    # Not committed; generated by npm install
    ├── node_modules/        # Not committed; generated by npm install
    ├── public/
    │   └── index.html
    └── src/
        ├── App.js
        └── index.js
Note: The node_modules/ folder and package-lock.json are not committed (both are listed in .gitignore).
- Docker (or Podman with podman-compose) installed on your system.
- Sufficient system resources to run the chosen LLMs. (For heavy models, a GPU is recommended: Apple M1/M2 or NVIDIA GPU with CUDA.)
- Ensure your system has internet access for downloading Docker images and models.
- (Optional) If using BLEU metrics, note that NLTK data may be downloaded on the first run (a manual pre-download snippet follows this list).
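If that automatic download is a problem (for example, on a machine without reliable internet access at run time), the NLTK data can be fetched ahead of time. Which corpus the backend actually needs is not specified here, so `punkt` below is an assumption:

```python
import nltk

# Pre-download tokenizer data so the first report run does not need to fetch it.
# "punkt" is an assumption; use whichever package the NLTK error message names.
nltk.download("punkt")
```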
Follow these steps after cloning the repository:
git clone https://github.com/isaacharlem/llm-comparative-analysis.git
cd llm-comparative-analysis
Navigate to the frontend directory and install the dependencies.
cd frontend
npm install
This will generate a consistent package-lock.json. (The node_modules/ folder is not committed.)
Return to the repository root and set executable permission on the entrypoint script:
cd ..
chmod +x docker/entrypoint.sh
From the repository root, run:
docker-compose up --build
This command will:
- Build the Frontend: Using Dockerfile.frontend to install Node.js dependencies, build the React app, and serve it via Nginx.
- Build the Backend: Using Dockerfile.backend to install Python dependencies and run the FastAPI app.
- Build the Ollama Service: Using docker/Dockerfile.ollama with the provided entrypoint.sh script, which pulls the required models (smollm:135m, deepseek-r1:1.5b, qwen:1.8b) and then starts the Ollama server.
Once the containers are running:
- Frontend: Open your browser at http://localhost:8080 (for production build via Nginx) or http://localhost:3000 if using the development server.
- Backend: The FastAPI backend (with WebSocket and API endpoints) runs on http://localhost:8000.
- Ollama Service: Runs on http://localhost:11434 (a quick way to check which models it has pulled is shown below).
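To confirm the Ollama container has finished pulling its models, you can list them via Ollama's /api/tags endpoint (a small standalone check, not part of the project):

```python
import requests

# Lists the models currently available in the local Ollama instance.
resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])
```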
On the frontend page:
- Enter a Query: Type your question or prompt.
- List Models: Provide model identifiers as a comma-separated list (e.g., deepseek-r1:1.5b, qwen:1.8b). Ensure these models match those pulled by the Ollama service. Note: If you want to test Ollama models other than deepseek-r1:1.5b, qwen:1.8b, or phi3:mini, you must also update the pull commands in docker/entrypoint.sh.
- Set Number of Responses: Choose the number of responses each model should generate.
- Reference Answer (Optional): Enter a reference answer to compute BLEU and ROUGE-L scores (a scoring sketch follows this list).
- Click Generate Report. Real-time progress will be shown via WebSocket updates. Once completed, the final report (an HTML page) will be displayed in an embedded frame and saved in the reports/ directory.
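The actual scoring lives in backend/model_comparison.py; for intuition, a standalone sketch of BLEU and ROUGE-L against a reference answer might look like the following (the nltk and rouge-score packages and the smoothing choice are assumptions, not necessarily what the backend uses):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The sky appears blue because of Rayleigh scattering."
candidate = "Rayleigh scattering of sunlight makes the sky look blue."

# BLEU compares n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures the longest common subsequence between the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rouge_l:.3f}")
```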
To stop all services, press Ctrl+C in the terminal running Docker Compose or run:
docker-compose down
The generated reports will remain in the reports/ directory.
- Model Configuration: To change the models being pulled, update the entrypoint script in docker/entrypoint.sh. For example, modify the commands to pull different models.
- NLP Metrics & Visualizations: The backend (in backend/model_comparison.py) computes cosine similarity, BLEU, and ROUGE-L metrics, and generates various plots. You can extend these metrics by editing that file.
- GPU Support: For NVIDIA GPU support, ensure Docker is configured with the NVIDIA Container Toolkit. You can adjust GPU settings in docker-compose.yml if needed. For Apple M1/M2, Docker Desktop should automatically handle hardware acceleration.
- Default 404 on Root: The FastAPI backend is API-only and may return 404 for the root path. You can add a simple route in backend/app.py if desired (see the sketch after this list).
- Ollama Service: If the Ollama service fails to start, try re-running chmod +x docker/entrypoint.sh and then docker-compose up --build.
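For example, a minimal root route could look like the following (shown as a self-contained app for clarity; in backend/app.py you would only add the route to the existing FastAPI instance):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    # Simple landing response so the root path no longer returns 404.
    return {"status": "ok", "docs": "/docs"}
```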
model_comparison.py is derived from code originally developed by Isaac Harlem at the UChicago Data Science Institute and is licensed under the BSD 3-Clause License (see LICENSE-BSD), Copyright © 2024, UChicago Data Science Institute.
This statement does not imply any official endorsement by the UChicago Data Science Institute.
All other files in this repository are authored by Isaac Harlem and are licensed under the MIT License (see LICENSE-MIT).