DeepEval is an open-source evaluation framework for LLMs. It lets you "unit test" LLM outputs in a Pytest-like fashion with 14+ LLM-evaluated metrics backed by research.
In this tutorial, you'll learn how to use DeepEval with Vertex AI.
By default, DeepEval uses OpenAI, but it can be configured to use any LLM, as explained in the Using a custom LLM docs. To use a custom LLM, you need to inherit from the DeepEvalBaseLLM class and implement its methods, such as get_model_name, load_model, and generate, with your LLM.
There's a Google VertexAI Example that shows how to implement DeepEval for Vertex AI. However, it unnecessarily uses LangChain and the implementation seems a little lacking.
Instead, I created two implementations of DeepEval for Vertex AI in the vertex_ai folder:
- google_vertex_ai.py contains the GoogleVertexAI class, which implements DeepEval directly with the Vertex AI library.
- google_vertex_ai_langchain.py contains the GoogleVertexAILangChain class, which implements DeepEval via LangChain on top of the Vertex AI library.
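For reference, here is a minimal sketch of what such an implementation could look like. This is not the actual google_vertex_ai.py: the import paths, the constructor, and the async a_generate method are assumptions based on the DeepEvalBaseLLM interface and the Vertex AI SDK.

```python
# Minimal sketch, not the actual google_vertex_ai.py: assumes the
# vertexai.generative_models API and DeepEval's DeepEvalBaseLLM interface.
import vertexai
from vertexai.generative_models import GenerativeModel
from deepeval.models import DeepEvalBaseLLM


class GoogleVertexAI(DeepEvalBaseLLM):
    def __init__(self, model_name: str, project: str, location: str):
        # Initialize the Vertex AI SDK for the given project and region.
        vertexai.init(project=project, location=location)
        self.model_name = model_name
        super().__init__(model_name)

    def load_model(self):
        # Return the underlying Vertex AI model instance.
        return GenerativeModel(self.model_name)

    def generate(self, prompt: str) -> str:
        # Synchronous generation used by DeepEval metrics.
        return self.load_model().generate_content(prompt).text

    async def a_generate(self, prompt: str) -> str:
        # Asynchronous generation, used when metrics run concurrently.
        response = await self.load_model().generate_content_async(prompt)
        return response.text

    def get_model_name(self) -> str:
        return self.model_name
```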
To use the GoogleVertexAI class, you simply need to specify the model name, your project ID, and the location:
from vertex_ai.google_vertex_ai import GoogleVertexAI

model = GoogleVertexAI(model_name="your-model-name",
                       project="your-project-id",
                       location="us-central1")
Then, you can pass the model to the metric you want to use in your tests:
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    model=model,
    threshold=0.5)
The same applies to the GoogleVertexAILangChain class. Let's look at some test cases.
The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input.
To test answer relevancy with Vertex AI, take a look at test_answer_relevancy.py.
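If you want a rough idea of the test without opening the repository, here is a hypothetical sketch; the actual test_answer_relevancy.py may use a different test case, model name, and project ID.

```python
# Hypothetical sketch of test_answer_relevancy.py; the file in the repository
# may differ. Uses DeepEval's Pytest integration (assert_test) with a Vertex AI
# model as the evaluation model.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from vertex_ai.google_vertex_ai import GoogleVertexAI

model = GoogleVertexAI(model_name="gemini-1.5-flash-001",
                       project="your-project-id",
                       location="us-central1")


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="Why is the sky blue?",
        actual_output=(
            "The sky looks blue because molecules in the atmosphere scatter "
            "blue light from the sun more than they scatter red light."
        ),
    )
    metric = AnswerRelevancyMetric(model=model, threshold=0.5)
    assert_test(test_case, [metric])
```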
Run it:
deepeval test run test_answer_relevancy.py
You should get a nice report on the outcome:
Test Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test case ┃ Metric ┃ Score ┃ Status ┃ Overall Success Rate ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ test_answer_relevancy │ │ │ │ 100.0% │
│ │ Answer Relevancy │ 1.0 (threshold=0.5, evaluation model=gemini-1.5-flash-001, reason=The score is 1.00 because the response perfectly addresses the input, │ PASSED │ │
│ │ │ providing a clear and concise explanation for why the sky is blue! Keep up the great work!, error=None) │ │ │
│ │ │ │ │ │
└─────────────────────────────────┴──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────┘
The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text.
To test summarization with Vertex AI, take a look at test_summarization.py.
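Again, a hypothetical sketch of what test_summarization.py might contain: the input holds the original text and actual_output holds the summary being judged, but the exact texts below are illustrative.

```python
# Hypothetical sketch of test_summarization.py; the file in the repository may
# differ. input is the original text, actual_output is the summary to judge.
from deepeval import assert_test
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

from vertex_ai.google_vertex_ai import GoogleVertexAI

model = GoogleVertexAI(model_name="gemini-1.5-flash-001",
                       project="your-project-id",
                       location="us-central1")


def test_summarization():
    original_text = (
        "The coverage score is calculated as the percentage of assessment "
        "questions for which both the summary and the original document "
        "provide a 'yes' answer."
    )
    summary = (
        "The coverage score is based on assessment questions answered by "
        "both the summary and the original document."
    )
    test_case = LLMTestCase(input=original_text, actual_output=summary)
    metric = SummarizationMetric(model=model, threshold=0.5)
    assert_test(test_case, [metric])
```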
Run it:
deepeval test run test_summarization.py
Result:
Test Results
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test case ┃ Metric ┃ Score ┃ Status ┃ Overall Success Rate ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ test_summarization │ │ │ │ 100.0% │
│ │ Summarization │ 0.67 (threshold=0.5, evaluation model=gemini-1.5-flash-001, reason=The score is 0.67 because the summary fails to answer a question that the original text can │ PASSED │ │
│ │ │ answer. It is not clear from the summary whether the coverage score is calculated based on the percentage of 'yes' answers, but this information is present in │ │ │
│ │ │ the original text., error=None) │ │ │
└────────────────────┴───────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────┘
The hallucination metric determines whether your LLM generates factually correct information by comparing the actual_output to the provided context.
To test hallucination with Vertex AI, take a look at test_hallucination.py.
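Here is a hypothetical sketch of test_hallucination.py with both a passing and a deliberately failing case, matching the two test names in the report below; the exact inputs and context are illustrative.

```python
# Hypothetical sketch of test_hallucination.py; the file in the repository may
# differ. The hallucination metric compares actual_output against the context.
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

from vertex_ai.google_vertex_ai import GoogleVertexAI

model = GoogleVertexAI(model_name="gemini-1.5-flash-001",
                       project="your-project-id",
                       location="us-central1")

context = ["Paris is the capital of France."]


def test_hallucination():
    # Output agrees with the context, so the hallucination score should be low.
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        context=context,
    )
    metric = HallucinationMetric(model=model, threshold=0.5)
    assert_test(test_case, [metric])


def test_hallucination_fails():
    # Output contradicts the context, so this test is expected to fail.
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is London.",
        context=context,
    )
    metric = HallucinationMetric(model=model, threshold=0.5)
    assert_test(test_case, [metric])
```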
Run it:
deepeval test run test_hallucination.py
Result:
Test Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test case ┃ Metric ┃ Score ┃ Status ┃ Overall Success Rate ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ test_hallucination │ │ │ │ 100.0% │
│ │ Hallucination │ 0.0 (threshold=0.5, evaluation model=gemini-1.5-flash-001, reason=The score is 0.00 because the output aligns with the provided context, indicating no │ PASSED │ │
│ │ │ hallucinations., error=None) │ │ │
│ │ │ │ │ │
│ test_hallucination_fails │ │ │ │ 0.0% │
│ │ Hallucination │ 1.0 (threshold=0.5, evaluation model=gemini-1.5-flash-001, reason=The score is 1.00 because the output contradicts a key fact from the context, stating │ FAILED │ │
│ │ │ that London is the capital of France, which is incorrect. This indicates a significant hallucination., error=None) │ │ │
└──────────────────────────┴───────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────┘
DeepEval has more metrics, which you can check out on its Metrics page.