README.md (+11 −5)

@@ -1,4 +1,10 @@
-# groqeval
+<h1 align="center">
+GroqEval.
+</h1>
+<br>
+
+---
+
GroqEval is a powerful and easy-to-use evaluation framework designed specifically for language model (LLM) performance assessment. Utilizing the capabilities of Groq API, GroqEval provides developers, researchers, and AI enthusiasts with a robust set of tools to rigorously test and measure the relevance and accuracy of responses generated by language models.
## Getting Started
@@ -10,7 +16,7 @@ pip install groqeval
Initialising an evaluator.
```python
-from groqeval.evaluate import GroqEval
+from groqeval import GroqEval
evaluator = GroqEval(api_key=API_KEY)
```
The evaluator is the central orchestrator that initializes the metrics.
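To make the "central orchestrator" idea concrete, here is a minimal, hypothetical sketch of the pattern such an evaluator might use: a name-to-class lookup that initializes the requested metric. The class and parameter names (`Relevance`, `prompt`, `output`) are illustrative assumptions, not GroqEval's actual internals.

```python
# Hypothetical sketch of an evaluator that initializes metrics by name.
# Names are illustrative assumptions, not GroqEval's real API.
class Relevance:
    def __init__(self, prompt, output):
        self.prompt = prompt
        self.output = output

class Evaluator:
    METRICS = {"relevance": Relevance}

    def __call__(self, metric_name, **kwargs):
        # Look up the metric class by name and construct it with the inputs.
        return self.METRICS[metric_name](**kwargs)

evaluator = Evaluator()
metric = evaluator(
    "relevance",
    prompt="What are the key benefits of using renewable energy?",
    output="Renewable energy lowers emissions.",
)
```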
@@ -83,10 +89,10 @@ where n is the number of statements from the context evaluated. This method prov
query = "What are the key benefits of using renewable energy?"

retrieved_context = [
-    "Renewable energy sources such as solar and wind power significantly reduce greenhouse gas emissions.",
-    "The use of renewable energy can decrease reliance on fossil fuels and promote energy independence."
+    "Increasing use of renewable energy sources is crucial for sustainable development.",
+    "Solar power and wind energy are among the most efficient renewable sources."
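The hunk header above mentions averaging over the n statements evaluated from the context. A minimal sketch of that averaging, assuming a per-statement score dictionary shape chosen here purely for illustration:

```python
# Sketch of the averaging described above: the final score is the mean of the
# per-statement scores, where n is the number of statements evaluated.
# The dictionary shape is an assumption for illustration.
def average_score(scored_statements):
    n = len(scored_statements)
    if n == 0:
        return 0.0
    return sum(s["score"] for s in scored_statements) / n

scored = [
    {"string": "Increasing use of renewable energy sources is crucial for sustainable development.", "score": 9},
    {"string": "Solar power and wind energy are among the most efficient renewable sources.", "score": 7},
]
print(average_score(scored))  # 8.0
```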
-    "Please process the following output from a language model and decompose it into individual phrases or chunks. "
-    "For each phrase or chunk, evaluate whether it can be considered a statement based on its form as a declarative construct "
-    "that communicates information, opinions, or beliefs. A phrase should be marked as a statement (true) if it forms a clear, standalone declaration. "
-    "Phrases that are overly vague, questions, or merely connective phrases without any declarative content should be marked as not statements (false). "
-    "Return the results in a JSON format. The JSON should have an array of objects, each representing a phrase with two properties: "
-    "a 'string' that contains the phrase text, and a 'flag' that is a boolean indicating whether the text is considered a statement (true) or not (false).\n"
-    f"Use the following JSON schema for your output: {json_representation}"
+    "Please process the following output from a language model and "
+    "decompose it into individual phrases or chunks. For each phrase or "
+    "chunk, evaluate whether it can be considered a statement based on its "
+    "form as a declarative construct that communicates information, opinions, "
+    "or beliefs. A phrase should be marked as a statement (true) if it forms "
+    "a clear, standalone declaration. Phrases that are overly vague, questions, "
+    "or merely connective phrases without any declarative content should be marked "
+    "as not statements (false). Return the results in a JSON format. The JSON should "
+    "have an array of objects, each representing a phrase with two properties: a "
+    "'string' that contains the phrase text, and a 'flag' that is a boolean indicating "
+    "whether the text is considered a statement (true) or not (false).\nUse the following "
+    f"JSON schema for your output: {json_representation}"
)
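A hedged sketch of consuming a response that follows the shape this prompt requests: an array of objects, each with a `'string'` and a boolean `'flag'`. The response text below is fabricated for illustration.

```python
import json

# Fabricated example of a decomposition response in the requested shape:
# an array of {"string": ..., "flag": ...} objects.
response_text = """
[
  {"string": "Solar power reduces greenhouse gas emissions.", "flag": true},
  {"string": "And on top of that,", "flag": false},
  {"string": "Wind energy promotes energy independence.", "flag": true}
]
"""

phrases = json.loads(response_text)
# Keep only the chunks the model flagged as standalone statements.
statements = [p["string"] for p in phrases if p["flag"]]
print(statements)
```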
    @property
    def relevance_prompt(self):
+       """
+       Prompt for scoring the relevance of each statement in the output with respect to the prompt.
+       """
        return (
            f"Given the prompt: '{self.prompt}', evaluate the relevance of the following statements. "
            "Score each coherent statement on a scale from 1 to 10, where 1 means the statement is completely irrelevant to the prompt, "
            "and 10 means it is highly relevant. Ensure that the full range of scores is utilized, not just the two extremes, "
-           "to prevent the scoring from being binary in nature. Make sure that anything relevant to the prompt should score over 5."
-           "Include a rationale for each score to explain why the statement received that rating. "
+           "to prevent the scoring from being binary in nature. Make sure that anything relevant to the prompt should score over 5. "
+           "Include a rationale for each score to explain why the statement received that rating. "
            f"Use the following JSON schema for your output: {json.dumps(ScoredOutput.model_json_schema(), indent=2)}"
        )
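A hedged sketch of a response in the scored-output shape this prompt asks for: each statement carries a 1-10 score plus a rationale. The field names here are assumptions for illustration; the real schema comes from `ScoredOutput.model_json_schema()`.

```python
import json

# Fabricated example of a scored-output response: each statement has a
# 1-10 score and a rationale. Field names are assumptions.
response_text = """
{"scores": [
  {"string": "Renewable energy reduces emissions.", "score": 9,
   "rationale": "Directly answers the prompt."},
  {"string": "The report was published on a Tuesday.", "score": 2,
   "rationale": "Unrelated to the prompt."}
]}
"""

scored = json.loads(response_text)["scores"]
# Per the prompt, anything relevant to the prompt should score over 5.
relevant = [s["string"] for s in scored if s["score"] > 5]
print(relevant)
```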

    def output_decomposition(self):
+       """
+       Decomposes the output into individual phrases or chunks.
+       Each phrase or chunk is evaluated to determine if it can be considered a statement.
+       A "statement" is defined as a clear, standalone declarative construct that
+       communicates information, opinions, or beliefs effectively.
+       """
-    f"Given the prompt provided to the language model: `{self.prompt}`, please process the following output generated. Please analyze the output and decompose it into individual phrases or chunks. "
-    "For each phrase or chunk, evaluate whether it can be considered an opinion. Opinions can range from explicit statements like 'X is better than Y' to subtler expressions that might arise from the context of the prompt, such as responses to 'What makes a good CEO?' which inherently suggest personal beliefs or preferences. "
-    "Mark a phrase as an opinion (true) if it contains a clear, standalone opinionated statement, whether explicit or implied. "
-    "Phrases that are factual statements, questions, or merely connective phrases without any opinionated content should be marked as not opinions (false). "
-    "Return the results in a JSON format. The JSON should contain an array of objects, each representing a phrase with two properties: "
-    "a 'string' that contains the phrase text, and a 'flag' that is a boolean indicating whether the text is considered an opinion (true) or not (false).\n"
-    f"Use the following JSON schema for your output: {json_representation}"
+    f"Given the prompt provided to the language model: '{self.prompt}', analyze the "
+    "output and decompose it into individual phrases or chunks. Evaluate each phrase "
+    "or chunk to determine if it can be considered an opinion. Opinions range from "
+    "explicit statements like 'X is better than Y' to subtler expressions from the "
+    "prompt context, such as responses to 'What makes a good CEO?'. These suggest "
+    "personal beliefs or preferences. Mark a phrase as an opinion (true) if it "
+    "contains a clear, standalone opinionated statement, whether explicit or implied. "
+    "Phrases that are factual, questions, or merely connective without opinionated "
+    "content should be marked as not opinions (false). Return the results in JSON. "
+    "This JSON should contain an array of objects, each representing a phrase with "
+    "two properties: a 'string' that contains the phrase text, and a 'flag' that is "
+    "a boolean indicating whether the text is considered an opinion (true) or not "
+    "(false). Use the following JSON schema for your output:"
+    f"{json_representation}"
)

    @property
    def bias_prompt(self):
+       """
+       Scoring the bias of each opinion in the output with respect to the prompt.
+       """