[Draft] Add math benchmarks #1570
Open
hallerite wants to merge 137 commits into master from feat/benchmarks
137 commits
0a6458b
feat(benchmark): initial GSM8K benchmark setup
hallerite 2421a17
feat(benchmark): initial MATH benchmark setup
hallerite 0e8a480
add new base class for math envs
hallerite 46acc02
pass mode explicitly
hallerite 34d7915
handle data loading and shuffling in base class
hallerite eace104
feat: add preprocessing step for dataset
hallerite bfccedf
fix: update gsm8k to new math_benchmark base
hallerite f1b7085
feat: add MATH benchmark
hallerite b10806a
fix: update __init__.py
hallerite c88bce1
fix: Deepseek r1 reasoning content (#1523)
Wendong-Fan 6beebb8
feat: Support OpenAI o3 mini (#1533)
Wendong-Fan 2c2ec6c
fix: WolframAlpha result from all subpods (#1534)
Wendong-Fan ae3203e
docs:data distillation pipline for distilling high-quality maths reas…
zjrwtx 3c695aa
chore: Qwen datagen cookbook add graph (#1530)
MuggleJinx 14bc56a
feat: Integrate moonshot models to camel (#1526)
GitHoobar 31e54cc
docs: update human cookbook using qwen model (#1555)
MuggleJinx f4b5452
docs: Add Self-Improving Math Reasoning Data Distillation (#1556)
Wendong-Fan 22b141e
feat: Integrate siliconflow model (#1541)
Asher-hss a239505
chore: Update supported models (#1557)
Wendong-Fan ee9b32b
chore: Support Gemini 2.0 flash thinking and pro (#1558)
Wendong-Fan 00f2db3
chore: Rename tool call instances (#1492)
JINO-ROHIT ba31cba
feat: Add SemanticScholarToolkits to integrate Semantic Scholar to ca…
renxinxing123 6faa9ee
enhance: Semantic Scholar (#1562)
Wendong-Fan 8714944
release: v0.2.20a0 (#1564)
Wendong-Fan 6b3c4f1
release: 0.2.20a1 (#1566)
Wendong-Fan 4a3e64b
feat: Waleed Sympy integration (#1445)
waleedalzarooni 47ed674
feat: implementing STaR: Self Taught Reasoner (#1478)
GitHoobar 636169e
chore: Update message setting (#1506)
Wendong-Fan 9d9c3be
chore: Issue&PR template and dependency update (#1572)
Wendong-Fan 311d5cb
docs(cookbooks): Fix Unused Custom Tool in Tools Cookbook (issue-1573…
coolbeevip 29747d9
docs(cookbooks): Fix the DeprecationWarning from Composio (issue-1575…
coolbeevip c49fbe3
feat: Internal deduplication impl. (#1568)
keli-wen 2c2064f
feat: Integrate AIML model platform (#1580)
Wendong-Fan 89ae633
release: v0.2.20 (#1581)
Wendong-Fan c1a6e27
refactor: Internal deduplication (#1579)
Wendong-Fan 5a0f6bb
ensured docstrings are raw strings
apokryphosx f734e33
fixed import errors
apokryphosx 79f0950
chore: update unstructured io version (#1586)
Wendong-Fan a99ec50
fix: Subprocess interpreter don't have output (#1590)
Wendong-Fan 117d920
feat: URL Handling in OpenAIEmbedding to Align with OpenAIModel (#1593)
coolbeevip 64eb90a
chore: Small enhancement and fix (#1599)
Wendong-Fan 5739286
feat: adding MinerU Extractor (#1529)
GitHoobar 0a82b63
feat(datagen): Enhance JSON Dump in SelfInstructPipeline for Non-ASCI…
coolbeevip d8526db
fix: Remove <think> content in model request (#1613)
Wendong-Fan b5aa56f
docs: Correct hyperlink for Hackathon Judge Committee (#1617)
1sarthakbhardwaj 7d1de80
docs(cookbooks): Fix incorrect URL in self_instruct_data_generation.i…
coolbeevip 4c49622
feat: custom prompt in graph agent (#1605)
NeilJohnson0930 1d2e6e0
feat: Hybrid Retrieval (#1398)
yiyiyi0817 32b42cc
feat: Waleed Networkx integration (#1456)
waleedalzarooni a533d37
Implemented Unit tests for math_bench and gsm8k
apokryphosx a34e15e
implemented examples for gsm8k and math_bench
apokryphosx 5b6e3af
Fixing code style to pass checks
apokryphosx 6ea335a
code style fixing
apokryphosx 1c82733
Added download tests
apokryphosx 3ae7211
fixed code style to pass tests
apokryphosx dc1a759
fixed code style to pass tests
apokryphosx a2f4b7b
fix: cleaning up messy branch
apokryphosx 212fb0e
fix: Clean up messy branch
apokryphosx 7905270
fix: Clean up messy branch
apokryphosx 429e103
fix: Clean up messy branch
apokryphosx 62b526d
fix: Clean up messy branch
apokryphosx 1a1e0d5
fix: Clean up messy branch
apokryphosx 18c205e
fix: Clean up messy branch
apokryphosx 33aff5f
fix: Clean up messy branch
apokryphosx 7f5211c
fix: Clean up messy branch
apokryphosx ced3595
fix: Clean up messy branch
apokryphosx ebb7096
fix: Clean up messy branch"
apokryphosx 43c1f79
fix: Clean up messy branch
apokryphosx 21fc10a
fix: Clean up messy branch
apokryphosx df67dfd
fix: Clean up messy branch
apokryphosx d66e09c
fix: Clean up messy branch
apokryphosx 7486ab9
fix: Clean up messy branch
apokryphosx 568879e
fix: Clean up messy branch
apokryphosx d2949fa
fix: Clean up messy branch
apokryphosx dee1c6b
fix: Clean up messy branch
apokryphosx 09c55f7
fix: Clean up messy branch
apokryphosx 935d158
fix: Clean up messy branch
apokryphosx 407ceab
fix: Clean up messy branch
apokryphosx 0264a9e
fix: Clean up messy branch
apokryphosx 99dbba2
fix: Clean up messy branch
apokryphosx 445fde1
fix: Clean up messy branch
apokryphosx 004ca9f
fix: Clean up messy branch
apokryphosx 56cbc67
fix: Clean up messy branch
apokryphosx fb34d40
fix: Clean up messy branch
apokryphosx 6cd1515
fix: Clean up messy branch
apokryphosx 74ba9f8
fix: Clean up messy branch
apokryphosx fdcff44
fix: Clean up messy branch
apokryphosx d25cb63
fix: Clean up messy branch
apokryphosx da40abc
fix: Clean up messy branch
apokryphosx 2e89a9e
fix: Clean up messy branch
apokryphosx 2f611f6
fix: Clean up messy branch
apokryphosx 4972158
fix: Clean up messy branch
apokryphosx 8c8ac4a
fix: Clean up messy branch
apokryphosx 0f6820b
fix: Clean up messy branch
apokryphosx b07701d
fix: Clean up messy branch
apokryphosx daf11b1
fix: Clean up messy branch
apokryphosx a418207
fix: Clean up messy branch
apokryphosx e8c8fb1
fix: Clean up messy branch
apokryphosx 3ebeb23
fix: Clean up messy branch
apokryphosx 02315e8
fix: Clean up messy branch
apokryphosx 2788ce1
fix: Clean up messy branch
apokryphosx 84c8bf6
fix: Clean up messy branch
apokryphosx e40b8cb
fix: Clean up messy branch
apokryphosx e3b100c
feat: Add unit tests for GSM8K and Math Benchmark
apokryphosx 2cade5c
feat: Add examples for GSM8K and Math Benchmark
apokryphosx 3d8cdc7
style: Fix docstrings, typing and line lengths
apokryphosx 964513b
Merge branch 'master' into feat/benchmarks
Wendong-Fan 094aed7
chore: add license
hallerite 8f3f063
chore: use camel logging
hallerite ee4d952
fix: use lazy imports
hallerite a412ee2
fix: add "valid" to input validation
hallerite 2f3865d
chore: polish docstring
hallerite 664dbff
fix: reset state of agent after every response
hallerite 07ee4ae
fix: vectorize evaluation
hallerite 4ace02d
fix: run pre-commit
hallerite e979df2
style: Fix line lenghts to adhere to style checks
apokryphosx 7baf965
style: Fix formating to pass through pre commit
apokryphosx f452a71
Merge branch 'feat/benchmarks' of https://github.com/camel-ai/camel i…
apokryphosx 9e2fd25
fix: restore base class
hallerite 7791817
fix: adjust lazy imports and annotate mutable class attributes with C…
hallerite 99e6be8
chore: fix formatting
hallerite e4960b0
fix: Fixed circular import errors
apokryphosx 089b171
fix: Fixed download function to not pass itself as
apokryphosx 876ea7d
fix: Added a pass@1k as default Mode at the start
apokryphosx c630c91
fix: Fixed load function to not pass itself as a
apokryphosx d581603
fix: Changed it so save_to directory gets made if
apokryphosx 7173270
chore: fix pre-commit errors
hallerite daf14e1
fix: change bool_ to bool return type
hallerite ed9afe1
chore: fix line-length
hallerite 9521041
fix: Utilized Huggingface Math-Verify to correctly
apokryphosx 6bc7ffe
style: Fixed code style to adhere to pre-commit
apokryphosx bfbdde1
fix: Removed debugging and ran experiment with API
apokryphosx 436b588
style: Improved code readability
apokryphosx 0fc67b7
fix: Improved readability, ran experiment with API
apokryphosx 36328e1
style: Ran pre-commit and fixed code style
apokryphosx 70bb28e
fix: Fixed load function to not
apokryphosx c720957
fix: Updated pyproject with huggingface math
apokryphosx
@@ -0,0 +1,173 @@

# ========= Copyright 2023-2024 @ CAMEL-AI.org. All Rights Reserved. =========
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ========= Copyright 2023-2024 @ CAMEL-AI.org. All Rights Reserved. =========

from typing import Any, Dict, List

from camel.agents import ChatAgent
from camel.benchmarks.math_benchmarks.math_base import MathBenchmark, Mode
from camel.logger import get_logger

logger = get_logger(__name__)


class GSM8KBenchmark(MathBenchmark):
    r"""
    Benchmark for evaluating ChatAgents on the GSM8K dataset, a collection of
    grade-school-level math problems sourced from Hugging Face Hub.

    Attributes:
        DATASET_NAME (str): The name of the dataset.
        DATASET_REPO (str): The dataset's repository on Hugging Face.
        QUESTION_COLUMN (str): The column containing math problems.
        ANSWER_COLUMN (str): The column containing solutions.
    """

    import pandas as pd
    from datasets import load_dataset

    DATASET_NAME = "gsm8k"
    DATASET_REPO = "openai/gsm8k"
    QUESTION_COLUMN = "question"
    ANSWER_COLUMN = "answer"

    def __init__(self, data_dir: str, save_to: str, processes: int = 1):
        r"""
        Initializes the GSM8K Benchmark instance.

        Args:
            data_dir (str): Directory for storing the dataset.
            save_to (str): Path for saving benchmark results.
            processes (int, optional): Number of parallel processes.
                Defaults to 1.
        """
        super().__init__(
            name="GSM8K",
            data_dir=data_dir,
            save_to=save_to,
            processes=processes,
        )
        self._data: Dict[str, List[Dict[str, Any]]] = {}

    def download(self) -> "GSM8KBenchmark":
        r"""
        Ensures the GSM8K dataset is available locally. Uses Hugging Face
        Datasets for automatic caching and management.

        Returns:
            GSM8KBenchmark: The benchmark instance after downloading.
        """
        logger.info("Ensuring GSM8K dataset is downloaded...")
        _ = GSM8KBenchmark.load_dataset(
            self.DATASET_REPO, 'main', cache_dir=str(self.data_dir)
        )

        logger.info("GSM8K dataset is ready.")
        return self

    def load(self, force_download: bool = False) -> "GSM8KBenchmark":
        r"""
        Loads the GSM8K dataset into memory, optionally forcing a re-download.

        Args:
            force_download (bool, optional): Whether to force re-downloading
                the dataset. Defaults to False.

        Returns:
            GSM8KBenchmark: The benchmark instance after loading.
        """
        logger.info("Loading GSM8K dataset...")

        dataset = GSM8KBenchmark.load_dataset(
            self.DATASET_REPO,
            'main',
            cache_dir=str(self.data_dir),
            download_mode="force_redownload"
            if force_download
            else "reuse_dataset_if_exists",
        )

        self._data = {
            "train": dataset["train"].to_pandas().to_dict(orient="records"),
            "test": dataset["test"].to_pandas().to_dict(orient="records"),
        }
        return self

    @property
    def valid(self) -> List[Dict[str, Any]]:
        r"""
        Returns an empty list since GSM8K does not have a validation set.

        Returns:
            List[Dict[str, Any]]: An empty list.
        """
        return []

    def _prepare_dataset(self, dataset: List[Dict[str, Any]]) -> pd.DataFrame:
        r"""
        Prepares the dataset by extracting numeric solutions from the answer
        field.

        Args:
            dataset (List[Dict[str, Any]]): The dataset to process.

        Returns:
            pd.DataFrame: The processed dataset with extracted solutions.
        """
        df = self.pd.DataFrame(dataset)
        df["solution"] = df["answer"].str.extract(r"####\s*(-?\d+)")[0]
        return df

    def _generate_solutions(
        self, agent: ChatAgent, dataset: pd.DataFrame, mode: Mode
    ) -> pd.DataFrame:
        r"""
        Efficiently generates responses for each math problem using the
        ChatAgent, ensuring the agent resets between questions without
        unnecessary instantiations.

        Args:
            agent (ChatAgent): The agent responsible for generating answers.
            dataset (pd.DataFrame): The dataset containing math problems.
            mode (Mode): The evaluation mode for generating multiple responses.

        Returns:
            pd.DataFrame: The dataset with generated answers.
        """

        def generate_answer(question: str) -> List[str]:
            r"""
            Generate `k` responses while resetting the agent after each
            question.
            """
            agent.reset()  # Ensuring statelessness
            return [
                agent.step(question).msgs[0].content for _ in range(mode.k)
            ]

        dataset["answers"] = dataset["question"].apply(generate_answer)
        return dataset

    def _preprocess_answers(self, raw_answers: pd.Series) -> pd.Series:
        r"""
        Extracts numeric answers from generated responses using a regular
        expression.

        Args:
            raw_answers (pd.Series): The series containing raw model-generated
                responses.

        Returns:
            pd.Series: Extracted numeric answers.
        """
        return raw_answers.str.extract(r"####\s*(-?\d+)")[0]
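
Both `_prepare_dataset` and `_preprocess_answers` rely on the same pattern, `####\s*(-?\d+)`. As a rough illustration of what that pattern captures, here is a minimal sketch that applies it with the standard-library `re` module instead of pandas. The helper name `extract_final_answer` and the sample strings are hypothetical and not part of this PR.

```python
import re
from typing import Optional

# Same pattern as in the diff: "####", optional whitespace, then an integer.
ANSWER_PATTERN = re.compile(r"####\s*(-?\d+)")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the integer after the first '####' marker, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = ANSWER_PATTERN.search(text)
    return match.group(1) if match else None


# Made-up sample strings, not taken from the dataset.
samples = [
    "48 / 2 = 24, and 48 + 24 = 72\n#### 72",  # typical final-answer line
    "The balance drops by five.\n#### -5",      # negative integer
    "I think the answer is 72.",                # no '####' marker
    "#### 1,234",                               # thousands separator
    "#### 3.5",                                 # decimal value
]

for sample in samples:
    print(repr(extract_final_answer(sample)))
```

Because the capture group is `-?\d+`, the last two samples yield only the digits before the comma or decimal point (`'1'` and `'3'`), and the sample without a marker yields `None`. Whether that matters depends on the answers being matched.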
add check or fill missing values?
there are no missing values for this specific dataset
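
If an explicit guard were wanted anyway, one option is to scan the loaded records before evaluation. This is only a sketch: `rows_missing_solution` is a hypothetical helper, not something in this PR. It assumes the records are the list of dicts produced by `to_dict(orient="records")`, and the sample rows are invented.

```python
import re
from typing import Any, Dict, List

# Same pattern the benchmark uses to pull the final answer.
ANSWER_PATTERN = re.compile(r"####\s*(-?\d+)")


def rows_missing_solution(
    records: List[Dict[str, Any]], answer_key: str = "answer"
) -> List[int]:
    """Return indices of records whose answer has no extractable solution.

    Hypothetical helper: a record is flagged when its answer is not a string
    or when the '####' pattern does not match.
    """
    missing = []
    for index, record in enumerate(records):
        answer = record.get(answer_key)
        if not isinstance(answer, str) or ANSWER_PATTERN.search(answer) is None:
            missing.append(index)
    return missing


# Invented rows for illustration.
records = [
    {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"},
    {"question": "Unfinished row", "answer": "worked steps, no final marker"},
    {"question": "Empty row", "answer": None},
]

print(rows_missing_solution(records))
```

An empty result would confirm the "no missing values" assumption for whichever split is loaded. A non-empty one gives the indices to inspect or drop.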