[Draft] Add math benchmarks #1570

Open
wants to merge 137 commits into
base: master
0a6458b
feat(benchmark): initial GSM8K benchmark setup
hallerite Jan 30, 2025
2421a17
feat(benchmark): initial MATH benchmark setup
hallerite Jan 30, 2025
0e8a480
add new base class for math envs
hallerite Feb 5, 2025
46acc02
pass mode explicitly
hallerite Feb 5, 2025
34d7915
handle data loading and shuffling in base class
hallerite Feb 5, 2025
eace104
feat: add preprocessing step for dataset
hallerite Feb 6, 2025
bfccedf
fix: update gsm8k to new math_benchmark base
hallerite Feb 6, 2025
f1b7085
feat: add MATH benchmark
hallerite Feb 6, 2025
b10806a
fix: update __init__.py
hallerite Feb 6, 2025
c88bce1
fix: Deepseek r1 reasoning content (#1523)
Wendong-Fan Jan 30, 2025
6beebb8
feat: Support OpenAI o3 mini (#1533)
Wendong-Fan Jan 31, 2025
2c2ec6c
fix: WolframAlpha result from all subpods (#1534)
Wendong-Fan Feb 2, 2025
ae3203e
docs:data distillation pipline for distilling high-quality maths reas…
zjrwtx Feb 3, 2025
3c695aa
chore: Qwen datagen cookbook add graph (#1530)
MuggleJinx Feb 4, 2025
14bc56a
feat: Integrate moonshot models to camel (#1526)
GitHoobar Feb 4, 2025
31e54cc
docs: update human cookbook using qwen model (#1555)
MuggleJinx Feb 5, 2025
f4b5452
docs: Add Self-Improving Math Reasoning Data Distillation (#1556)
Wendong-Fan Feb 5, 2025
22b141e
feat: Integrate siliconflow model (#1541)
Asher-hss Feb 5, 2025
a239505
chore: Update supported models (#1557)
Wendong-Fan Feb 5, 2025
ee9b32b
chore: Support Gemini 2.0 flash thinking and pro (#1558)
Wendong-Fan Feb 5, 2025
00f2db3
chore: Rename tool call instances (#1492)
JINO-ROHIT Feb 6, 2025
ba31cba
feat: Add SemanticScholarToolkits to integrate Semantic Scholar to ca…
renxinxing123 Feb 6, 2025
6faa9ee
enhance: Semantic Scholar (#1562)
Wendong-Fan Feb 6, 2025
8714944
release: v0.2.20a0 (#1564)
Wendong-Fan Feb 7, 2025
6b3c4f1
release: 0.2.20a1 (#1566)
Wendong-Fan Feb 7, 2025
4a3e64b
feat: Waleed Sympy integration (#1445)
waleedalzarooni Feb 7, 2025
47ed674
feat: implementing STaR: Self Taught Reasoner (#1478)
GitHoobar Feb 7, 2025
636169e
chore: Update message setting (#1506)
Wendong-Fan Feb 7, 2025
9d9c3be
chore: Issue&PR template and dependency update (#1572)
Wendong-Fan Feb 9, 2025
311d5cb
docs(cookbooks): Fix Unused Custom Tool in Tools Cookbook (issue-1573…
coolbeevip Feb 9, 2025
29747d9
docs(cookbooks): Fix the DeprecationWarning from Composio (issue-1575…
coolbeevip Feb 9, 2025
c49fbe3
feat: Internal deduplication impl. (#1568)
keli-wen Feb 9, 2025
2c2064f
feat: Integrate AIML model platform (#1580)
Wendong-Fan Feb 9, 2025
89ae633
release: v0.2.20 (#1581)
Wendong-Fan Feb 9, 2025
c1a6e27
refactor: Internal deduplication (#1579)
Wendong-Fan Feb 10, 2025
5a0f6bb
ensured docstrings are raw strings
apokryphosx Feb 11, 2025
f734e33
fixed import errors
apokryphosx Feb 11, 2025
79f0950
chore: update unstructured io version (#1586)
Wendong-Fan Feb 11, 2025
a99ec50
fix: Subprocess interpreter don't have output (#1590)
Wendong-Fan Feb 11, 2025
117d920
feat: URL Handling in OpenAIEmbedding to Align with OpenAIModel (#1593)
coolbeevip Feb 12, 2025
64eb90a
chore: Small enhancement and fix (#1599)
Wendong-Fan Feb 13, 2025
5739286
feat: adding MinerU Extractor (#1529)
GitHoobar Feb 14, 2025
0a82b63
feat(datagen): Enhance JSON Dump in SelfInstructPipeline for Non-ASCI…
coolbeevip Feb 15, 2025
d8526db
fix: Remove <think> content in model request (#1613)
Wendong-Fan Feb 15, 2025
b5aa56f
docs: Correct hyperlink for Hackathon Judge Committee (#1617)
1sarthakbhardwaj Feb 16, 2025
7d1de80
docs(cookbooks): Fix incorrect URL in self_instruct_data_generation.i…
coolbeevip Feb 16, 2025
4c49622
feat: custom prompt in graph agent (#1605)
NeilJohnson0930 Feb 16, 2025
1d2e6e0
feat: Hybrid Retrieval (#1398)
yiyiyi0817 Feb 16, 2025
32b42cc
feat: Waleed Networkx integration (#1456)
waleedalzarooni Feb 16, 2025
a533d37
Implemented Unit tests for math_bench and gsm8k
apokryphosx Feb 16, 2025
a34e15e
implemented examples for gsm8k and math_bench
apokryphosx Feb 16, 2025
5b6e3af
Fixing code style to pass checks
apokryphosx Feb 16, 2025
6ea335a
code style fixing
apokryphosx Feb 16, 2025
1c82733
Added download tests
apokryphosx Feb 16, 2025
3ae7211
fixed code style to pass tests
apokryphosx Feb 16, 2025
dc1a759
fixed code style to pass tests
apokryphosx Feb 16, 2025
a2f4b7b
fix: cleaning up messy branch
apokryphosx Feb 20, 2025
212fb0e
fix: Clean up messy branch
apokryphosx Feb 20, 2025
7905270
fix: Clean up messy branch
apokryphosx Feb 20, 2025
429e103
fix: Clean up messy branch
apokryphosx Feb 20, 2025
62b526d
fix: Clean up messy branch
apokryphosx Feb 20, 2025
1a1e0d5
fix: Clean up messy branch
apokryphosx Feb 20, 2025
18c205e
fix: Clean up messy branch
apokryphosx Feb 20, 2025
33aff5f
fix: Clean up messy branch
apokryphosx Feb 20, 2025
7f5211c
fix: Clean up messy branch
apokryphosx Feb 20, 2025
ced3595
fix: Clean up messy branch
apokryphosx Feb 20, 2025
ebb7096
fix: Clean up messy branch"
apokryphosx Feb 20, 2025
43c1f79
fix: Clean up messy branch
apokryphosx Feb 20, 2025
21fc10a
fix: Clean up messy branch
apokryphosx Feb 20, 2025
df67dfd
fix: Clean up messy branch
apokryphosx Feb 20, 2025
d66e09c
fix: Clean up messy branch
apokryphosx Feb 20, 2025
7486ab9
fix: Clean up messy branch
apokryphosx Feb 20, 2025
568879e
fix: Clean up messy branch
apokryphosx Feb 20, 2025
d2949fa
fix: Clean up messy branch
apokryphosx Feb 20, 2025
dee1c6b
fix: Clean up messy branch
apokryphosx Feb 20, 2025
09c55f7
fix: Clean up messy branch
apokryphosx Feb 20, 2025
935d158
fix: Clean up messy branch
apokryphosx Feb 20, 2025
407ceab
fix: Clean up messy branch
apokryphosx Feb 20, 2025
0264a9e
fix: Clean up messy branch
apokryphosx Feb 20, 2025
99dbba2
fix: Clean up messy branch
apokryphosx Feb 20, 2025
445fde1
fix: Clean up messy branch
apokryphosx Feb 20, 2025
004ca9f
fix: Clean up messy branch
apokryphosx Feb 20, 2025
56cbc67
fix: Clean up messy branch
apokryphosx Feb 20, 2025
fb34d40
fix: Clean up messy branch
apokryphosx Feb 20, 2025
6cd1515
fix: Clean up messy branch
apokryphosx Feb 20, 2025
74ba9f8
fix: Clean up messy branch
apokryphosx Feb 20, 2025
fdcff44
fix: Clean up messy branch
apokryphosx Feb 20, 2025
d25cb63
fix: Clean up messy branch
apokryphosx Feb 20, 2025
da40abc
fix: Clean up messy branch
apokryphosx Feb 20, 2025
2e89a9e
fix: Clean up messy branch
apokryphosx Feb 20, 2025
2f611f6
fix: Clean up messy branch
apokryphosx Feb 20, 2025
4972158
fix: Clean up messy branch
apokryphosx Feb 20, 2025
8c8ac4a
fix: Clean up messy branch
apokryphosx Feb 20, 2025
0f6820b
fix: Clean up messy branch
apokryphosx Feb 20, 2025
b07701d
fix: Clean up messy branch
apokryphosx Feb 20, 2025
daf11b1
fix: Clean up messy branch
apokryphosx Feb 20, 2025
a418207
fix: Clean up messy branch
apokryphosx Feb 20, 2025
e8c8fb1
fix: Clean up messy branch
apokryphosx Feb 20, 2025
3ebeb23
fix: Clean up messy branch
apokryphosx Feb 20, 2025
02315e8
fix: Clean up messy branch
apokryphosx Feb 20, 2025
2788ce1
fix: Clean up messy branch
apokryphosx Feb 20, 2025
84c8bf6
fix: Clean up messy branch
apokryphosx Feb 20, 2025
e40b8cb
fix: Clean up messy branch
apokryphosx Feb 20, 2025
e3b100c
feat: Add unit tests for GSM8K and Math Benchmark
apokryphosx Feb 20, 2025
2cade5c
feat: Add examples for GSM8K and Math Benchmark
apokryphosx Feb 20, 2025
3d8cdc7
style: Fix docstrings, typing and line lengths
apokryphosx Feb 20, 2025
964513b
Merge branch 'master' into feat/benchmarks
Wendong-Fan Feb 22, 2025
094aed7
chore: add license
hallerite Feb 23, 2025
8f3f063
chore: use camel logging
hallerite Feb 23, 2025
ee4d952
fix: use lazy imports
hallerite Feb 23, 2025
a412ee2
fix: add "valid" to input validation
hallerite Feb 23, 2025
2f3865d
chore: polish docstring
hallerite Feb 23, 2025
664dbff
fix: reset state of agent after every response
hallerite Feb 23, 2025
07ee4ae
fix: vectorize evaluation
hallerite Feb 23, 2025
4ace02d
fix: run pre-commit
hallerite Feb 23, 2025
e979df2
style: Fix line lenghts to adhere to style checks
apokryphosx Feb 24, 2025
7baf965
style: Fix formating to pass through pre commit
apokryphosx Feb 24, 2025
f452a71
Merge branch 'feat/benchmarks' of https://github.com/camel-ai/camel i…
apokryphosx Feb 24, 2025
9e2fd25
fix: restore base class
hallerite Feb 24, 2025
7791817
fix: adjust lazy imports and annotate mutable class attributes with C…
hallerite Feb 24, 2025
99e6be8
chore: fix formatting
hallerite Feb 24, 2025
e4960b0
fix: Fixed circular import errors
apokryphosx Mar 2, 2025
089b171
fix: Fixed download function to not pass itself as
apokryphosx Mar 2, 2025
876ea7d
fix: Added a pass@1k as default Mode at the start
apokryphosx Mar 2, 2025
c630c91
fix: Fixed load function to not pass itself as a
apokryphosx Mar 2, 2025
d581603
fix: Changed it so save_to directory gets made if
apokryphosx Mar 2, 2025
7173270
chore: fix pre-commit errors
hallerite Mar 2, 2025
daf14e1
fix: change bool_ to bool return type
hallerite Mar 2, 2025
ed9afe1
chore: fix line-length
hallerite Mar 2, 2025
9521041
fix: Utilized Huggingface Math-Verify to correctly
apokryphosx Mar 2, 2025
6bc7ffe
style: Fixed code style to adhere to pre-commit
apokryphosx Mar 2, 2025
bfbdde1
fix: Removed debugging and ran experiment with API
apokryphosx Mar 3, 2025
436b588
style: Improved code readability
apokryphosx Mar 3, 2025
0fc67b7
fix: Improved readability, ran experiment with API
apokryphosx Mar 3, 2025
36328e1
style: Ran pre-commit and fixed code style
apokryphosx Mar 3, 2025
70bb28e
fix: Fixed load function to not
apokryphosx Mar 3, 2025
c720957
fix: Updated pyproject with huggingface math
apokryphosx Mar 3, 2025
7 changes: 7 additions & 0 deletions camel/benchmarks/__init__.py
@@ -16,11 +16,18 @@
from .apibench import APIBenchBenchmark
from .base import BaseBenchmark
from .gaia import DefaultGAIARetriever, GAIABenchmark
from .math_benchmarks.gsm8k import GSM8KBenchmark
from .math_benchmarks.math_base import MathBenchmark, Mode
from .math_benchmarks.math_bench import MATHBenchmark
from .nexus import NexusBenchmark
from .ragbench import RAGBenchBenchmark

__all__ = [
    "BaseBenchmark",
    "MathBenchmark",
    "Mode",
    "MATHBenchmark",
    "GSM8KBenchmark",
    "GAIABenchmark",
    "DefaultGAIARetriever",
    "NexusBenchmark",
Expand Down
173 changes: 173 additions & 0 deletions camel/benchmarks/math_benchmarks/gsm8k.py
@@ -0,0 +1,173 @@
# ========= Copyright 2023-2024 @ CAMEL-AI.org. All Rights Reserved. =========
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ========= Copyright 2023-2024 @ CAMEL-AI.org. All Rights Reserved. =========

from typing import Any, Dict, List

from camel.agents import ChatAgent
from camel.benchmarks.math_benchmarks.math_base import MathBenchmark, Mode
from camel.logger import get_logger

logger = get_logger(__name__)


class GSM8KBenchmark(MathBenchmark):
    r"""
    Benchmark for evaluating ChatAgents on the GSM8K dataset, a collection of
    grade-school-level math problems sourced from Hugging Face Hub.

    Attributes:
        DATASET_NAME (str): The name of the dataset.
        DATASET_REPO (str): The dataset's repository on Hugging Face.
        QUESTION_COLUMN (str): The column containing math problems.
        ANSWER_COLUMN (str): The column containing solutions.
    """

    import pandas as pd
    from datasets import load_dataset

    DATASET_NAME = "gsm8k"
    DATASET_REPO = "openai/gsm8k"
    QUESTION_COLUMN = "question"
    ANSWER_COLUMN = "answer"

    def __init__(self, data_dir: str, save_to: str, processes: int = 1):
        r"""
        Initializes the GSM8K Benchmark instance.

        Args:
            data_dir (str): Directory for storing the dataset.
            save_to (str): Path for saving benchmark results.
            processes (int, optional): Number of parallel processes.
                Defaults to 1.
        """
        super().__init__(
            name="GSM8K",
            data_dir=data_dir,
            save_to=save_to,
            processes=processes,
        )
        self._data: Dict[str, List[Dict[str, Any]]] = {}

    def download(self) -> "GSM8KBenchmark":
        r"""
        Ensures the GSM8K dataset is available locally. Uses Hugging Face
        Datasets for automatic caching and management.

        Returns:
            GSM8KBenchmark: The benchmark instance after downloading.
        """
        logger.info("Ensuring GSM8K dataset is downloaded...")
        _ = GSM8KBenchmark.load_dataset(
            self.DATASET_REPO, 'main', cache_dir=str(self.data_dir)
        )

        logger.info("GSM8K dataset is ready.")
        return self

    def load(self, force_download: bool = False) -> "GSM8KBenchmark":
        r"""
        Loads the GSM8K dataset into memory, optionally forcing a re-download.

        Args:
            force_download (bool, optional): Whether to force re-downloading
                the dataset. Defaults to False.

        Returns:
            GSM8KBenchmark: The benchmark instance after loading.
        """
        logger.info("Loading GSM8K dataset...")

        dataset = GSM8KBenchmark.load_dataset(
            self.DATASET_REPO,
            'main',
            cache_dir=str(self.data_dir),
            download_mode="force_redownload"
            if force_download
            else "reuse_dataset_if_exists",
        )

        self._data = {
            "train": dataset["train"].to_pandas().to_dict(orient="records"),
            "test": dataset["test"].to_pandas().to_dict(orient="records"),
        }
        return self

    @property
    def valid(self) -> List[Dict[str, Any]]:
        r"""
        Returns an empty list since GSM8K does not have a validation set.

        Returns:
            List[Dict[str, Any]]: An empty list.
        """
        return []

    def _prepare_dataset(self, dataset: List[Dict[str, Any]]) -> pd.DataFrame:
        r"""
        Prepares the dataset by extracting numeric solutions from the answer
        field.

        Args:
            dataset (List[Dict[str, Any]]): The dataset to process.

        Returns:
            pd.DataFrame: The processed dataset with extracted solutions.
        """
        df = self.pd.DataFrame(dataset)
        df["solution"] = df["answer"].str.extract(r"####\s*(-?\d+)")[0]
Review comment (Member):

add check or fill missing values?

df["solution"] = (
    df["answer"].str.extract(r"####\s*(-?\d+)", expand=False)
                 .fillna(value="NO_MATCH")  # or np.nan
)

Reply (Collaborator Author):

there are no missing values for this specific dataset
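
The claim in this thread can be checked rather than assumed. The following is a minimal, dependency-free sketch that applies the same pattern with the standard `re` module and reports which rows fail to match. The helper names `extract_solution` and `unmatched_rows` and the sample strings are hypothetical and not part of this PR. With pandas, rows that do not match come back from `str.extract` as NaN; here they come back as `None`.

```python
import re
from typing import List, Optional

# Same pattern as in the diff: the "####" marker followed by an integer.
PATTERN = re.compile(r"####\s*(-?\d+)")


def extract_solution(answer: str) -> Optional[str]:
    """Return the first integer after '####', or None if there is no match."""
    match = PATTERN.search(answer)
    return match.group(1) if match else None


def unmatched_rows(answers: List[str]) -> List[int]:
    """Return the indices of answers where the pattern finds nothing."""
    return [i for i, a in enumerate(answers) if extract_solution(a) is None]


# Hypothetical sample rows, not taken from the dataset.
samples = [
    "She sells 9 eggs at $2 each.\n#### 18",
    "The balance drops by 3.\n#### -3",
    "This row has no final-answer marker.",
]

missing = unmatched_rows(samples)
```

Running `unmatched_rows` over the real `answer` column once would confirm whether a fallback such as `fillna` is needed for this dataset.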

        return df

    def _generate_solutions(
        self, agent: ChatAgent, dataset: pd.DataFrame, mode: Mode
    ) -> pd.DataFrame:
        r"""
        Efficiently generates responses for each math problem using the
        ChatAgent, ensuring the agent resets between questions without
        unnecessary instantiations.

        Args:
            agent (ChatAgent): The agent responsible for generating answers.
            dataset (pd.DataFrame): The dataset containing math problems.
            mode (Mode): The evaluation mode for generating multiple responses.

        Returns:
            pd.DataFrame: The dataset with generated answers.
        """

        def generate_answer(question: str) -> List[str]:
            r"""
            Generate `k` responses while resetting the agent after each
            question.
            """
            agent.reset()  # Ensuring statelessness
            return [
                agent.step(question).msgs[0].content for _ in range(mode.k)
            ]

        dataset["answers"] = dataset["question"].apply(generate_answer)
        return dataset
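
Note on the sampling loop above: `generate_answer` resets the agent once per question and then calls `step` `k` times. The sketch below uses a hypothetical `StubAgent` (not the real `ChatAgent`) that only records its history, to show how resetting once per question differs from resetting before every sample. Whether the real agent carries conversation history between `step` calls is an assumption here, not something this diff shows.

```python
from typing import List


class StubAgent:
    """Hypothetical stand-in for a chat agent; it only tracks its history."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def reset(self) -> None:
        self.history.clear()

    def step(self, prompt: str) -> str:
        self.history.append(prompt)
        # The reply exposes how many prompts the agent has seen since reset.
        return f"reply {len(self.history)} to {prompt!r}"


def sample_reset_per_question(agent: StubAgent, question: str, k: int) -> List[str]:
    """Mirror the diff: one reset, then k consecutive steps."""
    agent.reset()
    return [agent.step(question) for _ in range(k)]


def sample_reset_per_sample(agent: StubAgent, question: str, k: int) -> List[str]:
    """Alternative: reset before every sample so each draw starts clean."""
    answers = []
    for _ in range(k):
        agent.reset()
        answers.append(agent.step(question))
    return answers


per_question = sample_reset_per_question(StubAgent(), "2+2?", 3)
per_sample = sample_reset_per_sample(StubAgent(), "2+2?", 3)
```

Under the stub's assumption, the second and third samples in `per_question` are produced with earlier turns still in the history, while every sample in `per_sample` starts from an empty one. If independent samples are intended for pass@k, the second shape may be the safer choice.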

    def _preprocess_answers(self, raw_answers: pd.Series) -> pd.Series:
        r"""
        Extracts numeric answers from generated responses using a regular
        expression.

        Args:
            raw_answers (pd.Series): The series containing raw model-generated
                responses.

        Returns:
            pd.Series: Extracted numeric answers.
        """
        return raw_answers.str.extract(r"####\s*(-?\d+)")[0]
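
Note on the extraction pattern used in `_prepare_dataset` and `_preprocess_answers`: `(-?\d+)` captures only the leading run of digits, so a value written with a thousands separator or a decimal point is truncated rather than rejected. The sketch below demonstrates this with the standard `re` module. `TOLERANT` and `normalize` are hypothetical illustrations, not part of this PR, and whether such values actually occur in the dataset or in model output is not established here.

```python
import re
from typing import Optional

STRICT = re.compile(r"####\s*(-?\d+)")  # pattern from the diff
# Hypothetical alternative: also accept commas and one decimal point.
TOLERANT = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def strict_extract(text: str) -> Optional[str]:
    """Apply the diff's pattern and return the captured digits, if any."""
    match = STRICT.search(text)
    return match.group(1) if match else None


def normalize(text: str) -> Optional[str]:
    """Apply the tolerant pattern and drop thousands separators."""
    match = TOLERANT.search(text)
    return match.group(1).replace(",", "") if match else None


truncated_comma = strict_extract("#### 1,000")
truncated_decimal = strict_extract("#### 3.5")
kept_comma = normalize("#### 1,000")
kept_decimal = normalize("#### 3.5")
```

A response with no `####` marker yields no match under either pattern, so the evaluation also depends on the agent being prompted to emit that marker.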