[Draft] Add math benchmarks #1570

Open
wants to merge 137 commits into
base: master

hallerite (Collaborator) commented Feb 7, 2025

Description

This PR introduces a base class for math benchmarks and provides implementations for:

  • GSM8K benchmark
  • MATH benchmark
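For readers who have not opened the diff, the shape described above (one shared base class, one subclass per dataset) can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class names `MathBenchmarkSketch`, `GSM8KSketch` and `MATHSketch` and the methods `extract_answer` and `accuracy` are hypothetical. The sketch assumes the usual dataset conventions, namely that GSM8K reference solutions end with `#### <number>` and MATH reference solutions wrap the final answer in `\boxed{...}`.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional


class MathBenchmarkSketch(ABC):
    r"""Hypothetical base class: subclasses decide how to extract answers."""

    @abstractmethod
    def extract_answer(self, text: str) -> Optional[str]:
        r"""Return the final answer found in ``text``, or None if absent."""

    def accuracy(self, predictions: List[str], references: List[str]) -> float:
        r"""Fraction of predictions whose answer equals the reference's."""
        if not references:
            return 0.0
        correct = 0
        for pred, ref in zip(predictions, references):
            answer = self.extract_answer(pred)
            # A prediction without an extractable answer never counts.
            if answer is not None and answer == self.extract_answer(ref):
                correct += 1
        return correct / len(references)


class GSM8KSketch(MathBenchmarkSketch):
    r"""Assumes the answer follows a '####' marker, as in GSM8K solutions."""

    def extract_answer(self, text: str) -> Optional[str]:
        match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
        # Drop thousands separators so '1,000' and '1000' compare equal.
        return match.group(1).replace(",", "") if match else None


class MATHSketch(MathBenchmarkSketch):
    r"""Assumes the answer is inside the last \boxed{...} of the text."""

    def extract_answer(self, text: str) -> Optional[str]:
        marker = "\\boxed{"
        start = text.rfind(marker)
        if start == -1:
            return None
        begin = start + len(marker)
        depth = 1
        # Walk forward and match braces so nested LaTeX groups survive.
        for i in range(begin, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    return text[begin:i]
        return None
```

Plain string equality is a deliberately crude comparison here; it treats `1/2` and `0.5` as different answers, which is the kind of gap a symbolic checker is meant to close.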

Motivation and Context

This PR addresses and closes #1510.

Types of Changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (adds/modifies project documentation)
  • Example update (adds/modifies example code)

Implemented Tasks ✅

  • Implement math base benchmark
  • Implement GSM8K benchmark
  • Implement MATH benchmark
  • Add unit tests
  • Add example code on how to use benchmarks
  • Update documentation with usage details

Checklist 📝

Please go over all the following points and put an x in the boxes that apply.
If you're unsure about any, feel free to ask!

  • I have read the CONTRIBUTION GUIDE (required).
  • My changes require a documentation update.
  • I have updated tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

Draft Status 🚧

Current Progress:

  • ✅ Core benchmark implementations are complete.
  • 🚧 Work in Progress: Tests, examples, and documentation updates.

Next Steps:

  • Implement unit tests to ensure benchmark reliability.
  • Provide example usage to guide users.
  • Finalize and update documentation.

zjrwtx (Collaborator) left a comment

Thanks @hallerite, great work! But the docstrings need to be polished; please refer to:
https://github.com/camel-ai/camel/blob/master/CONTRIBUTING.md#guideline-for-writing-docstrings



class GSM8KBenchmark(MathBenchmark):
    """Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""
Collaborator

Suggested change
-    """Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""
+    r"""Benchmark for evaluating ChatAgents on the GSM8K dataset from Hugging Face Hub."""

Collaborator

An example of how to optimize a docstring.
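Going one step beyond the raw-string prefix, a fuller docstring in the spirit of the linked guideline might look like the sketch below. It assumes the guideline asks for an `r"""` opener, lines of at most 79 characters, and an `Args:` section with a `(default: :obj:...)` note. The `data_dir` parameter and the stub `MathBenchmark` class are hypothetical stand-ins, not the signatures used in this PR.

```python
class MathBenchmark:
    r"""Stub standing in for the PR's base class (hypothetical)."""


class GSM8KBenchmark(MathBenchmark):
    r"""Benchmark for evaluating ChatAgents on the GSM8K dataset from
    Hugging Face Hub.

    Args:
        data_dir (str): Directory in which the dataset is cached. This
            parameter is an assumption made for illustration only.
            (default: :obj:`"data"`)
    """

    def __init__(self, data_dir: str = "data") -> None:
        self.data_dir = data_dir
```

The one-line summary from the diff is split across two lines here so that it stays within the assumed 79-character limit.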

zjrwtx (Collaborator) commented Feb 8, 2025

Can we add a usage example under the examples directory?

Wendong-Fan and others added 14 commits February 11, 2025 13:25
…oning data with thought process (Long Cot data)from deepseek R1 (#1532)

Co-authored-by: yifeng.wang <3038880699@qq.com>
Co-authored-by: Wendong <w3ndong.fan@gmail.com>
Co-authored-by: Wendong-Fan <133094783+Wendong-Fan@users.noreply.github.com>
…mel (#1493)

Co-authored-by: 任信行 <renxinxing@renxinxingdeMacBook-Pro.local>
Co-authored-by: Harry Ye <116691547+harryeqs@users.noreply.github.com>
Co-authored-by: Wendong-Fan <133094783+Wendong-Fan@users.noreply.github.com>
hallerite and others added 29 commits February 23, 2025 12:35
parse and evaluate the Agents Output
pass itself as a directory
verify and added it to mypy overrides since it
doesn't have a typing package
hallerite (Collaborator, Author) commented Mar 4, 2025

@apokryphosx added math-verify as a dependency, but since the package is very new, it seems it cannot be resolved. Any idea what we should do? Without it, the benchmarks are much less powerful.
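One common way around a dependency that cannot always be resolved is to treat it as optional and degrade gracefully. The sketch below is a suggestion under assumptions, not what the PR does: the helper `answers_match` is hypothetical, and it assumes the package is importable as `math_verify` and exposes `parse` and `verify` with gold-first argument order, as in the Math-Verify README.

```python
try:
    # Optional dependency: symbolic comparison of mathematical answers.
    from math_verify import parse, verify

    HAS_MATH_VERIFY = True
except ImportError:
    HAS_MATH_VERIFY = False


def answers_match(gold: str, predicted: str) -> bool:
    """Compare two answers, preferring math-verify when it is installed."""
    if HAS_MATH_VERIFY:
        # Assumed call order: the reference answer first, then the candidate.
        return bool(verify(parse(gold), parse(predicted)))
    # Fallback: whitespace-insensitive string equality, which is far weaker
    # (it cannot tell that '1/2' and '0.5' denote the same value).
    return "".join(gold.split()) == "".join(predicted.split())
```

With this pattern the package could be declared as an optional extra rather than a hard requirement, at the cost of less powerful checking when it is absent.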

cc: @Wendong-Fan


Successfully merging this pull request may close these issues.

[Feature Request] Math and code benchmark to evaluate trained model