Skip to content

Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family.

License

Notifications You must be signed in to change notification settings

penfever/wildchat-50m

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WildChat-50m

This repository contains all code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family.

Links

Our Dataset

Our Models, Including Re-Wild

Our Judgment Datasets

Extended Evalchemy Results

Weights and Biases Logs

These will be made available with a later release.

Generating Model Responses

The dataset was produced using generate_model_responses.py. Although we focused on the WildChat-1M dataset, we believe that the code should generalize reasonably easily to other HuggingFace datasets which contain a column of conversations.

Viewing Sample Model Responses

You can find sample conversations, in custom HTML format, as well as judgments, in the model_responses directory.

Configs

The configs directory contains samples of the scripts used to launch our training runs, which were conducted using axolotl for SFT and open-instruct for DPO.

Notebooks

Our plotting notebook reproduces the plots associated with this paper. The conversation_processing notebook generates the custom-formatted HTML conversations for pairs of models side-by-side, which we utilize in the appendix of our paper. The mt_bench_jsonl_to_html generates the custom-formatted HTML conversations for single models with judgments. textual_similarity was used to compute the similarity scores between models.

Licensing

All code and data authored by us is released under the Apache 2.0 License. All data not authored by us is subject to its original license(s).

Citation

If you find our work useful, please consider citing us!

@misc{feuer2025wildchat50mdeepdiverole,
      title={WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training}, 
      author={Benjamin Feuer and Chinmay Hegde},
      year={2025},
      eprint={2501.18511},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2501.18511}, 
}

About

Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published