Run `uv sync`, then open `nbs/01_abliterate.ipynb` in Jupyter Notebook.
TODO:
- tidy up
- use https://github.com/wassname/activation_store to cache large numbers of examples to disc
To evaluate the intervention, I use:

- performance: wikitext perplexity (a minimal sketch follows this list)
- compliance: https://huggingface.co/datasets/wassname/genies_preferences
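A minimal sketch of the perplexity check, assuming a Hugging Face causal LM; the model name, chunk length, and non-overlapping stride are illustrative, not the notebook's actual settings:

```python
# Sketch: wikitext perplexity of a causal LM (placeholder model and settings).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; swap in the abliterated model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_len = stride = 1024  # non-overlapping chunks for simplicity
nlls = []
for i in range(0, encodings.input_ids.size(1) - 1, stride):
    input_ids = encodings.input_ids[:, i : i + max_len]
    with torch.no_grad():
        out = model(input_ids, labels=input_ids)  # mean NLL over the chunk
    nlls.append(out.loss)

print("wikitext perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```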
Related projects:

- FailSpy's abliterator: https://github.com/FailSpy/abliterator
  - uses the datasets Undi95/orthogonal-activation-steering-TOXIC vs tatsu-lab/alpaca (a sketch of this difference-of-means approach follows the list)
- EleutherAI's concept-erasure (LEACE): https://github.com/EleutherAI/concept-erasure, a more advanced method; instead of removing all differences, it removes only the ones that are predictive (also sketched after the list). Example notebook: https://github.com/EleutherAI/conceptual-constraints/blob/main/notebooks/concept_erasure.ipynb
  - which uses the HANS dataset
- https://github.com/Sumandora/remove-refusals-with-transformers/
  - uses advbench/harmful_behaviors vs alpaca_cleaned
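For reference, FailSpy's abliterator and the Sumandora script share the same core idea: average the residual-stream activations over a harmful and a harmless prompt set, treat the difference of means as a refusal direction, and project that direction out. A minimal sketch with random stand-ins for the cached activations; the helper names are mine, not either repo's API:

```python
# Minimal difference-of-means abliteration sketch; tensors are random stand-ins
# for cached residual-stream activations, helper names are illustrative.
import torch


def refusal_direction(harmful: torch.Tensor, harmless: torch.Tensor) -> torch.Tensor:
    """Unit vector from the harmless mean to the harmful mean.

    harmful, harmless: [n_samples, d_model] activations at one layer/position."""
    direction = harmful.mean(dim=0) - harmless.mean(dim=0)
    return direction / direction.norm()


def ablate_from_weight(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the direction from the output of a weight that writes to the residual stream.

    weight: [d_model, d_in]; (I - d d^T) @ weight zeroes each column's component along d."""
    d = direction / direction.norm()
    return weight - torch.outer(d, d) @ weight


# Toy usage with random data.
d_model = 64
harmful_acts = torch.randn(100, d_model) + 0.5   # pretend activations on harmful prompts
harmless_acts = torch.randn(100, d_model)        # pretend activations on harmless prompts
d = refusal_direction(harmful_acts, harmless_acts)

W_out = torch.randn(d_model, 4 * d_model)        # e.g. an MLP down-projection weight
W_ablated = ablate_from_weight(W_out, d)
assert torch.allclose(d @ W_ablated, torch.zeros(4 * d_model), atol=1e-4)
```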
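The concept-erasure route instead fits a linear eraser (LEACE), so only the linearly predictive part of the harmful/harmless distinction is removed from the activations rather than the entire mean difference. A sketch along the lines of the library's README usage; treat the exact call signature as an assumption:

```python
# Sketch of LEACE-style concept erasure on cached activations; data is random
# and API usage follows the concept-erasure README (treat as an assumption).
import torch
from concept_erasure import LeaceEraser

d_model = 64
harmful_acts = torch.randn(100, d_model) + 0.5   # pretend activations on harmful prompts
harmless_acts = torch.randn(100, d_model)        # pretend activations on harmless prompts

X = torch.cat([harmful_acts, harmless_acts])               # activations, [n, d_model]
Z = torch.cat([torch.ones(100), torch.zeros(100)])         # concept labels: harmful = 1

eraser = LeaceEraser.fit(X, Z)   # removes only what is linearly predictive of Z
X_erased = eraser(X)             # no linear probe should recover Z from X_erased
```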