
# CHANGELOG

## v0.4.0 (2025-02-22)

### Chore

  • chore: making test less flaky (3effa18)

  • chore: fix updated torch types (4c46da6)

  • chore: fixing linting errors and adding precommit hook (85f6241)

### Feature

  • feat: allow setting the artifacts path (2a4b4dc)

### Fix

  • fix: gracefully handle slashes in model filename for autointerp (5d6464a)
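
As background for the slash-handling fix above (5d6464a): model identifiers such as `google/gemma-2-2b` contain a path separator, so using them verbatim as artifact filenames creates unintended subdirectories. A minimal sketch of the general technique; the helper name and replacement character are assumptions, not the repo's actual code:

```python
def safe_model_filename(model_name: str) -> str:
    """Replace path separators so a model name can be used as a flat filename."""
    return model_name.replace("/", "_")

print(safe_model_filename("google/gemma-2-2b"))  # google_gemma-2-2b
```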

  • fix: fix typing and updating mdl for saelens >=5.4.0 (802d1c3)

  • fix: load probe class with weights_only = False (f05bf40)

  • fix: Update README to include eval output schema update instructions (f0adee2)

  • fix: Update json schema jsons (2b2a6d3)

### Unknown

  • Merge pull request #60 from chanind/deflaking-test

chore: making test less flaky (963f2e8)

  • Remove threshold from state dict if we aren't using it (d91a218)

  • Merge pull request #59 from chanind/artifacts-path-option

feat: allow setting the artifacts path (53901a2)

  • Merge pull request #58 from chanind/fixing-types

chore: fix updated torch types (849018f)

  • Merge pull request #57 from chanind/fix-slash-in-model-name-autointerp

fix: gracefully handle slashes in model filename for autointerp (11b2e38)

  • adding artifacts_path to unlearning eval (ce1de32)

  • By default we don't use a threshold for custom topk SAEs (60579ed)

  • Merge pull request #56 from chanind/type-fixes

fix: fix typing and updating mdl for saelens >=5.4.0 (0888d07)

  • Merge pull request #55 from chanind/precommit-check

chore: fixing linting errors and adding precommit hook (7ac7ced)

  • Fix SAE Bench SAEs repo names (18dc457)

  • Prevent potential division by zero (92315dd)
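
The division-by-zero guard above (92315dd) follows a common pattern in metric code: fall back to a default when a denominator (e.g. a count of matching examples) can be zero. A hedged sketch, with an illustrative helper name not taken from the repo:

```python
def safe_ratio(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Return numerator / denominator, or a default when the denominator is zero."""
    return numerator / denominator if denominator != 0 else default

print(safe_ratio(3.0, 4.0))  # 0.75
print(safe_ratio(3.0, 0.0))  # 0.0
```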

  • Add optional pinned dependencies (e74f0cf)

  • Calculate featurewise statistics in demo (5204b48)

  • Improve documentation on custom SAE usage (f15fe53)

  • Merge pull request #53 from adamkarvonen/hide_absorption_stddev

hide stddev from default display for absorption (155afbc)

  • hide stddev from default display for absorption (d970f05)

  • Merge pull request #52 from adamkarvonen/update_scr_tpp

update scr_tpp_schema to show top 20 by default (f551e7b)

  • update scr_tpp_schema to show top 20 by default (59320e2)

  • Merge pull request #51 from adamkarvonen/update_schema_jsons

fix: Update eval output schema jsons (7b2021c)

  • Add computational requirements (9b621a9)

  • Improve graphing notebook, include matryoshka results in graphs (f2d1d98)

  • Merge pull request #50 from chanind/lint-and-type-check

chore: Adding formatting, linting and type checking (a0fb5e9)

  • adding README and Makefile with helpers (7452eca)

  • fixing linting and type-checking issues (e663e3a)

  • formatting with ruff (14dad45)

  • Check that unlearning data exists before running unlearning eval (294b25c)

  • Improve export notebook (e2b0b3c)

  • Improve graphing utils (661920d)

  • Fix spelling (8c0df93)

  • Add standard deviation for absorption / autointerp, store results per class for sparse probing / tpp for potential error bars (141aff7)

  • Use GPU probing in correct location (ec5efa8)

## v0.3.2 (2025-01-14)

### Fix

  • fix: use GPU for llm probing (ba0956e)

### Unknown

  • Don't hardcode the device for unlearning (a594ee6)

  • Update unlearning data path (443761d)

## v0.3.1 (2025-01-14)

### Fix

  • fix: pass device into core evals (e6651ea)

### Unknown

  • fold W_dec norm when loading SAE Lens SAEs (511d51a)

  • Change default sparse probing k values (271a9d4)

## v0.3.0 (2025-01-13)

### Feature

  • feat: Add a frac alive calculation to core (0399550)
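
A "frac alive" metric measures the fraction of SAE latents that fire (are nonzero) for at least one example in an evaluation batch. A stdlib sketch of the idea; the real implementation operates on torch tensors, and the names here are illustrative:

```python
def frac_alive(activations: list[list[float]]) -> float:
    """Fraction of latent dimensions that are nonzero for at least one example.

    `activations` is an (n_examples x n_latents) matrix of SAE activations.
    """
    n_latents = len(activations[0])
    alive = [any(row[j] != 0 for row in activations) for j in range(n_latents)]
    return sum(alive) / n_latents

# 2 of the 3 latents fire at least once across the batch.
acts = [[0.0, 1.2, 0.0], [0.0, 0.5, 0.3]]
print(frac_alive(acts))
```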

### Unknown

  • added absorption fraction metric (#48)

feat: added absorption fraction metric

  • Small fixes

  • remove unused FeatureAbsorptionCalculator._filter_prompts function


Co-authored-by: Demian Till <demian.till@cambridgeconsultants.com> (7545ee3)

  • Add a script for organizing and uploading results (4689129)

  • Calculate featurewise statistics by default (bca84ca)

## v0.2.0 (2025-01-09)

### Feature

  • feat: add misc core metrics (2c731f6)

### Unknown

  • Make sure grad is enabled for absorption tests (bd25ca0)

## v0.1.0 (2025-01-09)

### Feature

  • feat: EvalOutput and EvalConfig base classes to allow easy JSON schema export (537219a)

### Fix

  • fix: eval_result_unstructured should be optional (38e81b0)

  • fix: dump to json file correctly (5f1cf15)

### Unknown

  • git commit -m "fix: add missing init.py" (20b20f2)

  • Merge pull request #47 from chanind/packaging

feat: Setting up Python packaging and autodeploy with Semantic Release (e52a418)

  • Merge branch 'main' into packaging (9bc22a4)

  • Merge branch 'main' into packaging (bb10234)

  • Update SAE Bench demo to use new graphing functions (9bbfdc5)

  • switching to poetry and setting up CI (a9af271)

  • Add option to pass in arbitrary sae_class (e450661)

  • Mention dictionary_learning (c140e71)

  • Update graphing notebook to work with filenames (dc6f951)

  • deprecate graphing notebook (67118ee)

  • migrating to sae_bench base dir (bb8e145)

  • Use a smaller batch size for unlearning (3a099d2)

  • Reduce memory usage by only caching required activations (f026998)

  • Remove debugging check (8ea7162)

  • Add sanity checks before major run (0908b18)

  • Improve normalization check (16a3c0e)

  • Add normalization for batchtopk SAEs (6a031bd)

  • Add matroyshka loader (1078899)

  • Add pythia 160m (b219497)

  • simplify process of evaluating dictionary learning SAEs (c2dca52)

  • Add a script to run evals on dictionary learning SAEs (3f4139b)

  • Make the layer argument optional (e53675d)

  • Add batch_top_k, top_k, gated, and jump_relu implementations (9a7fce8)

  • Add a function to test the saes (864b4b3)

  • Update demo for new relu sae setup (5d04ce5)

  • Ensure loaded SAEs are on correct dtype and device (a5d6d62)

  • Create a base SAE class (8fcc9fe)

  • Add blog post link (2d47229)

  • cleanup README (0e724df)

  • Clean up graphing notebook (c08f3f5)

  • Graph results for all evals in demo notebook (29ac97b)

  • Clean up for release (1c9822c)

  • Include baseline pca in every graph. (a45afd2)

  • Clean up plot legends, support graphing subplots (7ade8b0)

  • Merge pull request #45 from adamkarvonen/update_jsonschemas

update jsonschemas (879c7ca)

  • update jsonschemas (a14d465)

  • Use notebook as default demo, mention in README (298796b)

  • Minor fixes to demo (05808c7)

  • Add missing batch size argument (877f2e7)

  • Fixes for changes to eval config formats (e0cb629)

  • Add an optional best of k graphing cell (081b59c)

  • Ignore any folder containing "eval_results" (12f8d66)

  • Add cell to add training tokens to config dictionaries (38173c9)

  • Also plot all sae bench checkpoints (93563e0)

  • Add eval links (2216f99)

  • rename core results to match convention (51e47fd)

  • Ignore autointerp with generations when downloading (aa20644)

  • Use != instead of > for L0 measurement (83504b7)
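
The `!=` vs `>` change for the L0 measurement (83504b7) matters because L0 counts all nonzero activations; comparing with `> 0` silently drops negative activations, which can occur for some SAE variants. An illustrative sketch, not the repo's actual code:

```python
def l0(activations: list[float]) -> int:
    """Number of nonzero activations (the L0 "norm")."""
    return sum(1 for a in activations if a != 0)

acts = [0.0, 0.7, -0.2, 0.0]
print(l0(acts))                           # 2: the negative activation is counted
print(sum(1 for a in acts if a > 0))      # 1: `> 0` would undercount
```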

  • Add utility cell for removing llm generations (67c9b03)

  • Add utility cell for splitting up files by release name (3cc51ea)

  • Add force rerun option to core, match sae loading to other evals (8676d5d)

  • Improve plotting of results (89e5567)

  • Consolidate SAE loading and output locations (293b385)

  • Plot generator for SAE Bench (c2cb78e)

  • Add utility notebook for adding sae configs (8508a01)

  • Improve custom SAE usage (e959f65)

  • Improve graphing (490cd2a)

  • Fix failing tests (ed88f65)

  • match core output filename with others (8ca0787)

  • Remove del sae flag (feaf1f8)

  • Add current status to repo (9c95af7)

  • Add sae config to output file (b2fbd6d)

  • Add a flag for k sparse probing batch size (6f2e38f)

  • Merge pull request #44 from adamkarvonen/absorption-tweaks-2

improving memory usage of k-sparse probing (6ae8235)

  • Merge pull request #43 from adamkarvonen/fake_branch

single line update (7984d50)

  • single line update (d9637e1)

  • improving memory usage of k-sparse probing (841842a)

  • Add documentation to demo notebook (2e170e1)

  • adapted graphing to np result filestructure (3629b90)

  • Improve reduced memory script (ecb9f46)

  • Script for evaluating 1M width SAEs (63a6783)

  • Use expandable segments to reduce memory usage (4f3967d)

  • Delete SAE at the correct location in the for loop (ff0beda)

  • Shell script for running 65k width SAEs on 24 GB GPUs (9b0bd9d)

  • Delete sae at end of loop to lower memory usage. Primarily required for 1M width SAEs (08f9755)

  • Add absorption (b2e89c9)

  • Add note on usage (07cbf3c)

  • Add shell scripts for running all evals (a832e09)

  • add 9b-it unlearning precomputed artifacts (93502c0)

  • Add example of running all evals to notebook (473081d)

  • Clean up filename (a067c5c)

  • Create a demo of using custom SAEs on SAE bench (49d5ecd)

  • Move warnings to main function, raise error if not instruct tuned (e798adf)

  • perform calculations with a running sum to avoid underflow (d842a1f)

  • Do probe attribution calculation in original dtype for memory savings (366dc4c)

  • Use api key file instead of command line argument (bb48a6c)

  • Add flags to reduce VRAM usage (322334a)

  • fix unlearning test (5039e5e)

  • add optional flag to reduce peak memory usage (735f988)

  • Ignore core model name flag for now (43ef711)

  • Don't try set random seed if it's none (d1d6f72)

  • Make eval configs consistent, require model names in all eval arguments. (d37e77c)

  • Add ability to pass in random seed and sae / llm batch size (d8f026b)

  • Describe how values are set within eval configs (365fb40)

  • Always ignore the bio forget corpus (3e6d36f)

  • Use util function to convert str to dtype (7281627)

  • update graphing scripts (ff38240)

  • Merge pull request #39 from adamkarvonen/add_9b

add gemma-2-9b default DTYPE and BATCH_SIZE (164b6f5)

  • also add for 9b-it (b93f3c9)

  • add gemma-2-9b (8030c03)

  • Update regexes and test data to match new SAE Bench SAEs (6da4692)

  • Update outdated reference, don't get api_key if not required (da9a2dc)

  • Add ability to pass in flag for computing featurewise statistics, default it to false (f6430af)

  • Move str_to_dtype() to general utils (8ab32f9)

  • Pass in a string dtype (f49d41c)

  • Merge pull request #35 from adamkarvonen/add_pca

Add pca (f4fbd0c)

  • Delete old sae bench data (55f9b6f)

  • Mention disk space, fix repo name (04a2b01)

  • mention WMDP access (26816e5)

  • Be consistent when setting default dtype (3ed82b3)

  • Rename baselines to custom_saes (067bb79)

  • Rename shift to SCR (bbbdfdc)

  • correctly save and load state dict (35d64c8)

  • Just use the global PCA mean (2317de9)

  • Increase test tolerance, remove cli validation as other evals aren't using it (395095b)

  • Match core eval config to others with dtype usage (9197e3f)

  • Also check for b_mag so we don't ignore gated SAEs bias (edd0de2)

  • consolidate save locations of artifacts and eval results (c9a18b0)

  • revert eval_id change for now (2cfbcc0)

  • Save fake encoder bias as a tensor of zeros (4ccbec6)

  • Ensure that sae_name is unique (cce38b6)

  • Change default results location (6987235)

  • Also compare residual stream as a baseline (d528f3b)

  • Don't require sae to have a b_enc (c979427)

  • Include model name in tokens filename (a255df0)

  • Check if file exists (f8c9ab2)

  • Fix regex usage demo (1c4117d)

  • remove outdated import (9dddb3e)

  • Simplify custom SAE usage demonstration (542c659)

  • Benchmark autointerp scores on mlp neurons and saes (4ba6cba)

  • Simplify code by storing dtype as a string (963c9c2)

  • Add option to set dtype for core/main.py (d649826)

  • Pass in the optional flag to save activations (d0b8091)

  • Don't check for is SAE to enable use with custom SAEs (36cfba8)

  • mention new run all script (916df28)

  • Script for running evals on all custom SAEs at once (b91c210)

  • Rename formatting utils to general utils (dedec93)

  • Clean up duplicated functions (28d2f2f)

  • Clean up old graphing code (d3e8e87)

  • Fix memory leak (65fa76a)

  • Make test file names consistent (70e2eaa)

  • Remove unused flag (4ed9602)

  • Improve GPU PCA training (01e6306)

  • Fix namespace error and expected result format error (776e5f4)

  • Enable usage of core with custom SAEs (e359d1a)

  • Add a function to fit the PCA using GPU and CUML (2632849)

  • Switch from nested dict to list of tuples for selected_saes (dbdfe19)

  • Make it easier to train pca saes (eb41438)

  • Format with ruff (e62a436)

  • Test identity SAE implementation (e95e055)

  • Add a PCA baseline (645a040)

  • Move unlearning tokenization function to general utils file, consolidate tokenization functions (98c4b5c)

  • Merge pull request #34 from adamkarvonen/fix-core-eval-precision

fixing excessively low precision (e0ddf06)

  • fixing excessively low precision (d1eea66)

  • Merge pull request #33 from adamkarvonen/add_baselines

Add baselines (20c2a40)

  • Update README (e5b2ba4)

  • Fix regexes (478b41c)

  • Rename selection notebook (1b24a46)

  • Remove usage of SAE patterns list (80ed74d)

  • Make sure batch is on the correct dtype (f8a1158)

  • Adapt auto interp to enable use with custom saes (7ea0e59)

  • Adapt absorption to match existing format (7e2ac58)

  • Enable easy usage of evals with custom SAEs (64d2e23)

  • Use sae.encode() for compatibility instead of sae.run_with_cache() (42cc9ce)

  • fix device errors (1725899)

  • format with ruff (737b788)

  • Set autointerp context size in eval config (c4bfa82)

  • Add autointerp progress bars (bcc14a9)

  • Use baseline SAEs on the sparse probing eval (d3e5e07)

  • Merge pull request #32 from adamkarvonen/core_evals_ignore_special

Added option to exclude special tokens from SAE reconstruction (186bdb4)

  • Added option to exclude special tokens from SAE reconstruction (54a55f7)

  • Example jumprelu implementation (0ca103e)

  • identity sae baseline (5f65ace)

  • Merge pull request #31 from adamkarvonen/activation_consolidation

Activation consolidation (3ddcceb)

  • Add graphing for pythia and autointerp (1b4cf2a)

  • Correctly index into sae_acts (8c938ad)

  • Adapt format to Neuronpedia requirements (9baec7c)

  • Update README.md (de3ce5c)

  • Rename for consistency (632f54d)

  • Add end to end autointerp test (360dfe0)

  • Remove college biology from datasets as too close to wmdp_bio (dcdbbc5)

  • Print a warning if there aren't enough alive latents (23fc5a5)

  • Include dataset info in filename (d37fc04)

  • Add functions to encode precomputed activations (bd742ee)

  • Eliminate usage of activation store (fa58764)

  • Adapt autointerp to new format (1de25f7)

  • prepend bos token (46e00a4)

  • Mask off BOS, EOS, and pad tokens (4e2b0d6)
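
Masking BOS, EOS, and pad tokens (4e2b0d6) means excluding those positions from metric computation, since their activations are not representative of the text. A stdlib sketch of the masking step; the token ids and helper name are hypothetical:

```python
def keep_token_mask(token_ids: list[int], special_ids: set[int]) -> list[bool]:
    """True for positions to keep, i.e. positions that are not BOS/EOS/pad."""
    return [tok not in special_ids for tok in token_ids]

# Hypothetical vocabulary: 2 = BOS, 1 = EOS, 0 = pad.
mask = keep_token_mask([2, 15, 42, 1, 0, 0], {0, 1, 2})
print(mask)  # [False, True, True, False, False, False]
```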

  • Collect the sparsity tensor for SAE autointerp (0dd3a91)

  • Format with ruff (2afc772)

  • Updated question ids running with one BOS token (df3b9d4)

  • Zero out SAE activations on BOS token (0d30360)

  • Only use one BOS token at beginning (ad97556)

  • Remove redundant with no_grad() (528959f)

  • Merge remote-tracking branch 'origin/main' into activation_consolidation (777e9d4)

  • Move the get_sparsity function to general utils folder, mask bos, pad, and eos tokens for unlearning (f679f0f)

  • Make it easier to use get_llm_activations() with other evals (1ed9a29)

  • Merge pull request #8 from callummcdougall/callum/autointerp

Autointerp eval (6f81495)

  • Merge branch 'main' into callum/autointerp (466a37d)

  • Improve graphing notebook for current output format (36fb3ba)

  • Apply nbstripout (ca27e41)

  • Notebook specifically for graphing and analyzing mdl results (114cefb)

  • Merge pull request #30 from adamkarvonen/mdl_fixes

Mdl fixes (65c3c98)

  • Add example data and add details to README. (21f3c83)

  • Use torch instead of t for consistency (903324f)

  • Move calculations to float32 to avoid dtype errors (099a94f)

  • Add descriptions to unlearning hyperparameters and descriptions of shift, tpp, and sparse probing evals. (6c141c1)

  • Merge pull request #28 from adamkarvonen/update_unlearning

Update unlearning output format (2bfd70b)

  • Update JSON schema filenames (b6ed053)

  • remove unused (810afe8)

  • updated schema file names (e4df309)

  • update name of output schema file (af58f0f)

  • Merge pull request #26 from adamkarvonen/core_tests

Update ui_default_display, titles for display (371b80d)

  • Update titles (4885187)

  • default display (8f811b8)

  • Merge pull request #25 from adamkarvonen/core_tests

added tests for core eval output (92bc76a)

  • added tests for core eval output (591ce3f)

  • Add end to end unlearning test (79435ab)

  • clean up activations always defaults to false (5d545d1)

  • Further general cleanup of mdl eval (51e9b60)

  • Merge pull request #24 from adamkarvonen/core_update

New Core output format, plus converter (b01be8a)

  • New Core output format, plus converter (192e92b)

  • Save sae results per sae (7045ad2)

  • Fix variable name bug (d5ec2d2)

  • MDL is running (c486454)

  • Format with ruff (04b14b7)

  • Merge pull request #6 from koayon/mdl-eval

Implement MDL eval (829de0c)

  • Merge branch 'main' into mdl-eval (ad18568)

  • Generate bfloat16 question_ids and commit them to the proper location, remove old ones (d48c68b)

  • Add example unlearning output (9866453)

  • Merge pull request #23 from adamkarvonen/unlearning_adapt

Unlearning adapt (3c48cdb)

  • Allow plotting of gemma SAEs (c813b39)

  • Adapt unlearning eval to new format (26d1675)

  • pass artifact folder in to unlearning functions (87508a6)

  • Add a sparsity penalty when training the SHIFT / TPP linear probes (cc73c6f)

  • Merge pull request #21 from adamkarvonen/shift_sparse_probing_descriptions

Shift sparse probing descriptions (3e9555a)

  • Remove unnecessary test keys, add note to README (12f324d)

  • Merge pull request #22 from adamkarvonen/fix/handle_gated_in_core

handle case where gated SAEs don't have b_enc (586597b)

  • Finish rename of the spurious_corr variable (b28aab6)

  • handle case where gated SAEs don't have b_enc (355aaf4)

  • update doc about how to update json schemas files. add json schema files. (95fda67)

  • Update from uncategorized to shift_metrics and tpp_metrics (43bf1f4)

  • Improve titles and descriptions for sparse probing (177be38)

  • Improve descriptions, titles, and variable names in SHIFT and TPP (29e1ecc)

  • Merge pull request #20 from adamkarvonen/make_unstructured_optional

fix: eval_result_unstructured should be optional (76d72a6)

  • Merge pull request #19 from adamkarvonen/core_eval_incremental_saving

Core eval incremental saving (7ddd55a)

  • added error handling and exponential backoff (92abbbe)

  • added code to produce intermediate json output between SAEs (b15a2e2)

  • fix device bug, resolve test utils conflicts (9b2e909)

  • Merge pull request #18 from adamkarvonen/set_sparse_probing_default_display

set k = 1, 2, 5 default display = true for sparse probing (db93af6)

  • set k = 1, 2, 5 default display = true for sparse probing (bf9b5ac)

  • Merge pull request #17 from adamkarvonen/add_unstructured_eval_output

Feature: Support unstructured eval output (3b17927)

  • Merge pull request #16 from adamkarvonen/basic-evals

Added core evals to repo (c55f48f)

  • Support unstructured eval output (adf028a)

  • Added core evals to repo (9b1dd45)

  • Merge pull request #15 from adamkarvonen/json_schema_absorption

Use Pydantic for eval configs and outputs for annotations and portability (e75c8b5)

  • update shift/tpp and sparse probing to evaloutput format (2dbb6f8)

  • Merge remote-tracking branch 'origin/main' into json_schema_absorption (eb8c660)

  • Add pytorch cuda flag due to OOM error message (c8e74f4)

  • confirm shift_and_tpp to new output format (153c713)

  • Merge remote-tracking branch 'origin/main' into json_schema_absorption (b337d5f)

  • test pre-commit hook (648046f)

  • produce the JSON schema files and add as a pre-commit hook (6683e17)

  • Add example regexes for gemma 2 2b (aea66aa)

  • Merge pull request #14 from adamkarvonen/shift_sparse_probing_updates

Shift sparse probing updates (14b5025)

  • Add example usage of gemma-scope and gemma SAEs (f2dcacf)

  • Improve arg parsing and probe file name (ec8cd87)

  • Mention other use for GPU probe training (929cdc0)

  • Add early stopping patience to reduce variance (5285563)

  • Add note on random seed being overwritten by argparse (d5a215a)

  • Separate save areas for tpp and shift (c860deb)

  • Also ignore artifacts and test results (1a81702)

  • Add shift and tpp to new format (d7e4b8b)

  • Improve assert error message if keys don't match (c514f49)

  • force_rerun now reruns even if a results file exists for a given sae (1ff0b16)

  • Make shift and tpp tests compatible with old results (ff2f46d)

  • Make sparse probing test backwards compatible with old results (782a080)

  • fix absorption test (d62b752)

  • Create a new graphing notebook for regex based selection (65ef605)

  • Improve artifacts and results storage locations, add a utility to select saes using multiple regex patterns (1cdfdb7)

  • No longer aggregate over saes in a dict (0e5bccb)

  • Rename old graphing file (df72d30)

  • fix ctx len bug, handle dead features better (63c2561)

  • don't commit artifact file (2c5691a)

  • Add openai and tabulate to requirements.txt (f626447)

  • Begin shift / tpp adaptation (ab1f062)

  • No longer average over multiple saes (aaf06eb)

  • Add an optional list of regexes (41b86a4)

  • By default remove the bos token (4877424)

  • Match new sae bench format (5d484e3)

  • Add note on output formats (56c637f)

  • Add notes on custom sae usage (30d4f16)

  • Add a utility function to plot multiple results at once (a89c86e)

  • Ignore images and results folders (e07e65f)

  • Merge branch 'main' into mdl-eval (922fb14)

  • Update mdl_eval (bdefc02)

  • Merge pull request #12 from jbloomAus/demo-format-and-command-changes-absorption

Demo of Changes to enable easy running of evals at scale (using absorption) (a9603b8)

  • Merge branch 'main' into demo-format-and-command-changes-absorption (9a9d4b1)

  • delete old template (0092a1f)

  • add re-usable testing utils for the config, cli and output format. (03aee86)

  • delete old template (8d66a49)

  • Merge pull request #13 from adamkarvonen/minor_shift_improvements

Minor shift improvements (1b318d5)

  • Notebook used to test different datasets (d4d4fb5)

  • update stategy for running absorption via CLI (13f90d0)

  • Comment out outdated tests (f1e2f9e)

  • Add runtime estimates to READMEs (32da7aa)

  • Rename to match other readmes (8dc952f)

  • Reduce the default amount of n values for faster runtime (74fd9a1)

  • Lower peak memory usage to fit on a 3090 (1fecf15)

  • Skip first 150 chars per Neurons in a Haystack (46d9510)

  • Merge pull request #11 from adamkarvonen/add_datasets

Add additional sparse probing datasets (0512456)

  • Share dataset creation code between tpp and sparse probing evals (30f60b6)

  • Update scr and tpp tests (8561e91)

  • Add an optional keys to compare list, to only compare those values (d60388b)

  • Add ag_news and europarl datasets (82de70f)

  • Add a single shift scr metric key (b65f969)

  • Use new sparse probing dataset names (7b32c83)

  • Use more probe epochs, update to use new dataset names (b49db06)

  • Add several new sparse probing datasets (b4f5400)

  • Add dataset functions for amazon sentiment and github code (aa1a478)

  • Use full huggingface dataset names (d2d4001)

  • Merge pull request #10 from curt-tigges/main

Initial RAVEL code (fc6a59b)

  • Merge branch 'main' into main (d132ec3)

  • Change default unlearning hyperparameters (d4c1949)

  • Do further analysis of unlearning hyperparameters (8f6262c)

  • Add multiple subsets of existing datasets (ae36e81)

  • Retry loading dataset due to intermittent errors (dc38fd0)

  • Use stop at layer for faster inference (e752a9c)

  • Merge pull request #9 from adamkarvonen/unlearning_cleanup

Unlearning cleanup (209e526)

  • fix topk error (e39a9ab)

  • add sae encode function (b247f91)

  • Get known question ids if they don't exist (f3516f4)

  • Remove unused functions (ac3da4d)

  • discard unused variable (aea5531)

  • Get results for all retain thresholds (08eec18)

  • add regex based sae selection strategy (57e9be0)

  • Updated notebook (ae40301)

  • Save unlearning score in final output (f90b114)

  • Add file to get correct answers for a model (0361c07)

  • Fix missing filenames (6aad3ca)

  • Move hyperparameters to eval config (5d2b9d0)

  • restructure results json, store probe results (9ec59a8)

  • Move llm and sae to llm_dtype (56e8e43)

  • Fix utils import (79321f8)

  • Apply ruff formatter (1423c33)

  • Make sure we don't commit the forget dataset (63bc153)

  • Apply nbstripout (2ff3e72)

  • Merge pull request #7 from yeutong/unlearning

implement unlearning eval (42de6df)

  • Merge branch 'main' into unlearning (b2f6d68)

  • add version control utils (b516958)

  • first commit (4b23575)

  • Add pytorch flag due to CUDA OOM message (c57eef7)

  • Move sae to llm dtype (c510d95)

  • Add a README and test for absorption (a8f4190)

  • Add example main function to absorption eval (2f1c551)

  • Move sae to llm dtype (26ed7a0)

  • Merge pull request #3 from chanind/absorption

Feature Absorption Eval (8d80be6)

  • Added initial demo notebook (86dfd95)

  • Added initial RAVEL files for dataset generation (96e963f)

  • renaming dict keys (ff81c53)

  • Merge remote-tracking branch 'upstream/main' (daab3e2)

  • add analysis (5eb7dfa)

  • success (b67c97b)

  • add gemma-2-2b-it (5f502d7)

  • revert changes to template.ipynb (aac04e0)

  • fix detail (e6e0985)

  • fixing batching error in absorption calculator (8a425cd)

  • Merge branch 'main' into absorption (822ad11)

  • Merge pull request #5 from koayon/rename-utils

Rename utils to avoid name conflict (eb6cc7b)

  • update notebook imports (6b4ca4a)

  • notebook reversion (1391c79)

  • indentation (111a68b)

  • Utils renaming (cbd6b99)

  • rename utils to avoid name conflict (e2a380a)

  • Scaffold mdl eval (orange) (a6b3406)

  • arrange structure (6392064)

  • replace model and sae loading (91ce2fd)

  • moved all code (7a714b4)

  • Merge pull request #4 from adamkarvonen/shift_eval

Shift eval (22a2a72)

  • Update README with determinism (aed96e8)

  • Fixed shift and tpp end to end tests (2c2947a)

  • Merge branch 'main' into absorption (5db6ecb)

  • reverting to original sparse probing main.py (8d6779f)

  • fixing dtypes (280f51a)

  • Add README for shift and tpp (bcf3934)

  • Add end to end test for shift and tpp (4ac13ef)

  • Move SAE to model dtype, add option to set column1_vals_list (6ae4065)

  • adding absorption calculation code (704eb00)

  • Initial working SHIFT / TPP evals (fab86d4)

  • Add SHIFT paired classes (900a04b)

  • Modify probe training for usage with SHIFT / TPP (039fd29)

  • Pin dataset name for sparse probing test (26d85ed)

  • Correct shape annotation (dc1f3d7)

  • adding in k-sparse probing experiment code (0ab6d2c)

  • Merge pull request #2 from adamkarvonen/sparse_probing_add_datasets

Sparse probing add datasets (19b4c4a)

  • Check for columns with missing second group (1895318)

  • Run sparse probing eval on multiple datasets and average results (6e158b3)

  • Add function to average results from multiple runs (47af366)

  • Remove html files (9c42b2f)

  • WIP: absorption (6348fb2)

  • Update READMEs (e6f1e3b)

  • Create end to end test for sparse probing repo (1eb63e9)

  • Rename file so it isn't ran as a test (127924a)

  • Fix main.py imports, set seeds inside function (9ec51f2)

  • Deprecate temporary fix for new sae_lens version (13e5da0)

  • Don't pin to specific versions (146c1fc)

  • Remove old requirements.txt (2ea8bac)

  • Merge pull request #1 from adamkarvonen/restructure

Restructure (38e2721)

  • added restructure (c3952bd)

  • restructure (f328f7f)

  • created branch (374561a)

  • Update to use new eval results folder (69657d8)

  • Make selected SAEs input explicit (fdc0afd)

  • mention SAE Lens tutorial (0121f32)

  • Add pythia example results (e0a14be)

  • Fix titles (b7203b0)

  • Demonstrate loading of Pythia and Gemma SAEs (d158304)

  • Add new examples Gemma results (cc1a27c)

  • Include checkpoints and not checkpoints if include_checkpoints is True (f5b1a71)

  • Temp fix for SAE Bench TopK SAEs (8b7a6ec)

  • Use sklearn by default, except for training on all SAE latents (2b1e2b6)

  • Test file for experimenting with probe training (4cd9cff)

  • Add gemma-scope eval_results data (5f2bf51)

  • Use a dict of sae_release: list[sae_names] for the sparse probing eval (70100d8)

  • Define llm dtype in activation_collection.py (0f29194)

  • training plot scaled by steps (9b49ec4)

  • bugfix sae_bench prefix (da9f95f)

  • Merge branch 'main' of https://github.com/adamkarvonen/SAE_Bench_Template into main (54a6156)

  • add plot over training steps (065825e)

  • Also calculate k-sparse probing for LLM activations (5e816b0)

  • Move default results into sparse_probing/ from sparse_probing/src/ (2283cc5)

  • By default, perform k-sparse probing using sklearn instead of hand-rolled implementation (57c4216)

  • Optional L1 penalty (078fb32)

  • Separate logic for selecting topk features using mean diff (8700aa7)

  • Set dtype based on model (baa004e)

  • Add type annotations so config parameters are saved (337c21d)

  • add virtualenv instructions (68edafe)

  • added interactive 3var plot and renamed files for clarity (5afafb7)

  • adapted requirements (aeadbb4)

  • Merge branch 'main' of https://github.com/adamkarvonen/SAE_Bench_Template into main (5ee11cf)

  • debugging correlation plots (cee970f)

  • moved formatting utils to external file (8fcc5da)

  • Update README.md (98d5f57)

  • Update README.md (2e2e9f8)

  • clarified README.md (205b2fb)

  • clarify README.md (540123b)

  • added explanation to template (9df0fcb)

  • Improve README (68927c9)

  • Updated pythia and gemma results (57ecbea)

  • Improve graphing notebook (66d0b66)

  • Apply nbstripout (d955146)

  • Walkthrough notebook of dictionary format (5246390)

  • Add to .gitignore (c0b1b84)

  • Utility notebook to compare multiple run results (27352d3)

  • Improve SAE naming, use Gemma by default (c03e540)

  • Add READMEs (b1ea1b3)

  • Make determinstic, improve sae key naming (6a86bdb)

  • Make sure to type check shapes (e67a8bc)

  • Archive development notebooks (7ecd360)

  • Fix the recording name of saes (08ceced)

  • Add missing batch indexing (aeda80e)

  • Refactor batch processing to handle all sae_batch_size scenarios efficiently (c15d61d)

  • Fix reduced precision warning (36c9a8b)

  • correctly save results (f5dfabb)

  • example results (8b1ec0c)

  • Data on existing SAEs (7761e2b)

  • Cleanup (0f631d7)

  • Dev notebook (1412347)

  • sparse probing eval (fc789e5)

  • Create bias in bios dataset (0ff4cf1)

  • Apply nbstripout (b993ecc)

  • Beginning of plotting notebook (141fbd2)

  • initial commit (319300f)