CodeSCM: Causal Analysis for Multi-Modal Code Generation

This repository contains the official implementation of our paper: CodeSCM: Causal Analysis for Multi-Modal Code Generation, accepted at the NAACL 2025 main conference.

Authors: Mukur Gupta*, Noopur Bhatt*, and Suman Jana
(* denotes equal contribution)

Overview

Abstract:
In this paper, we propose CodeSCM, a Structural Causal Model (SCM) for analyzing multi-modal code generation using large language models (LLMs). By applying interventions to CodeSCM, we measure the causal effects of different prompt modalities, such as natural language, code, and input-output examples, on the model. CodeSCM introduces latent mediator variables to separate the code and natural language semantics of a multi-modal prompt. Using the principles of Causal Mediation Analysis on these mediators, we quantify direct effects representing the model’s spurious leanings. We find that, in addition to natural language instructions, input-output examples significantly influence model generation, and total causal effects evaluations from CodeSCM also reveal the memorization of code generation benchmarks.


Main Figure

Below is the main conceptual figure illustrating CodeSCM and the different causal pathways it models. It highlights how our interventions capture direct effects, total effects, and spurious direct effects of different prompt modalities on final code generation outcomes.

CodeSCM Diagram


Repository Organization

codeSCM-naacl25-main/
├── README.md                   # This file
├── requirements.txt            # Python package dependencies
├── evaluate_deltas.py          # Script to evaluate differences between model outputs and references
├── export_deltas.py            # Utility to export code diffs and parse them for analysis
├── llm.py                      # Abstractions for interacting with various LLMs (OpenAI, etc.)
├── models.py                   # Configurations and setup for different LLMs or code generation models
├── adapters/
│   ├── general_adapter.py      # Generic prompt adapter for code generation tasks
│   ├── humaneval_adapter.py    # Adapter for HumanEval dataset prompts
│   ├── manual_prompts_comp.py  # Customizable manual prompts for specialized experiments
│   └── mbpp_adapter.py         # Adapter for MBPP dataset prompts
├── codereval/
│   ├── DEBUG_write_to_file.py           # Debug tool for logging intermediate code outputs
│   ├── convert_to_raw_input.py          # Converts code generation results to raw input format for processing
│   ├── extract_java_deltas_wcoder-1.py  # Extracts Java-specific deltas to measure code differences
│   ├── general_adapter.py               # Shared logic for code evaluation tasks
│   ├── run_coder_eval-java_llama.py     # Run code generation eval on Java tasks using LLaMA
│   ├── run_coder_eval-java_openai.ipynb # Jupyter notebook for Java tasks with OpenAI models
│   ├── run_coder_eval-python_llama.py   # Run code generation eval on Python tasks using LLaMA
│   └── run_coder_eval-python_openai.ipynb   # Jupyter notebook for Python tasks with OpenAI models
└── merge_deltas/
    ├── add_deltas_col.py        # Adds code delta columns to merged JSON results
    ├── clean_up_merged_jsons.py # Cleans and formats merged JSON results
    ├── merge_jsons_humaneval.py # Merges partial results for HumanEval dataset
    └── merge_jsons_mbpp.py      # Merges partial results for MBPP dataset

Key Components

  1. evaluate_deltas.py

    • Evaluates model outputs against ground truth or reference solutions.
    • Summarizes how often the generated code diverges, quantifying differences to assess model performance.
  2. export_deltas.py

    • Exports code diffs to a structured format (e.g., JSON) for subsequent analysis.
    • Useful for data-driven debugging of spurious model outputs; an illustrative diff-based comparison is sketched after this list.
  3. llm.py and models.py

    • Provide abstractions to configure and interact with LLMs.
    • Define model parameters, session handling, and inference APIs.
  4. adapters/

    • general_adapter.py: Creates general-purpose prompts.
    • humaneval_adapter.py: Adapts prompts for the HumanEval benchmark.
    • mbpp_adapter.py: Prepares MBPP (Mostly Basic Programming Problems) dataset prompts.
    • manual_prompts_comp.py: Manages custom prompt sets for fine-grained experiments.
  5. codereval/

    • Scripts and notebooks for running code generation and evaluation.
    • *_openai.ipynb notebooks demonstrate how to run OpenAI’s models on Python/Java tasks.
    • *_llama.py scripts show how to integrate with a LLaMA-based pipeline.
    • Intermediate utilities (like convert_to_raw_input.py) help unify output formats for subsequent analysis.
  6. merge_deltas/

    • Tools to merge partial results from code generation tasks.
    • Facilitates combining experiment outputs across multiple runs or different dataset splits for final aggregate analysis.
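
As a purely illustrative example of the kind of comparison evaluate_deltas.py and export_deltas.py perform, the sketch below diffs a generated completion against a reference solution using Python's standard difflib. The inputs and the similarity ratio are placeholders for illustration; the actual delta metrics are defined in those scripts.

    # Illustrative only: the real delta metrics are defined in evaluate_deltas.py.
    import difflib

    reference = "def add(a, b):\n    return a + b\n"   # hypothetical reference solution
    generated = "def add(a, b):\n    return b + a\n"   # hypothetical model output

    # Line-level unified diff between the reference and the generated code.
    diff = difflib.unified_diff(
        reference.splitlines(keepends=True),
        generated.splitlines(keepends=True),
        fromfile="reference.py",
        tofile="generated.py",
    )
    print("".join(diff))

    # A character-level similarity ratio as one possible summary "delta" statistic.
    ratio = difflib.SequenceMatcher(None, reference, generated).ratio()
    print(f"similarity ratio: {ratio:.2f}")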

Installation

  1. Clone the repository

    git clone https://github.com/YourUsername/codeSCM-naacl25-main.git
    cd codeSCM-naacl25-main
  2. Install Dependencies
    We recommend using a virtual environment (e.g., venv or conda). Then run:

    pip install -r requirements.txt
  3. Set Up API Keys (If Needed)

    • For OpenAI-based experiments, set your OPENAI_API_KEY environment variable:
      export OPENAI_API_KEY="your_openai_api_key_here"
    • Update llm.py or any relevant config if you have specialized environment settings for closed-source models; a quick key-presence check is sketched below.
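
Before launching any OpenAI-backed script or notebook, an optional sanity check such as the one below can confirm the key is visible to Python. This snippet only inspects the environment variable; the actual client configuration used by this repository is defined in llm.py.

    # Minimal sanity check that the key is visible to Python (illustrative only).
    import os

    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    print(f"Found OPENAI_API_KEY ending in ...{key[-4:]}")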

Reproducing Experiments

Below are generic steps to replicate the causal analysis experiments showcased in our paper. Adapt paths, model options, and dataset references to your local environment.

  1. Prepare Datasets

    • Download or place your code generation benchmark tasks (e.g., HumanEval, MBPP) into the appropriate location.
    • The adapters in adapters/ expect data to be in a standardized format (details in each adapter’s docstrings).
  2. Generate Model Outputs

    • Choose a script or notebook in codereval/.
    • For example, to run Python tasks on LLaMA:
      python codereval/run_coder_eval-python_llama.py \
             --model_path path/to/llama/weights \
             --dataset_path path/to/mbpp/data.json \
             --output_path results_llama_mbpp.json
    • Or open run_coder_eval-python_openai.ipynb if you prefer a notebook workflow with the OpenAI API.
  3. Merge and Clean Up Results

    • After each experiment, merge partial output JSON files into a combined result:
      python merge_deltas/merge_jsons_mbpp.py \
             --input_dir path/to/partial_results \
             --output_file merged_results_mbpp.json
    • Use clean_up_merged_jsons.py if needed to unify or filter final outputs.
  4. Evaluate Deltas

    • Once you have final JSON files with ground truth references and model outputs, run:
      python evaluate_deltas.py \
             --merged_json merged_results_mbpp.json \
             --metrics_file metrics_summary.csv
    • This will produce a summary of correctness, differences, or any custom metrics you define for code accuracy.
  5. Export Data for Analysis

    • Use export_deltas.py to parse diffs and format them for deeper analytics (spreadsheets, statistical tests, etc.).
    • The resulting CSV or JSON can be plotted or fed into additional scripts to measure causal effects as discussed in the paper; a small loading sketch follows this list.
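
As one possible follow-up to step 5, the sketch below loads a merged result file with pandas and computes a simple divergence rate. The "delta" column name is a hypothetical placeholder; substitute whichever fields evaluate_deltas.py and export_deltas.py actually emit.

    # Hypothetical field names; adjust to whatever the export scripts actually produce.
    import pandas as pd

    # Works if the merged file is a JSON list of records or a column-oriented dict;
    # use pd.read_csv(...) instead for CSV exports.
    df = pd.read_json("merged_results_mbpp.json")
    print(df.columns.tolist())  # inspect the fields that were actually exported

    # Example aggregate, assuming a boolean-like "delta" column marks divergence.
    if "delta" in df.columns:
        print("divergence rate:", df["delta"].astype(bool).mean())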

CodeSCM Causal Mediation Workflows

In the paper, we introduce CodeSCM to measure how each prompt modality (e.g., code, natural language, input-output examples) influences code generation. The main steps to replicate the causal interventions are:

  1. Create Interventions

    • Modify prompts to selectively remove or alter a specific modality.
    • Use either the existing adapters with flags (e.g., --no_nl) or define the modified prompts manually via manual_prompts_comp.py.
  2. Run Generation

    • Generate outputs for each intervention scenario (e.g., original prompt vs. prompt without input-output examples).
    • Keep track of seeds or random states to maintain comparability across runs.
  3. Compare Direct & Total Effects

    • Use the evaluate_deltas.py or a custom script to compare outputs.
    • Quantify how significantly each intervention changes the final code.
    • This aligns with the direct effect definitions introduced in the paper (spurious direct effect, etc.); a toy numerical sketch follows this list.
  4. Perform Statistical Analysis

    • For advanced mediation analysis, you may integrate the final output metrics into an external tool (R, Python stats libraries, etc.).
    • Evaluate significance and confidence intervals for direct, indirect, and total effects.
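
To make step 3 concrete, here is a minimal toy sketch that turns per-condition pass rates into effect estimates. The condition names and the simple difference-of-pass-rates formulation are illustrative assumptions, not the formal effect definitions from the paper.

    # Toy effect estimates from per-condition pass rates. The conditions and the
    # difference-of-pass-rates formulation are assumptions for illustration; see
    # the paper for the formal definitions of total and direct effects.
    conditions = {
        "original":       [1, 1, 0, 1, 1, 0, 1, 1],  # 1 = generated code passes tests
        "no_io_examples": [1, 0, 0, 1, 0, 0, 1, 1],  # intervention: drop I/O examples
        "no_nl":          [0, 1, 0, 0, 1, 0, 0, 1],  # intervention: drop NL instruction
    }

    def pass_rate(outcomes):
        return sum(outcomes) / len(outcomes)

    base = pass_rate(conditions["original"])
    for name, outcomes in conditions.items():
        if name == "original":
            continue
        # Effect of removing the modality, estimated as the change in pass rate.
        print(f"{name}: pass_rate={pass_rate(outcomes):.2f}, "
              f"effect_vs_original={base - pass_rate(outcomes):+.2f}")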

Contact

If you have any questions or run into difficulties, please email Noopur Bhatt at noopur.bhatt@columbia.edu.
