add RAG for better PR feedback attempt2 #62

Merged
merged 7 commits into from
Dec 28, 2024

Conversation

amiicao
Collaborator

@amiicao amiicao commented Dec 3, 2024

closes #55

This PR improves the PR feedback functionality. Previously, only the PR title, description, and code diff were used for review.
Now, relevant function code from the codebase is also considered, alongside the previously used data, via RAG and a vectorDB.

Workflow:

  • All functions from the code base are stored in a vectorDB.
  • A JSON file, dependencies.json, is generated to store the relations between file names and the imported internal functions they use. The data format is a Dict :
  • When a PR is created, the changed files are used as keys in dependencies.json to extract the names of all functions the changed code depends on.
  • The extracted function names are queried in the vectorDB, which returns the relevant function information (function name, file path, code) one level above the modified code files.
  • This information is passed to an LLM as code base context to augment the prompt.
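
The lookup steps in the workflow above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the shape of dependencies.json (file path mapped to a list of function names), the in-memory `FUNCTION_STORE` standing in for the vectorDB, and the helper names `functions_for_changed_files` and `build_context` are all assumptions made for the example.

```python
import json

# ASSUMPTION: hypothetical contents of dependencies.json, mapping a file path
# to the names of the internal functions it imports. The real schema is not
# shown in this PR.
DEPENDENCIES_JSON = """
{
  "src/webhook_handler.py": ["embed_code_base", "query_by_function_names"],
  "src/vector_db.py": []
}
"""

# ASSUMPTION: in-memory stand-in for the vectorDB (function name -> metadata).
FUNCTION_STORE = {
    "embed_code_base": {
        "path": "src/vector_db.py",
        "code": "def embed_code_base(repo_id, root_dir): ...",
    },
    "query_by_function_names": {
        "path": "src/vector_db.py",
        "code": "def query_by_function_names(function_paths, repo_id): ...",
    },
}


def functions_for_changed_files(dependencies, changed_files):
    """Collect the function names the changed files depend on, without duplicates."""
    names = []
    for path in changed_files:
        for name in dependencies.get(path, []):
            if name not in names:
                names.append(name)
    return names


def build_context(names, store):
    """Format the stored records for the given names as prompt context."""
    sections = []
    for name in names:
        record = store.get(name)
        if record is None:
            continue  # function not indexed; skip rather than fail
        sections.append(f"# {name} ({record['path']})\n{record['code']}")
    return "\n\n".join(sections)


dependencies = json.loads(DEPENDENCIES_JSON)
names = functions_for_changed_files(dependencies, ["src/webhook_handler.py"])
context = build_context(names, FUNCTION_STORE)
prompt = f"Code base context:\n{context}\n\nReview the following diff:\n<diff goes here>"
```

A file with no entry in the mapping simply contributes no context, so the prompt falls back to the title, description, and diff alone.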


These changes belong to a larger project that uses Flask for web development and interacts with the GitHub API. An analysis of the changes follows:

1. New function in vector_db.py:

A new function called embed_code_base has been added, which embeds the main branch code into ChromaDB.

def embed_code_base(repo_id, root_dir):
    # Clone repo if needed and clone to specific branch (main)
    temp_dir = clone_repo_branch(installation_id, repo_full_name, "main")
    
    collection = get_collection_for_repo_branch(repo_id)
    embedding = model.encode(f"{func['source_code']}").tolist()
    add_to_chromaDB(collection, ids=[repo_id], embeddings=[embedding])

This function clones the repository to a specific branch (main) and then embeds its code into ChromaDB.

2. Changes in webhook_handler.py:

The handle_pull_requests function has been updated to include two new parameters: repo_id and changed_files. The function now:

  • Retrieves the repository ID from the GitHub API.
  • Clones the repository to a specific branch (main) if needed.
  • Embeds the main branch code into ChromaDB using the new embed_code_base function.

def handle_pull_requests(data, installation_id):
    # ...
    
    embed_code_base(
        repo_id, f"{ROOT_DIR}/src"
    )  # TODO: 1. add embeddings when a repository is added, ensure root dir is from the repo of repo_id and main branch code is embedded
    
    # ...

3. Updated function in vector_db.py:

The add_issues_to_chroma function has been updated to take an additional parameter: file_extensions.

def add_issues_to_chroma(collection, func, file_extensions):
    # ...
    
    if func["function_path"] is not None:
        func["function_path"] = _format_function_path(
            func["function_path"], file_extensions
        )
    # ...

This function now formats the function_path field of func using a new _format_function_path function, which takes file_extensions as an additional argument.

Overall, the updates focus on three things: embedding repository code into ChromaDB, cloning the repository's main branch, and formatting function file paths.

@amiicao amiicao changed the title add RAG for better PR feedback add RAG for better PR feedback attempt2 Dec 3, 2024

This code snippet is part of a larger application that handles webhooks from GitHub. It specifically deals with pull requests and uses a vector database (likely ChromaDB) to store and retrieve information about the functions in the codebase.

Key Components:

  1. Vector Database Functions:

    • embed_code_base(repo_id, root_dir): This function is responsible for embedding the entire codebase into the vector database. It reads the source code of each function from the specified directory, creates an embedding using a pre-trained model, and adds it to the collection in the database.
    • query_by_function_names(function_paths, repo_id): This function retrieves the full code and metadata for a list of specific function paths from the vector database.
  2. Webhook Handling:

    • The handle_pull_requests function processes pull request events from GitHub webhooks. It checks if the action is "opened" or "edited", clones the repository (if necessary), embeds the codebase, and then handles the pull request by calling handle_new_pull_request.
  3. Embedding Code Base:

    • The embed_code_base function iterates through each file in the specified directory, reads the source code of functions within those files, creates embeddings for them, and stores these embeddings in the vector database.
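
The per-function extraction step described above could look roughly like the sketch below. It only uses the standard library `ast` module and is an illustration, not the PR's implementation: the helper name `extract_functions` and the `function_name` key are assumptions, while `function_path` and `source_code` mirror keys that appear in the excerpts in this thread. The embedding call itself is left out; each record's `source_code` is what would be encoded and added to the collection.

```python
import ast


def extract_functions(source, file_path):
    """Return one record per function definition found in `source`."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append(
                {
                    "function_name": node.name,  # ASSUMPTION: hypothetical key
                    "function_path": file_path,
                    # Exact source text of the function, ready to be embedded.
                    "source_code": ast.get_source_segment(source, node),
                }
            )
    return records


# Small in-memory sample so the sketch runs without touching the file system.
SAMPLE = '''
def add(a, b):
    return a + b


def greet(name):
    return f"hello {name}"
'''

records = extract_functions(SAMPLE, "src/sample.py")
```

Because `ast.walk` also visits nested nodes, methods and inner functions are picked up as well; whether that is desirable depends on how the collection is queried.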

Example Usage:

  • When a pull request is opened or edited, the webhook handler will clone the repository (if not already done), embed the entire codebase into the vector database, and then handle the pull request by processing the changes and possibly creating new issues.
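
The action check described above can be sketched as a small dispatcher. This is a hedged illustration only: `handle_pull_request_event`, `should_process`, and the `embed` and `review` callbacks are hypothetical names, not functions from this repository. The payload keys used (`action`, `repository.id`, `pull_request`) are standard fields of GitHub's pull_request webhook event.

```python
HANDLED_ACTIONS = {"opened", "edited"}


def should_process(payload):
    """Return True when a pull_request webhook payload has a handled action."""
    return payload.get("action") in HANDLED_ACTIONS


def handle_pull_request_event(payload, embed, review):
    """Run embed, then review, for handled actions; report whether anything ran."""
    if not should_process(payload):
        return False
    repo_id = payload["repository"]["id"]
    embed(repo_id)  # hypothetical callback: refresh the code base embeddings
    review(payload["pull_request"])  # hypothetical callback: produce feedback
    return True


# Usage with recording callbacks in place of the real embedding and review steps.
calls = []
opened = {"action": "opened", "repository": {"id": 1}, "pull_request": {"number": 62}}
closed = {"action": "closed", "repository": {"id": 1}, "pull_request": {"number": 62}}

ran_opened = handle_pull_request_event(
    opened, lambda rid: calls.append(("embed", rid)), lambda pr: calls.append(("review", pr["number"]))
)
ran_closed = handle_pull_request_event(
    closed, lambda rid: calls.append(("embed", rid)), lambda pr: calls.append(("review", pr["number"]))
)
```

Passing the embedding and review steps in as callbacks keeps the dispatcher testable without cloning a repository or calling an LLM.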

Todos:

  • The code includes placeholders for actions that need to be implemented, such as adding embeddings when a repository is added and ensuring embeddings are updated after a pull request.
  • It also mentions cloning a repository branch temporarily, which could be optimized or removed if not needed.

This approach lets the application store and retrieve function-level information about the codebase, which can then be supplied as context when reviewing a pull request.

@dannyl1u dannyl1u merged commit 3e45602 into main Dec 28, 2024
7 checks passed
Development

Successfully merging this pull request may close these issues.

feat/use RAG for PR feedback
2 participants