R1 Computer Use

Applying the ideas of DeepSeek-R1 and Open R1 to computer use.

Overview

r1-computer-use is an experimental project that applies large-scale reinforcement learning techniques, similar to those behind DeepSeek-R1, to computer-use scenarios. The primary goal is to train an agent that interacts with a computer environment (e.g., file system, web browser, command line) while a neural reward model validates the correctness of the agent’s actions and reasons about intermediate steps.

Architecture

DeepSeek-R1 has shown that large language models can develop powerful reasoning skills through iterative reward optimization. Such projects have traditionally relied on hard verifiers or rule-based scripts to determine correctness in tasks like math or coding, but those methods are difficult to reproduce at scale for general computer usage.

We aim to replace hard-coded verifiers with a neural reward model that itself reasons about whether the agent’s actions are correct and helpful.

Both the actor and the reward model follow a three-step cycle, which can be seen as an extension of ReAct into reinforcement learning.

[Architecture diagram]

Agent

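# One turn of the actor's cycle: observe the environment, reason in natural language, then emit an action.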
observation = "Current directory contains: setup.py requirements.txt"
reasoning = """
1. Project appears to be a Python package
2. No virtual environment detected
3. Should create venv before proceeding
"""
action = "python -m venv .venv"

Reward Model

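# The reward model reasons about the agent's step before assigning a scalar reward.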
analysis = """
1. Correctly identified project type
2. Appropriate prerequisite check
3. Standard venv location chosen
"""
reward = 0.85
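
Putting the two snippets together, one trajectory might be collected roughly as follows. The agent.step, reward_model.score, and env methods used here are illustrative assumptions rather than the package's current API; the sketch only shows the observe-reason-act-score cycle that drives training.

from dataclasses import dataclass

@dataclass
class Step:
    observation: str
    reasoning: str
    action: str
    reward: float

def rollout(agent, reward_model, env, max_steps=8):
    # Collect one trajectory of reasoned steps, each scored by the neural reward model
    # in place of a hard-coded verifier. All interfaces here are hypothetical.
    trajectory = []
    observation = env.observe()
    for _ in range(max_steps):
        reasoning, action = agent.step(observation)                   # actor: reason, then act
        reward = reward_model.score(observation, reasoning, action)   # judge this step
        trajectory.append(Step(observation, reasoning, action, reward))
        observation = env.execute(action)                             # apply the action, observe the result
        if env.done():
            break
    return trajectory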

Usage (in progress)

from r1_computer_use import Agent, RewardModel

agent = Agent()
reward_model = RewardModel()

result = agent.run(
    task="Set up Python development environment",
    observe_reasoning=True
)

feedback = reward_model.evaluate(
    actions=result.actions,
    reasoning=result.reasoning
)

Training Pipeline

The training pipeline consists of multiple stages:

  1. Cold Start

    • Expert demonstrations with reasoning traces
    • Initial reward model training
    • Base model fine-tuning
  2. Reasoning-Focused GRPO (see the sketch after this list)

    • Group-based sampling from current policy
    • Reward model evaluates each group
    • Compute advantages within groups
    • Policy updates with clipped probability ratios
    • KL divergence constraint with reference policy
  3. Rejection Sampling Stage (see the sketch after this list)

    • Filter top-k solutions based on reward model
    • Create new training dataset from best examples
    • Fine-tune base model on filtered data
  4. General Preference Alignment

    • Apply RL to full task distribution
    • Use reward models for general preferences
    • Focus on helpfulness and safety
    • Evaluate complete responses
  5. Evaluation

    • Task completion metrics
    • Reasoning quality assessment
    • Safety verification
    • Distribution shift analysis
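
To make stage 2 concrete, here is a minimal sketch of the GRPO objective for one group of sampled trajectories: rewards from the reward model are normalized within the group to form advantages, the policy is updated through a clipped probability ratio, and a KL term keeps it close to the reference policy. The function, tensor shapes, and hyperparameter values are illustrative assumptions, not the project's training code.

import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    # logp_new / logp_old / logp_ref: (G,) log-probabilities of G sampled trajectories
    # under the current, sampling, and reference policies.
    # rewards: (G,) scalar scores from the neural reward model.

    # Group-relative advantages: normalize rewards within the group (no value network).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped probability-ratio objective, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Per-sample KL estimate that penalizes drifting from the reference policy.
    log_ratio = logp_ref - logp_new
    kl = (torch.exp(log_ratio) - log_ratio - 1).mean()

    return policy_loss + kl_coef * kl

Stage 3's filtering step might look like the sketch below, assuming reward_model.evaluate returns a scalar score and that sampled solutions are grouped by task; both assumptions are for illustration only.

def filter_top_k(samples_by_task, reward_model, k=4):
    # Keep the k highest-scoring solutions per task to build the next fine-tuning dataset.
    dataset = []
    for task, solutions in samples_by_task.items():
        scored = [(reward_model.evaluate(actions=s.actions, reasoning=s.reasoning), s)
                  for s in solutions]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        dataset.extend((task, s) for _, s in scored[:k])
    return dataset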

Roadmap

  • Collect cold start and neural reward model data (in progress)
  • SFT train base model
  • GRPO RL training
  • Rejection sampling
  • General preference alignment
  • Evaluation

Research

Current areas of investigation:

  • Reward model architectures
  • Base model evaluations

License

MIT

Citation

@software{r1_computer_use,
  title     = {R1-Computer-Use: Reasoning-First Computer Interaction},
  author    = {Barker, Patrick},
  year      = {2025},
  url       = {https://github.com/agentsea/r1-computer-use},
}
