Applying the ideas of DeepSeek-R1 and Open R1 to computer use.
r1-computer-use is an experimental project that applies large-scale reinforcement learning techniques, similar to those behind DeepSeek-R1, to computer-use scenarios. The primary goal is to train an agent to interact with a computer environment (e.g., file system, web browser, command line) while a neural reward model validates the correctness of the agent's actions and reasons about intermediate steps.
DeepSeek-R1 has shown that large language models can develop powerful reasoning skills through iterative reward optimization. Traditionally, such projects rely on hard verifiers or rule-based scripts to determine correctness in tasks like math or coding. However, these methods are difficult to scale to general computer use, where correctness can rarely be checked by a simple rule.

We aim to replace hard-coded verifiers with a neural reward model that itself reasons about whether the agent's actions are correct and helpful.
Both the actor and reward models follow a three-step cycle, which can be seen as an extension of ReAct into reinforcement learning:
```python
# Actor step: observe the environment, reason about it, then act
observation = "Current directory contains: setup.py requirements.txt"

reasoning = """
1. Project appears to be a Python package
2. No virtual environment detected
3. Should create venv before proceeding
"""

action = "python -m venv .venv"

# Reward model step: analyze the actor's reasoning and action, then assign a scalar reward
analysis = """
1. Correctly identified project type
2. Appropriate prerequisite check
3. Standard venv location chosen
"""

reward = 0.85
```
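How the analysis and scalar reward above get produced is still an open design question. One possible realization of a "reasoning reward model" is an LLM judge that is prompted with a single observation/reasoning/action step, asked to critique it, and then asked to emit a score. The sketch below assumes an OpenAI-compatible chat endpoint; the prompt, the `score_step` helper, the placeholder judge model, and the JSON output format are illustrative assumptions, not part of this project's API.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

JUDGE_PROMPT = """You are a reward model for a computer-use agent.
Analyze the step below, then respond with JSON:
{{"analysis": "<brief critique>", "reward": <float between 0 and 1>}}

Observation: {observation}
Reasoning: {reasoning}
Action: {action}
"""


def score_step(observation: str, reasoning: str, action: str) -> dict:
    """Hypothetical helper: ask an LLM judge to reason about one actor step and score it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                observation=observation, reasoning=reasoning, action=action
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)


# e.g. score_step(observation, reasoning, action) -> {"analysis": "...", "reward": 0.85}
```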
```python
from r1_computer_use import Agent, RewardModel

agent = Agent()
reward_model = RewardModel()

# Run the actor on a task, capturing its reasoning traces alongside its actions
result = agent.run(
    task="Set up Python development environment",
    observe_reasoning=True,
)

# Score the trajectory with the neural reward model
feedback = reward_model.evaluate(
    actions=result.actions,
    reasoning=result.reasoning,
)
```
The training pipeline consists of multiple stages:
- **Cold Start**
  - Expert demonstrations with reasoning traces
  - Initial reward model training
  - Base model fine-tuning

- **Reasoning-Focused GRPO** (a minimal sketch of the update follows this list)
  - Group-based sampling from current policy
  - Reward model evaluates each group
  - Compute advantages within groups
  - Policy updates with clipped probability ratios
  - KL divergence constraint with reference policy

- **Rejection Sampling Stage** (see the filtering sketch after this list)
  - Filter top-k solutions based on reward model
  - Create new training dataset from best examples
  - Fine-tune base model on filtered data

- **General Preference Alignment**
  - Apply RL to full task distribution
  - Use reward models for general preferences
  - Focus on helpfulness and safety
  - Evaluate complete responses

- **Evaluation**
  - Task completion metrics
  - Reasoning quality assessment
  - Safety verification
  - Distribution shift analysis
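To make the GRPO stage concrete, here is a minimal sketch of the group-relative advantage computation and the clipped, KL-regularized surrogate loss, in the spirit of DeepSeek-R1's GRPO. The tensor shapes, `clip_eps`, and `kl_coef` values are illustrative assumptions, not this project's training code; in the pipeline above, the rewards would come from the neural reward model (e.g., `RewardModel.evaluate` in the usage example).

```python
import torch


def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """Sketch of a GRPO update for one group of trajectories sampled from the same task.

    logp_new / logp_old / logp_ref: per-trajectory log-probs under the current policy,
    the sampling policy, and a frozen reference policy (each of shape [group_size]).
    rewards: scalar scores from the neural reward model (shape [group_size]).
    """
    # Advantages are computed within the group: normalize rewards by the group mean/std
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped probability-ratio surrogate (PPO-style objective)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL constraint keeping the current policy close to the reference policy
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0).mean()

    return policy_loss + kl_coef * kl
```

Similarly, the rejection-sampling stage can be sketched as scoring candidate solutions with the reward model and keeping only the highest-reward ones per task for supervised fine-tuning (the `top_k` value and data layout are illustrative):

```python
def filter_top_k(candidates: list[str], rewards: list[float], top_k: int = 4) -> list[str]:
    """Keep the top_k highest-reward candidate solutions for a single task."""
    ranked = sorted(zip(rewards, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:top_k]]
```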
- Collect cold start and neural reward model data (in progress)
- SFT train base model
- GRPO RL training
- Rejection sampling
- General preference alignment
- Evaluation
Current areas of investigation:
- Reward model architectures
- Base model evaluations
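For the first item, one candidate direction is a pretrained transformer backbone with a scalar value head that scores a serialized (observation, reasoning, action) step. The sketch below only illustrates that shape of design; the placeholder base model and pooling choice are assumptions, not a settled architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class ScalarRewardModel(nn.Module):
    """Sketch: transformer backbone + linear value head emitting one reward per input."""

    def __init__(self, base_model: str = "Qwen/Qwen2.5-0.5B"):  # placeholder backbone
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(base_model)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.backbone = AutoModel.from_pretrained(base_model)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        hidden = self.backbone(**batch).last_hidden_state             # [batch, seq, hidden]
        mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)         # mean over real tokens
        return self.value_head(pooled).squeeze(-1)                    # [batch] scalar rewards
```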
Licensed under the MIT License.
```bibtex
@software{r1_computer_use,
  title  = {R1-Computer-Use: Reasoning-First Computer Interaction},
  author = {Barker, Patrick},
  year   = {2025},
  url    = {https://github.com/agentsea/r1-computer-use},
}
```