This is a collection of resources on computer-use agents, including videos, blog posts, papers, and projects. The repository is under construction and is updated continuously. Contributions and feedback are welcome!
- Claude | Computer use for automating operations
- Claude | Computer use for coding
- Claude | Computer use for orchestrating tasks
- Bill Gates | AI is about to completely change how you use computers
- Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
- Simon Willison | Initial explorations of Anthropic’s new Computer Use capability
- Ethan Mollick | When you give a Claude a mouse
- Nathan Lambert | Claude's agentic future and the current state of the frontier models
- Computer Use by Anthropic: A 5-Minute Setup Guide and Demo
- Automating macOS using Claude Computer Use
- Mind-Blowing Experience with Claude Computer Use
- Instant Claude Computer Use Demo
- Notes on Anthropic’s Computer Use Ability
- Anthropic Computer Use: Automate Your Desktop With Claude 3.5
Many of the papers listed here were identified and organized with the visualization tool ranpox/openreview-visualization.
- AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants
- OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
- PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- Agent Workflow Memory
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation
- SpiritSight Agent: Advanced GUI Agent with One Look
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
- AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
- Cradle: Empowering Foundation Agents towards General Computer Control
- Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
- Simulate Before Act: Model-Based Planning for Web Agents
- Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator
- Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
- Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL
- The Impact of Element Ordering on LM Agent Performance
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
- Tree Search for Language Model Agents
- [NeurIPS 2024] ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
- [LLM Agents Workshop@ICLR 2024] OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- [AAAI 2025] Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- OS-ATLAS: Foundation Action Model for Generalist GUI Agents
- UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding
- Grounding Multimodal Large Language Model in GUI World
- OmniParser for Pure Vision Based GUI Agent
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
- [ACL 2024] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
- Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
- GUI-World: A GUI-oriented Dataset for Multimodal LLM-based Agents
- [NeurIPS 2024] Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- AgentStudio: A Toolkit for Building General Virtual Agents
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
- Attacking Vision-Language Computer Agents via Pop-ups
- GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
- Adversarial Attacks on Multimodal Agents