March 2026
Benchmarks, datasets, and infrastructure for AI agents that use computers and browsers. Every datapoint links to its source.
How well can AI agents actually use computers?
First scalable real computer environment for multimodal agents. 369 tasks across Ubuntu, Windows, and macOS involving real web and desktop apps.
Self-hosted web environment with 812 realistic tasks across e-commerce, forums, project management, and content editing sites.
532 tedious web tasks requiring heavy memory use, calculation, and long-term cross-page reasoning. Exposes weaknesses that standard WebArena tasks hide.
Extension of WebArena with 910 tasks that require visual understanding (images and spatial reasoning on pages) in addition to navigation.
Scores not yet aggregated.
General AI agent benchmark from Meta AI and Hugging Face. Covers multi-step reasoning, tool use, and web browsing across three difficulty levels.
First benchmark specifically for computer and browser use. 106 end-to-end workflows across 7 industries, including finance, e-commerce, and construction.
Internet-scale dataset for web agents. 2,350 tasks across 137 websites in 31 domains with real web interaction.
Scores not yet aggregated.
Training and evaluation data for computer-use and browser agents. The full catalog lists 75+ datasets.
Successful computer-use agent trajectories from OSWorld. 313 rows with screenshots, actions, accessibility trees, and reasoning traces.
Dynamic evaluation dataset with critical intermediate states for web agent assessment.
Dataset for conversational GUI agents with real-world web navigation demonstrations.
Largest-scale web trajectory dataset from dynamic web page exploration.
Internet-scale dataset for training GUI-based web agents, generated by an automated LLM pipeline without human annotation.
Large-scale cross-platform GUI agent trajectories with multimodal grounding and reasoning annotations.
Multi-level large-scale dataset for training GUI agents with rich element annotations.
Large-scale dataset enhancing GUI agents' text-rich visual understanding from rendered web pages.
Cloud browser platforms for AI agents. Pricing and features verified against vendor websites.