Lawrence Keunho Jang · Mareks Woodside† · Geronimo Carom† · Andrew Jang† · Jing Yu Koh · Ruslan Salakhutdinov
Carnegie Mellon University · †Equal contribution
A useful phone agent will have to be personally intelligent — it must reason over the user's identity, history, and preferences as they exist on their device, not just instructions in an impersonal sandbox. Existing phone-agent benchmarks evaluate the latter and largely ignore the former. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity that spans 26 newly-built iOS apps with interconnected data. iOSWorld includes 133 tasks across three categories of increasing difficulty. We evaluate leading frontier and open-source computer-use models under both vision-only and privileged vision+XML settings; the best configuration reaches 51.9% overall but only 36.7% on multi-app tasks. Privileged XML access improves the stronger frontier models by up to 26 points; smaller models do not benefit. We release iOSWorld in full — apps, seed data, tasks, rubrics, and evaluation code.
Figure 1. The home screen for our fictional persona, Jordan Avery. All 26 iOSWorld apps appear here.
A person's phone is not a blank slate. Transactions, messages, social connections, and financial records accumulate across many applications, forming a record that any useful assistant would need to understand and navigate. We use the term personally intelligent to describe the corresponding agent capability: reasoning over a user's identity, history, and preferences as they actually exist on the device, rather than executing turn-by-turn instructions in isolation.
Current benchmarks for phone agents ignore this dimension. Tasks are issued against stock app states with no persistent user data, no cross-app continuity, and no real notion of a user. An agent that can tap the right button on a settings screen but cannot infer its owner's most common commute route has not demonstrated useful capability.
Existing interactive mobile-agent benchmarks target Android (AndroidWorld, AndroidLab, SPA-Bench, B-MoCA, GUI Odyssey). iOS serves over 2.5 billion active devices and accounts for roughly 58–60% of U.S. mobile OS usage, but no interactive iOS-simulator benchmark exists, and no mobile benchmark on any platform seeds applications with a persistent user identity. iOSWorld is the first.
One persona, twenty-six SwiftUI apps, interconnected seed data.
All 26 applications share a single user identity: Jordan Avery, a San Francisco–based professional living at 410 Brannan Street who works at Northstar Studio and trains for a half marathon. Jordan's contacts — Maya Patel, Leo Chen, Kai Santos — appear as messaging correspondents, payment recipients, mail senders, professional connections, and team members.
The data is genuinely interconnected: a Chipotle order in QuickBite produces a charge in MyBank and a receipt in Mail; an upcoming SFO→JFK flight in SkyTrip aligns with a hotel booking in StayFinder and a calendar reminder in Notes. These cross-references ensure that multi-app and memory tasks require evidence from more than one application.
Apps were developed in SwiftUI and manually verified for correct navigation, data rendering, and seed-data consistency. They span finance, messaging, travel, food, shopping, productivity, entertainment, fitness, sports, utilities, and professional networking. Two build on open-source foundations (Notes from snowNotes, Cinephile from MovieSwiftUI). User data is encoded in Swift seed fixtures and JSON snapshots loaded at build time.
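To give a flavor of how the interconnection is encoded, here is a minimal sketch of what a seed fixture might look like. The type names, property names, IDs, and amounts are hypothetical; the released fixtures define the real schema.

```swift
import Foundation

// Hypothetical seed fixture illustrating cross-app linkage; the released
// source defines the real schema. A QuickBite order points at the MyBank
// transaction and Mail message that reference the same purchase.
struct SeedOrder {
    let vendor: String
    let items: [String]
    let total: Decimal
    let bankTransactionID: String   // matched by a MyBank seed charge
    let receiptMessageID: String    // matched by a Mail seed message
}

enum QuickBiteSeed {
    static let orders: [SeedOrder] = [
        SeedOrder(
            vendor: "Chipotle",
            items: ["Chicken burrito bowl", "Chips & guacamole"],
            total: Decimal(string: "18.42")!,          // illustrative amount
            bankTransactionID: "txn-chipotle-0001",    // illustrative ID
            receiptMessageID: "mail-quickbite-0001"    // illustrative ID
        )
    ]
}
```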
133 tasks across three categories of increasing difficulty.
Each task is accompanied by a rubric — independently verifiable criteria that decompose the objective into steps. The benchmark contains 1,123 rubric items across 133 tasks (mean 8.4 per task, range 4–13). Tasks were generated by Claude Code with full access to each app's source code and seed data, then reviewed and corrected by human annotators. Forty-four of the 175 candidate tasks required corrections; we trimmed the set to 133.
Single-app. Basic navigation and interaction within one application.
"Set a 6:45 AM alarm in Clock labeled Gym and confirm it's set."
Multi-app. 2–8 applications. Information must move between them.
"Check my recent Chipotle order in QuickBite. Then check MyBank for the corresponding charge. Find the receipt email in Mail and note any price differences in Notes."
Memory. Patterns the user never explicitly states.
"Look at my CityRide app and figure out my most common route based on saved locations. Then request a ride along that route."
Six models, two observation modalities, an LLM-as-a-judge.
We evaluate five frontier computer-use models — Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, GPT-5.4 Mini, Gemini 3 Flash — and one open-source baseline, Qwen3.5 35B-A3B (an open-weights MoE served via vLLM with the official Qwen3-VL mobile-agent cookbook). Each model is tested under both observation modalities, yielding twelve configurations. All runs use a maximum of 50 interaction steps per task with screenshots at 1536-pixel maximum dimension.
Vision-only. The agent receives a screenshot at each step. It must visually identify UI elements, estimate their coordinates, and infer the application state from pixels alone. Six actions: tap_xy, type, swipe, home, wait, stop. Coordinates normalized to a 0–1000 grid.
Vision + XML. Adds a cleaned XCUITest accessibility tree (≤ 200 elements, ≤ 15 levels deep) and four extra actions: tap (by accessibility identifier), launch_app, terminate_app, open_url. This is a privileged-access condition; vision-only reflects deployable capability.
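Taken together, the two modalities can be summarized as a single action enum. A minimal sketch follows; the associated values are assumptions, and the released harness defines the actual wire format.

```swift
// Sketch of the action space; associated values are assumed.
enum AgentAction {
    // Available in both modalities. Coordinates are normalized to a
    // 0–1000 grid regardless of device resolution.
    case tapXY(x: Int, y: Int)
    case type(text: String)
    case swipe(fromX: Int, fromY: Int, toX: Int, toY: Int) // params assumed
    case home
    case wait
    case stop(answer: String?)  // answer slot assumed for question-style tasks

    // Added under vision+XML only.
    case tap(accessibilityID: String)  // target by identifier, not pixels
    case launchApp(bundleID: String)   // bypasses home-screen navigation
    case terminateApp(bundleID: String)
    case openURL(String)
}
```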
Each task is scored after the trajectory completes by a GPT-5.4 Mini judge that reviews the agent's full screenshot+action history against the per-task rubric. Human validation on 128 Opus 4.6 trajectories confirms substantial agreement: κ = 0.77 at the task level (89% accuracy, F1 = 0.86).
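A natural reading of this scheme, consistent with the score < 0.5 failure criterion used in the analysis below, is that the judge returns a verdict per rubric item and the task score is the fraction satisfied. A minimal sketch under that assumption (the paper's text does not confirm the exact aggregation):

```swift
// Sketch of rubric aggregation, assuming a task's score is the fraction
// of rubric items the judge marks satisfied, with score >= 0.5 as a pass.
struct JudgeVerdict {
    let rubricItemID: String
    let satisfied: Bool
}

func taskScore(_ verdicts: [JudgeVerdict]) -> Double {
    guard !verdicts.isEmpty else { return 0.0 }
    let satisfied = verdicts.filter { $0.satisfied }.count
    return Double(satisfied) / Double(verdicts.count)
}

func taskPassed(_ verdicts: [JudgeVerdict]) -> Bool {
    taskScore(verdicts) >= 0.5
}
```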
Frontier models reach 93% on single-app, 37% on multi-app.
With vision+XML, Sonnet 4.6 reaches 92.6% on single-app tasks, and Opus 4.6 leads on memory at 54.3% and on multi-app at 36.7%. Multi-app remains the hardest category. In vision-only mode, frontier models cluster between 20% and 29% overall.
| Configuration | Single (27) | Multi (60) | Memory (46) | Overall (133) |
|---|---|---|---|---|
| #1 Opus 4.6 vision+xml | 81.5% | 36.7% | 54.3% | 51.9% |
| #2 Sonnet 4.6 vision+xml | 92.6% | 35.0% | 34.8% | 46.6% |
| #3 GPT-5.4 vision+xml | 81.5% | 26.7% | 32.6% | 39.8% |
Why XML helps frontier models, why it hurts smaller ones, and what failure looks like.
The vision-to-XML gap is large due to iOS-specific factors: dense interfaces with small touch targets, app launching that requires home-screen navigation (which launch_app bypasses), the accessibility tree exposing off-screen elements, and the absence of a hardware back button. Across the 26 Opus tasks where vision-only fails (score < 0.5) and vision+XML passes outright, ∼70% feature a home-screen / app-switching failure eliminated by launch_app.
The lift is not uniform. Memory sees the largest absolute improvement (Opus: 9% → 54%) because the tree exposes element labels and values directly. Multi-app benefits too (Opus: 20% → 37%) once the agent can launch apps and target elements precisely.
Two smaller-capacity configurations break the gain pattern. GPT-5.4 Mini lands at 16% with vision+XML versus 26% vision-only — the additional ∼3,100 tokens per step appear to push it past its effective context limit. Twenty-two of the 35 tasks Mini solves vision-only become failures under XML.
Qwen3.5 35B-A3B shows a related but more severe pattern: XML drops it from 13% to 11% overall and from 7% to 0% on multi-app, with ∼42% of its 119 XML failures dominated by action loops. The accessibility tree helps when the model can use it and becomes noise when it cannot.
We classify all 422 vision+XML failures across the five frontier models into four mutually exclusive modes. Timing out dominates multi-app (54%) and memory (47%) failures; premature stops dominate single-app (50%). GPT-5.4 Mini gives up on 46% of its failures; Gemini loops on 15%. Qwen3.5, outside the frontier set, has a different profile entirely (∼42% looping across its 119 XML failures).
Curated runs from the top-scoring configuration.
A selection of 20 representative runs appears below. Each trajectory renders the task, the rubric, the per-step iOS screenshot, and the agent's action. All come from Opus 4.6 with vision+XML, the highest-scoring configuration in the paper.
Log today's breakfast in CalTrack — search the food database for 'oatmeal', add a serving, and give me the calories and macros.
Open QuickBite and check my most recent Chipotle order details and total. Then check my MyBank credit card transactions for the corresponding charge and note the QuickBite total and the bank charge; report whether they roughly match. Also check my Mail inbox for a QuickBite receipt email, and add a note about the Chipotle expense to my Notes Shopping List. What are the order items, order total, bank charge amount, and email receipt amount?
Look at my CityRide app and figure out my most common route based on my saved locations. Then request a ride along that route and tell me the route and estimated fare.
Frontier models solve up to 93% of single-app tasks with privileged vision+XML access, but the best overall configuration still reaches only 37% on multi-app and 54% on memory tasks; the open-source Qwen3.5 35B-A3B baseline remains far behind (11% overall, 0% multi-app, ∼42% of its failures looping). Forty-eight percent of frontier failures are timeouts.
Closing the gap to personally intelligent phone agents will require progress on loop recovery, retrieval-augmented memory, and user-aware planning. We release iOSWorld in full — apps, seed data, tasks, rubrics, and evaluation code — to support reproducible research.
All apps, seed data, tasks, rubrics, and evaluation code are open source under Apache 2.0.
@misc{jang2026iosworld,
title = {iOSWorld: A Benchmark for Personally Intelligent Phone Agents},
author = {Jang, Lawrence Keunho and Woodside, Mareks and Carom, Geronimo
and Jang, Andrew and Koh, Jing Yu and Salakhutdinov, Ruslan},
year = {2026},
eprint = {XXXX.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.AI}
}