iOSWorld

Leaderboard

Six models, twelve configurations (each model is run vision-only and with vision+XML access). Each task receives a binary pass/fail from a GPT-5.4 Mini judge that reviews the agent's full trajectory against the task rubric. Bold marks the best pass rate in each column; step counts are averages across trajectories.

Human–judge agreement on 128 Opus 4.6 trajectories: κ = 0.77 · 89% accuracy · F1 = 0.86.
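Cohen's κ corrects raw human–judge agreement for agreement expected by chance. A minimal sketch of the binary-label computation; the confusion-matrix counts below are hypothetical placeholders for illustration, not the benchmark's actual annotation counts:

```python
# Cohen's kappa for binary (pass/fail) human-vs-judge agreement.
# tp/fp/fn/tn are counts from a 2x2 confusion matrix between the
# two raters. The example counts are hypothetical.
def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                          # observed agreement
    p_pass = ((tp + fp) / n) * ((tp + fn) / n)   # chance both say "pass"
    p_fail = ((fn + tn) / n) * ((fp + tn) / n)   # chance both say "fail"
    p_e = p_pass + p_fail                        # total chance agreement
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa(tp=40, fp=7, fn=7, tn=74), 2))  # → 0.76
```

With 128 trajectories, the observed agreement (tp + tn) / n plays the role of the accuracy figure, and κ discounts it by how often two raters with these marginal pass rates would agree by luck alone.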

| Model | Modality | Single (27) Pass | Steps | Multi (60) Pass | Steps | Memory (46) Pass | Steps | Overall (133) Pass | Steps |
|---|---|---|---|---|---|---|---|---|---|
| Opus 4.6 (Anthropic) | vision-only | 70.4% | 23.1 | 20.0% | 45.5 | 8.7% | 49.4 | 26.3% | 42.3 |
| Opus 4.6 (Anthropic) | vision+XML | 81.5% | 16.4 | **36.7%** | 38.2 | **54.3%** | 39.3 | **51.9%** | 34.1 |
| Sonnet 4.6 (Anthropic) | vision-only | 77.8% | 26.3 | 21.7% | 46.9 | 10.9% | 49.8 | 29.3% | 43.7 |
| Sonnet 4.6 (Anthropic) | vision+XML | **92.6%** | 15.4 | 35.0% | 42.0 | 34.8% | 44.8 | 46.6% | 37.5 |
| GPT-5.4 (OpenAI) | vision-only | 63.0% | 32.5 | 11.7% | 48.5 | 6.5% | 48.3 | 20.3% | 45.2 |
| GPT-5.4 (OpenAI) | vision+XML | 81.5% | 12.1 | 26.7% | 37.1 | 32.6% | 37.2 | 39.8% | 32.1 |
| GPT-5.4 Mini (OpenAI) | vision-only | 70.4% | 24.0 | 18.3% | 46.3 | 10.9% | 48.9 | 26.3% | 42.7 |
| GPT-5.4 Mini (OpenAI) | vision+XML | 66.7% | 14.9 | 1.7% | 43.8 | 4.3% | 41.5 | 15.8% | 37.2 |
| Gemini 3 Flash (Google) | vision-only | 70.4% | 11.0 | 13.3% | 23.9 | 21.7% | 23.5 | 27.8% | 21.2 |
| Gemini 3 Flash (Google) | vision+XML | 70.4% | 11.5 | 18.3% | 38.0 | 17.4% | 33.3 | 28.6% | 31.0 |
| Qwen3.5 35B-A3B (Alibaba, open) | vision-only | 40.7% | 23.0 | 6.7% | 36.2 | 4.3% | 33.4 | 12.8% | 32.6 |
| Qwen3.5 35B-A3B (Alibaba, open) | vision+XML | 48.1% | 31.4 | 0.0% | 43.9 | 2.2% | 37.9 | 10.5% | 39.3 |
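The Overall column is the task-count-weighted mean of the three category pass rates, using the category sizes from the table header. A quick sanity check reproducing the Opus 4.6 vision+XML row:

```python
# Category sizes come from the leaderboard header: Single (27),
# Multi (60), Memory (46), Overall (133).
sizes = {"single": 27, "multi": 60, "memory": 46}

# Per-category pass rates for the Opus 4.6 vision+XML row.
rates = {"single": 0.815, "multi": 0.367, "memory": 0.543}

# Weighted mean = total tasks passed / total tasks.
passed = sum(sizes[c] * rates[c] for c in sizes)
overall = passed / sum(sizes.values())
print(round(overall * 100, 1))  # → 51.9
```

The same check holds for every row, since each overall figure is just (tasks passed) / 133 rather than an unweighted average of the three percentages.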
Best · multi-app: 36.7% (Opus 4.6, vision+XML). Multi-app remains the hardest category in the suite.

Best · single-app: 92.6% (Sonnet 4.6, vision+XML). With XML access, single-app tasks are nearly solved.

Best · memory: 54.3% (Opus 4.6, vision+XML). Largest absolute lift from XML (8.7% → 54.3%).

Figure 3. Pass rates by category and modality across six models. Vision+XML (blue) outperforms vision-only (gray) for the stronger frontier models; GPT-5.4 Mini and Qwen3.5 do not show the same benefit.

Figure 4. Vision-only vs. vision+XML deltas across models. The privileged-access lift is highly model-dependent.

Figure 5. Pass rate vs. step budget across models and categories. Single-app pass rate saturates by step 20; multi-app keeps improving through step 40; memory varies widely.