A research artifact, open and citable.
An iOS simulator running 26 purpose-built apps, populated with one fictional persona, and scored against per-task rubrics by an LLM judge validated against human annotators at κ = 0.77.
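For readers unfamiliar with the agreement statistic, the sketch below shows how a κ value can be computed from paired judge and human labels. It assumes κ here means Cohen's kappa between two raters, which the text above does not state explicitly. The function name `cohens_kappa` and the toy labels are illustrative and are not taken from the iOSWorld evaluation scripts.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Illustrative helper only; not part of the iOSWorld codebase.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (made up): 1 = task judged successful, 0 = judged failed.
judge = [1, 1, 0, 1, 0, 0, 1, 0]
human = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(judge, human))
```

On this toy data the raters agree on 6 of 8 items (0.75) against a chance rate of 0.5, giving κ = 0.5. The reported κ = 0.77 would correspond to substantially stronger agreement than this example.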
Authors
- Lawrence Keunho Jang
- Mareks Woodside †
- Geronimo Carom †
- Andrew Jang †
- Jing Yu Koh
- Ruslan Salakhutdinov
Citation
If you use iOSWorld in your research, please cite the arXiv preprint.
@misc{jang2026iosworld,
title = {iOSWorld: A Benchmark for Personally Intelligent Phone Agents},
author = {Jang, Lawrence Keunho and Woodside, Mareks and Carom, Geronimo
and Jang, Andrew and Koh, Jing Yu and Salakhutdinov, Ruslan},
year = {2026},
eprint = {XXXX.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.AI}
}
Ethics
Synthetic data. All data in iOSWorld is entirely synthetic. The Jordan Avery persona is fictional, and no real user data was collected, processed, or used at any stage. Benchmark runs use deterministic seeded data and do not depend on real user accounts, real services, or external databases.
Malicious agents. Phone agents capable of operating autonomously on a user's device carry significant dual-use risks. We encourage researchers to develop agents with explicit user consent mechanisms and action confirmation for irreversible operations.
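One way to read "action confirmation for irreversible operations" is as a gate between the agent's proposed action and its execution. The following is a minimal sketch under that assumption. The `Action` type, the `IRREVERSIBLE_KINDS` set, and the `execute_with_confirmation` function are all hypothetical names chosen for illustration and do not describe any interface in iOSWorld.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action kinds treated as irreversible; a real agent
# would need its own, carefully reviewed policy.
IRREVERSIBLE_KINDS = {"delete", "send", "purchase"}


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "tap", "delete", "send"
    target: str  # human-readable description of what is acted on


def execute_with_confirmation(
    action: Action,
    perform: Callable[[Action], None],
    confirm: Callable[[Action], bool],
) -> bool:
    """Run `perform` only if the action is reversible or the user confirms it.

    Returns True if the action was performed, False if it was blocked.
    """
    if action.kind in IRREVERSIBLE_KINDS and not confirm(action):
        return False
    perform(action)
    return True


# Usage: the user declines every confirmation prompt.
performed = []
deny_all = lambda action: False

blocked = execute_with_confirmation(Action("delete", "photo album"), performed.append, deny_all)
allowed = execute_with_confirmation(Action("tap", "Settings icon"), performed.append, deny_all)
```

Here the delete is blocked because the user declined, while the reversible tap goes through without a prompt. Keeping the confirmation callback separate from the agent's policy means the model cannot bypass the check by changing its own output.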
iOS access and reproducibility. iOSWorld requires macOS with Xcode to run the iOS Simulator, which limits reproducibility to researchers with access to Apple hardware. We release all source code, seed data, and evaluation scripts. Vision-only numbers reflect deployed capability; vision+XML represents an upper bound with privileged access via XCUITest.
Accessibility. Capable phone agents could improve accessibility for users with visual, motor, or cognitive impairments. iOSWorld is a research benchmark for measuring progress in a controlled simulator; results should not be interpreted as indicating readiness for deployment on real devices with real user data.
License
Apache License 2.0. See the LICENSE file in the benchmark repository.