Trajectories
A curated selection of 20 runs from Opus 4.6 + vision+XML — the top-scoring configuration. 13 clean successes and 7 instructive failures across all three task categories.
Each video shows the task instruction, rubric checklist, iOS screen, and agent action per step. For all 133 task definitions, visit the task list.
Single-app
5 runs · Basic interaction within one application.
Pull up my upcoming Catalina Island trip (dates shown in StayFinder) on StayFinder and check the listing details. What's the listing name, price per night, number of beds, and number of baths?
Set a new alarm for 6:45 AM labeled 'Gym' in the Clock app and confirm it's set.
Open the Product Strategy folder in CloudDrive, find the most recently modified file, and star it. What's the file name and last modified date?
Log today's breakfast in CalTrack — search the food database for 'oatmeal', add a serving, and give me the calories and macros.
Search for restaurants in San Francisco with the 'Outdoor Seating' tag on DineSpot and make a reservation at Harborline Seafood. Confirm the reservation and give me the date, time, and party size.
Multi-app
8 runs · Information moves between apps.
Open lockedin and find a technical program manager job posting. Note the requirements and benefits, then create a new file in CloudDocs called 'Interview Prep' with a brief summary of the job and 2 questions to ask a recruiter. Email Leo Chen the doc sharing link and ask for any advice on my preparation. What's the company name and key skills, and confirm the doc was created and email sent.
Check my SkyTrip app for my upcoming SFO to JFK flight details — date, time, and terminal. Then open the Weather app and add New York to check the forecast for my travel dates. Create a note in Notes titled 'NYC Trip Prep' with the flight details and expected weather. What's the flight date, departure time, terminal, and weather summary?
Request a CityRideX from Home (410 Brannan St) to Work (201 Mission St) in the CityRide app. Then post a status update in TeamChat #support-ops that I'm running late, and check my Mail for any morning meeting invites. What's the CityRide ETA and are there any meeting invites?
Open QuickBite and check my most recent Chipotle order details and total. Then check my MyBank credit card transactions for the corresponding charge and note the QuickBite total and the bank charge; report whether they roughly match. Also check my Mail inbox for a QuickBite receipt email, and add a note about the Chipotle expense to my Notes Shopping List. What are the order items, order total, bank charge amount, and email receipt amount?
Find the Ali Wong show on TicketBox and check ticket prices at Great American Music Hall. Compare CityRide ride prices (CityRideX vs Black) to Nob Hill (the neighborhood near Great American Music Hall). Book dinner beforehand at Golden Gate Izakaya on DineSpot for 2 after 5pm. Check Weather in SF for the evening. Message Elena Brooks in QuickChat with the full plan — show time, dinner reservation, and ride estimate. What's the ticket price, ride estimate, dinner reservation confirmation, and weather forecast?
Check my pending SplitPay requests — there's one I don't recognize. Note the amount and memo, then search my QuickChat conversations with the requester to see if they mentioned it. Check my MyBank credit card transactions for a matching charge around the same amount. If there's a match, go ahead and pay the SplitPay request; if not, message the requester asking what it's for. What's the request amount and memo, was a matching charge found, and what action did you take?
Check my SkyTrip app for my upcoming SFO to SEA flight dates, then check my StayFinder bookings to see if the Catalina Island trip (dates shown in StayFinder) overlaps with any flights. If there's a conflict, post in TeamChat #support-ops that I need to reschedule. Either way, update my 'Travel Packing List' note in Notes with the confirmed travel dates for both trips. What are the flight dates, is there a conflict, and confirm the note was updated.
Browse DineSpot for Embarcadero Grill in San Francisco and check its seating options. Look up the neighborhood on TasteRank to see if friends have reviewed nearby spots. Check Weather in San Francisco for this evening — if it's clear, book a table with outdoor seating; if rain is expected, book indoor. Message Leo Chen in QuickChat with the restaurant name, time, and whether I got outdoor or indoor seating. What's the restaurant name, seating type chosen, weather conditions, and any TasteRank reviews found, and confirm the reservation and message.
Memory
7 runs · Patterns the user never explicitly states.
Check my SplitPay and MyBank transaction history for any recurring monthly payments I receive. Figure out who they're from and how much. If this month's payment hasn't come in yet, send a request for it. Let me know the recurring pattern — who, amount, frequency — and confirm the request was sent.
Look at my CityRide app and figure out my most common route based on my saved locations. Then request a ride along that route and tell me the route and estimated fare.
I have a loved one's birthday coming up — check Notes, QuickChat, and Mail to figure out whose and when. Find a gift on MegaMart within the budget mentioned in my notes and make a DineSpot reservation for a birthday dinner. Let me know the person's name, birthday date, budget, the gift you chose, and the reservation confirmation.
Figure out what neighborhood I spend the most time in by looking at my QuickBite delivery addresses, TasteRank restaurant visits, and Mail receipts for ride destinations. Then find things to do in that area — check DineSpot for nearby restaurants and TicketBox for nearby events. Let me know the neighborhood and what's going on there.
Give me a full picture of my finances. Check MyBank balances, SplitPay balance and pending requests, recent MegaMart orders, and FreshCart spending. Figure out my net financial position and enter the summary in the 'Budget Tracker' document in CloudDocs. Let me know my net position with all account balances.
Analyze my shopping habits from MegaMart order history and MyBank credit card transactions. Figure out which product categories I spend the most on, total my MegaMart spending this month, and search MegaMart for deals in my top category. Update the Monthly Expense Log in CloudSheets with the findings. Let me know the top spending categories, monthly MegaMart total, and any deals found.
Build a 'Life Dashboard' summarizing key metrics across all my apps. Check MyBank for financial health (net worth from checking + savings - credit), TrailBlaze and CalTrack for health metrics (weekly miles, weight trend, calorie compliance), TeamChat and Mail for work load indicators, QuickChat and SplitPay for social activity frequency, and ScoreZone for upcoming games I'm tracking. Create a comprehensive CloudDocs document titled 'Life Dashboard' with sections for Finance, Health, Work, Social, and Entertainment. Let me know the key metrics for each category and confirm the document was created.