Talking to Computers: The Shift from Typing to Agents That See, Click, and Speak
Computer-use agents that drive software and voice agents that speak back are colliding. Here’s what that actually looks like to build and what’s still broken.
The Demo That Broke My Mental Model
I was sitting in a co-working space in Ho Chi Minh City in early 2025, jet-lagged and trying to book a guesthouse in Hanoi. I had a tab open with a hotel aggregator, another with Google Maps, and a third with a Vietnamese bank’s wire transfer portal. Fifteen minutes of copy-pasting later, I gave up and just called the place.
A few weeks later I watched a demo of OpenAI’s Computer-Using Agent — what became the backbone of their “Operator” product — book a restaurant reservation. The agent opened a browser, navigated to OpenTable, filled in the party size and date, handled the confirmation screen, and did the whole thing while a voice narrated what it was doing. It wasn’t flawless. It paused in a weird spot. It misread one dropdown. But it completed the task.
That broke something in my head. Not because it was impressive, but because I suddenly understood what interface paradigm we’d been living in for forty years and what was about to replace it.
Every interaction I’ve ever had with a computer has been mediated by me translating my intent into a form the machine accepts: I learn the shortcut keys, I find the right menu, I figure out which field maps to which database column. The computer is not meeting me where I am. I am meeting the computer where it is.
Computer-use agents and voice agents, taken together, are the first credible attempt to flip that relationship.
What a Computer-Use Agent Actually Is
The term gets thrown around loosely, so let me be precise about the technical picture.
A computer-use agent is a vision-language model that takes a screenshot of a screen as input and outputs actions: move cursor, click at coordinate, type text, press key combination, scroll. That’s the entire action space. The model doesn’t have access to the DOM, doesn’t hook into application APIs, doesn’t read memory. It sees what you see and acts like a person would act — one pixel-level observation at a time.
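To make that action space concrete, here is a minimal sketch of the observe-decide-act loop in Python. Everything in it is an assumption made for illustration: the action class names, `run_agent`, and the scripted stand-in for the model are hypothetical and are not any vendor’s API. A real system would replace `capture` with an actual screenshot call and `decide` with a request to a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical action types mirroring the narrow action space described above.
@dataclass
class Click:
    x: int
    y: int


@dataclass
class TypeText:
    text: str


@dataclass
class KeyPress:
    keys: str  # e.g. "ctrl+s"


@dataclass
class Scroll:
    dx: int
    dy: int


@dataclass
class Done:
    reason: str


Action = Union[Click, TypeText, KeyPress, Scroll, Done]


def run_agent(
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    goal: str,
    max_steps: int = 50,
) -> List[Action]:
    """Observe, decide, act; stop when the model returns Done or the step cap is hit."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                      # the model only ever sees pixels
        action = decide(goal, screenshot, history)  # stand-in for the model call
        history.append(action)
        if isinstance(action, Done):
            break
        execute(action)
    return history


def make_scripted_model(script: List[Action]):
    """Scripted stand-in for the model so the loop can be exercised offline."""
    steps = iter(script)

    def decide(goal: str, screenshot: bytes, history: List[Action]) -> Action:
        return next(steps, Done("script exhausted"))

    return decide


executed: List[Action] = []
trace = run_agent(
    capture=lambda: b"fake-png-bytes",
    decide=make_scripted_model(
        [Click(120, 340), TypeText("2 guests"), Done("form filled")]
    ),
    execute=executed.append,
    goal="book a table",
)
```

The `max_steps` cap is a design choice worth keeping in any real version: without it, a model that never decides it is finished will keep consuming screenshots and tokens.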
The three implementations worth knowing about:
Anthropic’s computer use tool shipped in October 2024 as part of Claude 3.5 Sonnet, initially in public beta. The core interaction loop: the model receives a screenshot via the computer tool, reasons about what it sees, and returns an action as a tool call (click at x,y, type a string, press a key). Your application wraps a VM or container, captures the screen, passes it to the API, executes the returned action, and captures the next frame. Repeat until the task is done or the model decides it’s finished. Claude Sonnet 4.5 as of 2025 reports 61.4% on OSWorld, the canonical benchmark for computer use tasks across real operating system environments, up substantially from the ~20% that was the frontier a year earlier. By early 2026 Anthropic launched what they’re calling “Cowork,” a more integrated desktop product that doesn’t require you to build the sandbox yourself.
OpenAI’s Computer-Using Agent (CUA), which powered Operator, was trained specifically on GUI interaction using reinforcement learning on top of GPT-4o’s vision capabilities. When OpenAI launched it in January 2025, it hit 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager (a web-only benchmark). Those numbers look impressive until you understand that OSWorld is full computer use on actual operating systems, not web-only, and 38% means roughly six out of ten tasks failed. OpenAI folded Operator into ChatGPT Agent in July 2025 because the developer API story was losing to Anthropic and standalone Operator had reliability gaps on real commerce workflows. The underlying CUA model lives on inside ChatGPT Agent and the Agents SDK.
Google’s Project Mariner launched as a research prototype in December 2024, built on Gemini 2.0, and scored 83.5% on WebVoyager. It ran as a Chrome extension and could parallelize up to ten tasks simultaneously. Google shut it down in May 2026 and folded the technology into Gemini Agent and Chrome’s “Auto-Browse” capability. The pattern with all three is the same: the standalone computer-use product becomes infrastructure inside the main product surface.
OSWorld is the benchmark worth following. It was presented at NeurIPS 2024 and tests agents on real tasks across Ubuntu environments with real applications — LibreOffice, Chrome, file management, terminal commands. When the benchmark shipped, the best models hovered around 15–20%. By late 2025, agentic frameworks like Simular’s Agent S were reporting 72.6%, just above the human baseline of 72.36%. That trajectory — from 20% to near-human in roughly 18 months — is the actual story of how fast this category is moving.
The Hard Technical Problems Nobody Shows in the Demo
The demos are always the happy path. Let me describe what breaks.
Screen grounding. The model has to identify exactly where a UI element is located in a screenshot to click it. On a clean web page with large buttons this is easy. On a dense enterprise SaaS dashboard with 40 clickable elements in a 1400x900 viewport, it is not. Research published at ICML 2025 on OSWorld showed that most of the time goes elsewhere: planning (deciding what to do next) accounts for over half of total task latency, while error recovery accounts for another 20-35%. But the vision step of locating and clicking the right element is one of the most common failure points, especially for small targets, overlapping elements, or anything that requires a scroll to reveal.
The deeper issue is that pixel coordinates are fragile. A responsive layout that shifts because a sidebar collapsed, a pop-up modal that appeared over the button you were about to click, a page that rerendered while the model was mid-action — any of these break the grounding between the model’s last observation and the current state of the screen. Humans handle this instantly and implicitly; we re-orient. Current agents often don’t. They proceed with the action they planned based on the last frame and it lands in the wrong place.
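One way to guard against acting on a stale frame is to re-capture just before the click and check that the area around the target still looks the same. The sketch below is my own illustration, not something any of the vendors document. It treats a frame as rows of grayscale values in the range 0-255 (a stand-in for a real screenshot), and the names `region_fingerprint` and `target_still_valid` are hypothetical.

```python
import hashlib
from typing import Sequence

# Rows of grayscale pixel values (0-255); a simplified stand-in for a screenshot.
Frame = Sequence[Sequence[int]]


def region_fingerprint(frame: Frame, x: int, y: int, radius: int = 8) -> str:
    """Hash the pixels in a square window around (x, y), clipped to the frame."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    top, bottom = max(0, y - radius), min(height, y + radius + 1)
    left, right = max(0, x - radius), min(width, x + radius + 1)
    digest = hashlib.sha256()
    for row in frame[top:bottom]:
        digest.update(bytes(row[left:right]))
    return digest.hexdigest()


def target_still_valid(
    planned_frame: Frame, fresh_frame: Frame, x: int, y: int, radius: int = 8
) -> bool:
    """True if the window around the planned click target is pixel-identical."""
    return region_fingerprint(planned_frame, x, y, radius) == region_fingerprint(
        fresh_frame, x, y, radius
    )


# The frame the model planned against: a blank 40x30 screen.
before = [[0] * 40 for _ in range(30)]

# A change far from the target (top-left corner) should not block the click.
after_far = [row[:] for row in before]
after_far[0][0] = 255

# A change on top of the target (say, a modal appeared) should force a replan.
after_modal = [row[:] for row in before]
after_modal[15][20] = 255

ok_far = target_still_valid(before, after_far, x=20, y=15)
ok_modal = target_still_valid(before, after_modal, x=20, y=15)
```

An exact hash is deliberately strict. On a real screen, a blinking cursor or anti-aliasing differences would trip it, so a production version would likely need a tolerance-based comparison instead.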
Action space mismatches. The set of allowed actions (click, type, scroll, key press) is actually quite narrow compared to what humans do with a keyboard and mouse. Drag-and-drop — moving a file from one folder to another, rearranging items in a list — is notoriously unreliable across current implementations. Anthropic’s own documentation calls out scrolling, dragging, and zooming as known weaknesses as of early 2026. These aren’t edge cases in real workflows. Dragging columns in a spreadsheet, reordering slides in a presentation, moving a task card across a Kanban board — these are the motions of knowledge work.
Latency per step. Each step in a computer-use loop is: capture screenshot, encode it, send to API, receive response, parse action, execute action, wait for page/application to settle, capture next screenshot. On a good connection with a fast model, this is 1.5 to 3 seconds per step. A task that takes a human 90 seconds of fluid action might take a CUA agent 8 to 15 minutes of sequential steps. That’s not just a product latency problem; it’s also a cost and session reliability problem. The longer the session, the more chances for something to go wrong, the more tokens consumed, the more expensive the run.
The compounding reliability problem. I’ve written about the march of nines for agent reliability elsewhere. For computer use it’s especially acute because tasks are long-horizon by nature. A ten-step workflow where each step succeeds 92% of the time has an end-to-end success rate of about 43%. That’s the math. The 72% on OSWorld that looks impressive represents expert-level framework tuning, optimal prompting, retry logic, and careful task selection. Real-world enterprise workflows have more steps, more ambiguity, and less tolerance for retries.
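The arithmetic behind that 43% is worth having in a form you can play with. This sketch assumes every step succeeds or fails independently with the same probability, which is a simplification since real failures tend to cluster. The function names are mine.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to hit `target` end-to-end over `steps` steps."""
    return target ** (1.0 / steps)


ten_step = end_to_end_success(0.92, 10)
print(round(ten_step, 3))  # 0.434

# Run it the other way: what does each step need for 90% end-to-end over 10 steps?
needed = required_per_step(0.90, 10)
```

Running it in reverse is the sobering part: a 90% end-to-end target over ten steps needs each step to succeed roughly 99% of the time, and the requirement tightens as workflows get longer.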
Realtime Voice: The Other Half of the Shift
A computer-use agent is compelling on its own. An agent that can also talk to you while it works — receiving your spoken instructions, narrating its actions, asking for clarification before it submits the purchase — is a qualitatively different experience.
This is where realtime voice agents enter. And the engineering is different enough that it’s worth treating separately.
OpenAI’s Realtime API shipped in October 2024, went through significant maturation through 2025, and by late 2025 had a production-grade gpt-realtime model and a cost-optimized gpt-realtime-mini. The architecture processes audio directly rather than doing speech-to-text → LLM → text-to-speech in a pipeline. The model hears your audio, reasons, and speaks back. No intermediate text. This preserves prosodic information — tone, hesitation, emphasis — that gets destroyed in a text transcription, and it eliminates two conversion steps from the latency budget.
The practical latency numbers: with the Realtime API, time-to-first-audio is in the 300–600ms range for typical queries, which puts it inside the window where conversation feels responsive. (The human expectation for conversational response is 300–500ms. Go beyond 800ms and it starts feeling like you’re waiting for something.) OpenAI’s published engineering work on their low-latency infrastructure describes the system as requiring global point-of-presence to keep network round-trip under 50ms, because the model’s own latency already consumes most of the budget.
Google’s Gemini Live API (formerly called Multimodal Live) reached general availability on Vertex AI with the Gemini 2.5 Flash Native Audio model. The architecture is similar in philosophy: native audio in, native audio out, via WebSocket or WebRTC session. It handles interruptions mid-sentence — if you start talking while the model is speaking, it detects this and stops, which is the foundational requirement for natural turn-taking. Google has productized this for exactly the kinds of ambient assistant scenarios you’d expect: shopping assistants, gaming NPCs, vehicle interfaces.
The core voice engineering problem is interruption handling. Human conversation is not turn-taking with a gavel. We talk over each other, we say “uh-huh” while the other person is speaking, we trail off and let the other person continue, we interrupt to redirect. Building a system that handles this correctly — that stops speaking within 200ms of a genuine user interruption, while not stopping every time there’s background noise or a filler word — is genuinely hard. The academic literature on this is growing fast; a paper called Full-Duplex-Bench-v3 from 2026 specifically tests tool use under realistic disfluency conditions and finds that current systems still have meaningful gaps in understanding when an interruption is a redirect versus a confirmation signal.
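A rough sketch of the decision at the center of that problem: given that the user is making sound while the agent is speaking, should the agent stop? The thresholds, the backchannel word list, and the assumption that a partial transcript is available are all mine, chosen for illustration. Native-audio systems may make this call inside the model rather than with rules like these.

```python
from dataclasses import dataclass

# Hypothetical list of acknowledgement sounds that should not take the turn.
BACKCHANNELS = {"uh-huh", "mm-hmm", "mhm", "yeah", "right", "ok", "okay", "sure"}


@dataclass
class SpeechEvent:
    duration_ms: int   # how long voice activity has lasted so far
    partial_text: str  # best-effort partial transcript; may be empty


def should_stop_speaking(event: SpeechEvent, min_speech_ms: int = 150) -> bool:
    """Decide whether user audio during agent speech is a genuine interruption."""
    if event.duration_ms < min_speech_ms:
        # Too short to be speech: likely a cough, a click, or background noise.
        return False
    words = [w.strip(".,!?").lower() for w in event.partial_text.split()]
    words = [w for w in words if w]
    if words and all(w in BACKCHANNELS for w in words):
        # The listener is acknowledging, not redirecting.
        return False
    # Sustained speech with real content, or with no transcript yet: yield the turn.
    return True


noise = should_stop_speaking(SpeechEvent(80, ""))
ack = should_stop_speaking(SpeechEvent(400, "Uh-huh."))
redirect = should_stop_speaking(SpeechEvent(400, "wait, go back"))
```

The 150ms default is there to leave headroom inside a 200ms stop budget. The weak spot is the one the paragraph above names: a word like “okay” can be a confirmation or the start of a redirect, and a word list cannot tell which.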
The other budget problem is cost: the Realtime API costs about $32 per million input tokens and $64 per million output tokens in audio terms (roughly $0.06/min input, $0.24/min output as of 2026). A 30-minute voice agent session could run several dollars of API cost before you’ve done anything productive with it. That’s fine for high-value transactions. It’s not fine for ambient always-on assistant scenarios.
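Here is that estimate as a back-of-envelope calculation, using the per-minute figures quoted above. The split between input and output minutes is my assumption: input audio streams for the whole session and the agent speaks for a third of it. The sketch does not model any charges for accumulated conversation context, so treat the result as a floor.

```python
INPUT_USD_PER_MIN = 0.06   # per-minute audio input figure quoted above
OUTPUT_USD_PER_MIN = 0.24  # per-minute audio output figure quoted above


def session_cost(input_minutes: float, output_minutes: float) -> float:
    """Audio-only cost estimate in USD for one voice session."""
    return input_minutes * INPUT_USD_PER_MIN + output_minutes * OUTPUT_USD_PER_MIN


# Assumed scenario: a 30-minute session with the microphone streaming
# throughout and the agent speaking for 10 of those minutes.
cost = session_cost(input_minutes=30, output_minutes=10)
print(f"${cost:.2f}")  # $4.20
```

Multiply that by an always-on assistant running eight hours a day and the difference between a transactional use case and an ambient one becomes obvious.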
What Combining Them Actually Looks Like
Here’s the demo I’ve been describing to founders at Zero. Imagine this scenario.
You’re onboarding a new business in Dubai — you need to register with the relevant free zone, upload documents, fill out a multi-page application form, and then coordinate a video call with their licensing office. You start a session with a combined voice+CUA agent. You describe what you need verbally. The agent, which can hear you and see your screen, opens the free zone’s website, starts filling out the application. It reads the form fields aloud as it encounters them, asks you to confirm before it submits anything sensitive, and flags when it hits a section it’s not sure about (“This field says ‘share structure’ — do you want me to use the percentage you mentioned or should I leave it for you to fill in?”). When you get to the document upload section it tells you what’s needed and waits while you drag the files over. Then it submits.
What’s technically happening: a voice pipeline (Realtime API or Gemini Live) is running in parallel with a computer-use loop. The voice agent receives your speech, the CUA observes the screen. They share state — the voice model knows what page the CUA is on, the CUA receives instruction updates from the voice model. Both run on the same underlying context window.
This sounds simple. It’s not. The hardest part is state synchronization. The CUA is advancing through a long sequence of actions asynchronously. The voice agent needs to know, at any moment, what step the CUA is on so it can narrate correctly and respond to interruptions (“wait, go back one page” has to actually stop the CUA mid-action, re-capture state, and replan). Action-level interruption handling is a harder engineering problem than conversational interruption handling because the “speech” in this case is a sequence of irreversible actions on a live application.
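A minimal sketch of the synchronization piece, assuming a shared state object that both the voice side and the CUA side can touch. `SharedState`, `run_plan`, and the step names are hypothetical. The voice side is simulated here by a callback rather than a real audio thread, and the interrupt is only checked between actions, so an action already in flight still completes.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SharedState:
    """State visible to both the voice agent and the computer-use loop."""
    current_step: int = 0
    instruction: Optional[str] = None
    interrupt: threading.Event = field(default_factory=threading.Event)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def request_interrupt(self, instruction: str) -> None:
        """Called by the voice side when the user redirects mid-task."""
        with self.lock:
            self.instruction = instruction
        self.interrupt.set()


def run_plan(plan: List[str], act: Callable[[str], None], state: SharedState) -> str:
    """Execute planned steps in order, checking for an interrupt before each one."""
    for index, step in enumerate(plan):
        if state.interrupt.is_set():
            # Caller is expected to re-capture the screen and replan from here.
            return "interrupted"
        with state.lock:
            state.current_step = index  # lets the voice side narrate accurately
        act(step)
    return "completed"


state = SharedState()
performed: List[str] = []


def act(step: str) -> None:
    performed.append(step)
    if step == "fill company name":
        # Simulated voice event arriving while the CUA is mid-plan.
        state.request_interrupt("wait, go back one page")


result = run_plan(["open form", "fill company name", "click submit"], act, state)
```

The important property is that “click submit” never runs: the interrupt lands before the irreversible step. What the sketch leaves out is the hard part, which is deciding how to undo or replan around the steps that already happened.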
The safety architecture is also non-trivial. You don’t want the voice agent to have unbounded control. The permission model needs to match what you’d give a capable but not fully trusted assistant: submit forms, yes; authorize payments over a threshold, no; access your password manager, absolutely not. Both Anthropic and OpenAI have built in confirmation patterns — the agent pauses before high-stakes irreversible actions — but the criteria for what counts as high-stakes are not always calibrated correctly for a given workflow.
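The permission model above can be written down as a small policy function. This is a sketch under my own assumptions: the action kinds, the $100 threshold, the choice to ask for confirmation on payments under the threshold, and the default of asking about anything unrecognized are illustrative, not how Anthropic or OpenAI implement their confirmation patterns.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # pause and ask the human by voice
    DENY = "deny"


@dataclass(frozen=True)
class ProposedAction:
    kind: str                # e.g. "submit_form", "payment", "open_app"
    amount_usd: float = 0.0
    target: str = ""


# Hypothetical policy values; tune these per workflow.
BLOCKED_TARGETS = {"password_manager"}
PAYMENT_LIMIT_USD = 100.0


def review(action: ProposedAction) -> Decision:
    """Map a proposed action to allow, confirm, or deny before it executes."""
    if action.target in BLOCKED_TARGETS:
        return Decision.DENY  # never, regardless of what the user said
    if action.kind == "payment":
        if action.amount_usd > PAYMENT_LIMIT_USD:
            return Decision.DENY
        return Decision.CONFIRM  # assumption: small payments still need a yes
    if action.kind == "submit_form":
        return Decision.ALLOW
    return Decision.CONFIRM  # unknown action kinds default to asking


form = review(ProposedAction("submit_form", target="free_zone_portal"))
big_payment = review(ProposedAction("payment", amount_usd=2500.0))
vault = review(ProposedAction("open_app", target="password_manager"))
```

Checking the blocked targets first means no other rule can accidentally open the password manager, and defaulting unknown actions to `CONFIRM` keeps a miscalibrated policy from failing silently in the permissive direction.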
Where This Is Headed
My honest read of where we are: the technology is real but the production reliability is at three to four nines for narrow, well-defined tasks in controlled environments. For broad autonomous operation on arbitrary enterprise workflows, we’re still at two nines on a good day.
The trajectory matters more than the current number. OSWorld went from ~20% to near-human baselines in 18 months. The Realtime API went from a research demo to a production model with 128K context window, configurable reasoning levels, and 20% price cuts in roughly the same period. The direction is clear.
I think the near-term winners are not “do everything” general agents — those are still reliability nightmares — but agents that own a well-defined workflow end-to-end within a constrained domain. Travel booking within a corporate policy envelope. Vendor invoice processing where the fields are consistent. Government form submission where the outcome is high-value enough to justify a longer, more careful session. Customer onboarding flows where the CUA knows the company’s own web application intimately.
The longer-term shift, and I’ll frame this as my opinion, is that the concept of a “software interface” as we currently understand it becomes optional. Applications that expose APIs will be consumed by agents via APIs. Applications that don’t will be consumed by agents via their visual interface. The question “is this app agent-friendly?” stops mattering because every app is agent-usable regardless of whether it was designed for it. That’s a profound shift for how software gets built and how organizations think about access, authorization, and audit.
I’m building workflows at Zero that combine CUA and voice for specific student onboarding and administrative tasks. The honest state of things as of mid-2026: I wouldn’t let it run fully autonomously on anything I couldn’t reverse. I have a human reviewing what it did every few hours. That’s not a criticism — it’s an honest calibration of where we are in the march. The direction is clear. The gap between demo and production is the usual gap: all the edge cases that only show up when real users, with their non-standard files and unexpected form states and different browsers, start using the thing.
The interface between humans and computers is changing. Not slowly. The change is happening in quarterly increments of reliability. Voice plus computer use is the direction. The question isn’t whether it gets there — it’s how fast you get your processes ready for the agents that will run them.