AI that listens to your calls and tells you what to ask next

How does real-time coaching work?

Inside the engineering: the sub-two-second pipeline, dual-stream audio capture, and the discreet overlay that puts guidance on screen without anyone else seeing it.

Works on Zoom, Teams & Google Meet · Mac & Windows · 7-day free trial

ConversationPilot live overlay (example)
Objection Handling: They're comparing you to a competitor.
↳ “What would make us the clear choice over them for your team?”
Next best question: “When does your current contract renew?”
Speaking analytics: You 38% · Prospect 62% · 12 questions · 2 interruptions · 0 monologues

Real-time coaching works by completing a full audio-to-guidance pipeline in under two seconds: it captures both speakers as separate streams, transcribes them live, detects the moment that matters, generates a prompt, and renders it in a discreet on-screen overlay — fast enough that the rep can act before their next sentence. The entire system is engineered around latency, because a coaching prompt is only real-time if it arrives while the moment is still open. ConversationPilot is built around exactly this constraint.

The difference between real-time coaching and ordinary call analysis is not what it detects but when. The same objection, the same buying signal, the same talk-time problem can be surfaced after the call by many tools. Surfacing it inside two seconds, on the rep's screen, while the call is live, is an engineering problem — and solving it is what makes the coaching change the call rather than merely explain it.

This page goes inside the pipeline: how the dual-stream capture works, why latency is budgeted across fast and slow models, and how the overlay delivers guidance invisibly. We use ConversationPilot as the worked example throughout.

The sub-two-second budget

Real-time coaching lives or dies on latency, so the system is designed around a strict time budget: from the moment the prospect stops speaking to the moment a prompt appears, the whole pipeline must complete in under two seconds. That window is short enough to feel like a colleague whispering in your ear and long enough to do real reasoning.

Meeting it requires discipline at every stage. Transcription runs continuously so there is no wait for a chunk of audio to finish. Detection and prompt generation run on a fast model — Claude Haiku 4.5 — chosen specifically for low latency. Anything that does not have to happen live is deliberately moved off the critical path. ConversationPilot treats the two-second budget as a hard constraint that shapes the architecture, not a nice-to-have. The result is guidance you can act on before your next sentence, which is the entire point of real-time.
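To make the idea of a budget concrete, here is a minimal sketch that splits the two-second window across stages and checks a measured run against it. The stage names and per-stage millisecond figures are illustrative assumptions, not ConversationPilot's actual numbers; only the two-second total comes from the text above.

```python
from dataclasses import dataclass

TOTAL_BUDGET_MS = 2000  # the only figure taken from the text

# Hypothetical per-stage allocations (assumed for illustration only).
STAGE_BUDGET_MS = {
    "transcribe_tail": 400,   # finalise the last words of the utterance
    "detect": 500,            # fast model decides whether this is a coachable moment
    "generate_prompt": 800,   # fast model writes the one-line prompt
    "render": 100,            # overlay draws the line
}


@dataclass
class BudgetReport:
    total_ms: int
    within_budget: bool
    overruns: dict[str, int]  # stage -> milliseconds over its allocation


def check_budget(measured_ms: dict[str, int]) -> BudgetReport:
    """Compare measured stage timings against per-stage and total budgets."""
    overruns = {
        stage: spent - STAGE_BUDGET_MS[stage]
        for stage, spent in measured_ms.items()
        if spent > STAGE_BUDGET_MS[stage]
    }
    total = sum(measured_ms.values())
    return BudgetReport(total, total < TOTAL_BUDGET_MS, overruns)
```

A run can stay inside the total while one stage overruns its share; reporting both makes it visible when a single stage is eating the headroom the others leave.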

Signal detection
Budget mentioned · Decision maker · Competitor: Looker · Renewal: March

Dual-stream audio capture

The pipeline starts with how the audio is captured, and the design choice is consequential. ConversationPilot captures two separate streams — your microphone and the meeting or system audio — rather than recording one mixed channel and trying to separate the speakers afterward.

This buys two things that matter for real-time. First, accuracy: the system knows from the first word which side is speaking, because the channel itself identifies the speaker, so rep-versus-prospect attribution never has to be inferred. Second, speed: with no diarisation step on the critical path, the pipeline saves the latency it would otherwise spend working out who said what. Exact attribution also makes the speaking analytics (talk-to-listen ratio, interruptions, monologue detection) precise rather than estimated. Dual-stream capture is the foundation that lets the rest of the pipeline be both fast and trustworthy.
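The sketch below shows why per-stream attribution makes the speaking analytics simple arithmetic. It assumes each stream has already been cut into timed speech segments (for example by voice-activity detection); the `Segment` type, the definition of an interruption, and the 60-second monologue threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str   # "rep" (microphone stream) or "prospect" (meeting audio stream)
    start: float   # seconds from call start
    end: float


MONOLOGUE_SECONDS = 60.0  # assumed threshold for "running long"


def speaking_analytics(segments: list[Segment]) -> dict[str, float]:
    """Talk share, interruptions and monologues for the rep."""
    rep = [s for s in segments if s.speaker == "rep"]
    prospect = [s for s in segments if s.speaker == "prospect"]

    rep_time = sum(s.end - s.start for s in rep)
    prospect_time = sum(s.end - s.start for s in prospect)
    total = rep_time + prospect_time

    # Assumed definition: the rep starts talking while the prospect is mid-segment.
    interruptions = sum(
        1 for r in rep if any(p.start < r.start < p.end for p in prospect)
    )
    monologues = sum(1 for r in rep if r.end - r.start >= MONOLOGUE_SECONDS)

    return {
        "rep_share": rep_time / total if total else 0.0,
        "interruptions": interruptions,
        "monologues": monologues,
    }
```

Because the speaker label comes from the stream rather than from a diarisation model, none of these numbers depends on a guess about who was talking.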

Live scorecard
Need: Covered
Budget: Partial
Authority: Covered
Timeline: Open
Competition: Covered
Call score: 78 (strong qualification)

Live transcription and fast detection

With clean, separated audio flowing in, the pipeline transcribes continuously using Whisper, converting both streams to text as the conversation happens. Because transcription is continuous rather than batched, the latest words are always available the instant the model needs them.
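As a rough sketch of what "the latest words are always available" can mean in practice, the code below interleaves two already-transcribed, time-ordered streams into one speaker-labelled transcript and takes the most recent window for a detector to read. Transcription itself is out of scope here; the `Utterance` type, the function names and the 30-second window are assumptions.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Utterance(NamedTuple):
    start: float   # seconds from call start
    speaker: str   # fixed by the stream the text came from
    text: str


def merge_streams(mic: Iterable[Utterance],
                  meeting: Iterable[Utterance]) -> Iterator[Utterance]:
    """Interleave two time-ordered transcripts into one, ordered by start time."""
    return heapq.merge(mic, meeting, key=lambda u: u.start)


def recent_window(transcript: list[Utterance], now: float,
                  seconds: float = 30.0) -> list[Utterance]:
    """The slice a detector would read: utterances from the last `seconds`."""
    return [u for u in transcript if now - u.start <= seconds]
```

`heapq.merge` is lazy, so it fits a setting where both streams keep producing text for the length of the call.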

Detection then reads that live transcript and identifies the moment to coach on: an objection landing, a buying signal surfacing, a qualification gap, a monologue running long. This runs on the fast model so it keeps pace with the conversation. Crucially, detection interprets intent rather than matching keywords, so it understands that "we already have something for this" is a status-quo objection even though the sentence contains no explicit objection phrase. The combination of continuous transcription and fast, intent-aware detection is what lets the system know, within the time budget, both that a coachable moment has arrived and exactly what kind it is.
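The contrast between keyword matching and intent detection can be sketched as follows. The keyword list and label strings are illustrative assumptions, and `classify` is a placeholder callable standing in for the fast model; no particular model API is implied.

```python
from typing import Callable, Optional

# Moment types named in the text; the label strings themselves are assumed.
MOMENTS = ("objection", "buying_signal", "qualification_gap", "monologue")

OBJECTION_KEYWORDS = ("too expensive", "not interested", "no budget")  # illustrative


def keyword_detect(utterance: str) -> Optional[str]:
    """Naive baseline: flag an objection only if a known phrase appears."""
    lowered = utterance.lower()
    return "objection" if any(k in lowered for k in OBJECTION_KEYWORDS) else None


def intent_detect(utterance: str, classify: Callable[[str], str]) -> Optional[str]:
    """Ask a classifier (standing in for the fast model) for a moment label.

    Any label outside MOMENTS, including "none", means no coachable moment.
    """
    label = classify(utterance).strip().lower()
    return label if label in MOMENTS else None
```

With the example sentence from the text, `keyword_detect("We already have something for this")` returns `None`, while `intent_detect` returns whatever the injected classifier decides, which is where an intent-aware model earns its place.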

Splitting work across models

The cleverest part of the design is what runs where. Not all coaching work has the same latency requirement, so ConversationPilot deliberately splits it. The live, latency-critical work — detection and the in-call prompt — runs on Claude Haiku 4.5, a fast model that fits the two-second budget. The heavy, latency-tolerant work — the post-call report and deep analysis — runs separately on Claude Sonnet 4.6, a stronger model that can afford to be thorough because it is no longer racing a live call.

This split is why real-time coaching does not force a trade-off between speed and depth. If everything ran on one model, you would have to choose: fast but shallow, or deep but too slow to be live. By routing each job to the model suited to it, the rep gets instant guidance during the call and a rich, accurate report afterward. The architecture, not a compromise, is what delivers both.
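A minimal sketch of the routing idea, assuming hypothetical task names. The enum values are the model names as written in the text, used here as display labels rather than API model identifiers.

```python
from enum import Enum


class Tier(Enum):
    FAST = "Claude Haiku 4.5"      # live, latency-critical work
    STRONG = "Claude Sonnet 4.6"   # post-call, latency-tolerant work


# Hypothetical task names; the live/post-call split follows the text.
LIVE_TASKS = {"detect_moment", "generate_prompt"}
POST_CALL_TASKS = {"post_call_report", "deep_analysis"}


def route(task: str) -> Tier:
    """Pick a model tier by whether the task sits on the live critical path."""
    if task in LIVE_TASKS:
        return Tier.FAST
    if task in POST_CALL_TASKS:
        return Tier.STRONG
    raise ValueError(f"unknown task: {task}")
```

Raising on unknown tasks is a deliberate choice in this sketch: a new job has to be classified as live or post-call explicitly, rather than silently landing on the critical path.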

The overlay: guidance without exposure

A prompt is only useful if the rep can see it and the prospect cannot. The final stage of the pipeline renders the guidance in a discreet desktop overlay on Mac and Windows, sitting on top of Zoom, Microsoft Teams and Google Meet — and the overlay is hidden from screen sharing, so even when the rep shares their screen, the prompts and scorecard never appear in what the prospect sees.

No bot joins the meeting, so there is no extra participant and nothing to break the natural feel of the call. The overlay is designed to be calm and glanceable: a single line of guidance when it genuinely helps, silence otherwise, never a dashboard demanding attention. That restraint keeps the rep present with the prospect rather than buried in a screen. The overlay is the visible end of an otherwise invisible pipeline — and getting it discreet and unobtrusive is what makes real-time coaching usable on real calls rather than only in practice.

Why the pipeline works on any call type

Because the pipeline is built around audio rather than a specific meeting platform, it works wherever there is a conversation. On video, the overlay sits over Zoom, Teams and Meet. On phone calls and in-person meetings, the same dual-stream capture and the same sub-two-second loop apply. The coach works from the audio, not the app, so no call type is left uncovered.

The same engine also coaches recruitment conversations, not just sales. The pipeline is conversation-agnostic: swap in the recruitment playbook and the detection looks for talent signals (notice period, salary expectations, motivation, eligibility, counteroffer risk) while the scorecard tracks Salary, Notice Period, Motivation, Eligibility, Availability and Culture Fit. Nothing about the underlying real-time mechanics changes; only the definitions of what counts as a coachable moment do. That generality is a direct benefit of building the pipeline around the universal raw material of every call, the audio of two people talking, rather than around any one tool or use case.
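One way to picture "swap in the recruitment playbook" is as plain configuration data. In the sketch below the `Playbook` type and `new_scorecard` helper are hypothetical; the recruitment signals and scorecard fields are the ones listed above, and the sales scorecard fields and status labels are taken from the live scorecard example earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    name: str
    signals: tuple[str, ...]     # what detection looks for
    scorecard: tuple[str, ...]   # what the live scorecard tracks


SALES = Playbook(
    name="sales",
    signals=("objection", "buying signal", "qualification gap"),
    scorecard=("Need", "Budget", "Authority", "Timeline", "Competition"),
)

RECRUITMENT = Playbook(
    name="recruitment",
    signals=("notice period", "salary expectations", "motivation",
             "eligibility", "counteroffer risk"),
    scorecard=("Salary", "Notice Period", "Motivation",
               "Eligibility", "Availability", "Culture Fit"),
)


def new_scorecard(playbook: Playbook) -> dict[str, str]:
    """Every field starts as "Open" until the conversation covers it."""
    return {item: "Open" for item in playbook.scorecard}
```

Nothing in capture, transcription or timing refers to a playbook, which is the sense in which the mechanics stay the same while the definitions change.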

Keeping the guidance glanceable, not noisy

Speed is necessary for real-time coaching but not sufficient — guidance that arrives fast but constantly would simply pull the rep out of the conversation. The final design constraint is restraint. ConversationPilot condenses each suggestion to a single glanceable line and surfaces it only when it genuinely helps, rather than narrating every moment of the call.

This is a deliberate part of the pipeline, not an afterthought. The system decides not just what to say but whether to say anything at all, so the overlay stays calm and the rep stays present with the prospect. The mental model is a colleague whispering one useful thing in your ear, not a dashboard demanding to be watched. Because prompts are sparse and well-timed, they reduce the cognitive load of holding a whole playbook in your head rather than adding to it. A real-time pipeline that fired on everything would be technically impressive and practically unusable; tuning for the right moment is what makes sub-two-second coaching something a rep actually wants running on a live call.
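The "whether to say anything at all" decision can be sketched as a small gate in front of the overlay. The confidence threshold and cooldown below are assumed values chosen for illustration, and `PromptGate` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptGate:
    """Decides whether a candidate prompt is worth showing at all."""
    min_confidence: float = 0.8            # assumed threshold
    cooldown_s: float = 20.0               # assumed minimum gap between prompts
    last_shown_at: Optional[float] = None  # call time of the last shown prompt

    def should_show(self, confidence: float, now: float) -> bool:
        if confidence < self.min_confidence:
            return False  # not clearly helpful: stay silent
        if self.last_shown_at is not None and now - self.last_shown_at < self.cooldown_s:
            return False  # too soon after the previous prompt
        self.last_shown_at = now
        return True
```

For example, with the defaults a prompt shown at 10 seconds suppresses another strong candidate at 15 seconds, and a weak candidate is dropped regardless of timing.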

Building the overlay to disappear from screen shares

One specific engineering requirement defines whether real-time coaching is usable in customer-facing selling: the guidance must be invisible to the other side even when the rep shares their screen. A prompt the prospect can see is worse than no prompt at all. ConversationPilot's overlay is built to be excluded from screen capture, so it renders on the rep's display but not in the shared stream.

This is harder than it sounds and is exactly the kind of detail that separates a real-time tool meant for live calls from a demo. Combined with the no-bot approach — nothing joins the meeting as a participant — it means the call looks completely ordinary to everyone but the rep. The rep gets a private coaching layer; the prospect gets an attentive conversation with someone who happens to ask great questions and never fumbles an objection. Getting this invisibility right is the last mile of the real-time pipeline, and it is what lets the whole system live on genuine customer calls rather than being confined to internal practice and role-play sessions. You remain responsible for complying with applicable call-recording and consent laws in your jurisdiction.
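The source does not say how the overlay is excluded from capture. As one possible approach on Windows, the Win32 call `SetWindowDisplayAffinity` with `WDA_EXCLUDEFROMCAPTURE` asks the system to omit a window from screen capture (Windows 10 version 2004 and later); on macOS the comparable AppKit setting is a window's `sharingType`. The sketch below wraps the Windows call and is an assumption about technique, not a description of ConversationPilot's implementation.

```python
import sys

WDA_EXCLUDEFROMCAPTURE = 0x00000011  # Win32 constant


def exclude_from_capture(hwnd: int) -> bool:
    """Ask Windows to omit a window owned by this process from screen capture.

    Returns False on other platforms or if the call fails.
    """
    if sys.platform != "win32":
        return False
    # Imported here because ctypes.WinDLL only exists on Windows.
    import ctypes
    from ctypes import wintypes

    user32 = ctypes.WinDLL("user32", use_last_error=True)
    user32.SetWindowDisplayAffinity.argtypes = [wintypes.HWND, wintypes.DWORD]
    user32.SetWindowDisplayAffinity.restype = wintypes.BOOL
    return bool(user32.SetWindowDisplayAffinity(hwnd, WDA_EXCLUDEFROMCAPTURE))
```

The window handle has to belong to the calling process, so this would be called by the overlay application on its own window.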

Real-time pipeline vs. post-call processing

Capability | ConversationPilot AI | Post-call processing
Latency to guidance | Under 2 seconds | Minutes to hours later
Audio capture | Dual-stream, separated | Mixed recording
Speaker attribution | Exact, no diarisation lag | Inferred afterward
Model strategy | Fast live + strong for reports | One batch pass
Delivery | Discreet on-screen overlay | A report you open later
Changes the live call | Yes | No

Frequently asked questions

How does real-time coaching work?

It completes a full audio-to-guidance pipeline in under two seconds: capture both speakers as separate streams, transcribe live, detect the moment that matters, generate a prompt, and render it in a discreet overlay. ConversationPilot is engineered around that sub-two-second budget so the rep can act before their next sentence.

What is dual-stream capture and why does it matter?

ConversationPilot captures your microphone and the meeting audio as two separate streams instead of one mixed channel. It knows who's speaking from the first word, which makes attribution exact and removes a diarisation step from the critical path — improving both accuracy and speed.

How does it stay under two seconds?

Transcription runs continuously so there's no wait, detection and prompts run on a fast model (Claude Haiku 4.5), and anything that doesn't need to be live — like the deep report — is moved off the critical path onto a stronger model. The two-second budget shapes the whole architecture.

Why use two different models?

Live prompts need speed; deep analysis needs depth. Running both on one model would force a trade-off. ConversationPilot routes live detection and prompts to a fast model and the post-call report to a stronger one, so it delivers instant guidance and a thorough report without compromise.

Can the prospect see the overlay?

No. The overlay is hidden from screen sharing and only the rep can see it, and no bot joins the meeting. The call looks completely ordinary to the other side. You remain responsible for complying with call-recording and consent laws in your jurisdiction.

Does the real-time pipeline work on phone and in-person calls?

Yes. Because the pipeline is built around audio rather than a specific app, the same dual-stream capture and sub-two-second loop work on Zoom, Teams and Meet, plus phone and in-person conversations. The coach works from the audio, not the platform.

Have a world-class coach in every conversation

Real-time prompts, objection handling and qualification — while the call is happening.
