Most communication tools analyze what you say. They take your transcript, run it through some NLP, and give you feedback on word choice, structure, filler words, maybe sentence complexity.
That’s useful. But it misses the thing that actually changes outcomes.
In high-stakes communication, whether you are pitching investors, presenting to a board, closing a deal, or delivering a keynote, what separates good from great is not what you say. It’s how you say it. Your voice. Your pacing. Where you pause. What you emphasize. Whether you project authority or uncertainty. Whether your physical presence commands attention or bleeds nervous energy.
Transcript analysis can’t see any of that.
Today we are shipping multimodal analysis on Tough Tongue AI. Upload a video or audio recording, and the AI evaluates your full delivery, not just the words. Here is how it works and where to use it.
The Problem with Transcript-Only Analysis
Most analysis tools on the market today follow the same pattern. They take your audio, convert it to text, and analyze the transcript. Some count filler words. Some measure speaking pace. Some evaluate sentence structure and vocabulary.
This gives you feedback on content. But content is only half of communication.
Think about what actually happens when you are in the room. The people listening to you are not reading a transcript. They are watching your face, reading your posture, hearing the confidence or hesitation in your voice, noticing where you pause and where you rush. Their judgment of your effectiveness is based on the full experience.
A transcript-based tool evaluates something different from what your audience is experiencing. That gap is where the most important feedback lives.
Multimodal analysis closes that gap. It evaluates the same signals your audience evaluates: voice, video, and words together.
How It Works
The setup takes three steps:

- Create a Scenario with multimodal analysis switched on.
- Define your rubric: the rubric tells the AI exactly what to evaluate and how to score it. Define the dimensions, the scoring criteria, the evidence the AI should look for, and even the format of the output. See the reference rubric in the scenario shared at the end of this post.
- Upload and analyze: once the rubric is set, users upload their video. The AI processes the recording across all dimensions simultaneously, then produces a structured report.
Best Practices for Writing Rubrics
The rubric is what makes this powerful. A vague rubric produces vague feedback. A specific rubric produces specific, actionable feedback. Here is what we have learned from building rubrics across dozens of use cases.
Tell the AI What Signals to Look For
For each dimension, do not just name it. Describe the specific audio and visual evidence the AI should watch for.
For example, instead of just writing “Evaluate confidence,” write: “Look for nervous laughter, rushed pace bursts, fidgeting (visual), throat tightness (audio), over-qualifying or over-apologizing. Note how the speaker recovers from stumbles: do they freeze, or do they move through it smoothly?”
This specificity is what turns generic AI output into expert-level analysis.
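To make this concrete, here is a minimal sketch of a single dimension expressed as plain data, pairing the dimension with its evidence cues. The structure and field names are illustrative, not the platform's actual schema:

```python
# Illustrative only: one rubric dimension captured as plain data, pairing
# the dimension name with the concrete evidence the AI should watch for.
# The field names are hypothetical, not the platform's actual schema.
confidence_dimension = {
    "name": "Confidence",
    "evidence_cues": {
        "audio":  ["nervous laughter", "rushed pace bursts", "throat tightness"],
        "visual": ["fidgeting", "freezing after a stumble"],
        "verbal": ["over-qualifying", "over-apologizing"],
    },
    "recovery_check": (
        "Note how the speaker recovers from stumbles: do they freeze, "
        "or move through it smoothly?"
    ),
}
```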
Define Clear Scoring Levels
For each parameter, describe what each score range looks like in practice. This prevents the AI from clustering everything at 7/10.
In our reference rubric, each parameter has five explicit levels:
| Score | Level | What it looks like |
|---|---|---|
| 9-10 | Elite | Exceptional; polished, high-stakes ready |
| 7-8 | Strong | Confident and effective; minor refinement needed |
| 5-6 | Developing | Message lands but gaps are noticeable |
| 3-4 | Inconsistent | Significant issues weaken the delivery |
| 0-2 | Needs Work | Fundamental barriers to effective communication |
When you describe each level with concrete behaviors (not just adjective labels), the AI calibrates its scoring much more accurately.
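If you post-process reports programmatically, the same bands are easy to encode. A hypothetical sketch (not platform code) that maps a numeric score back to its level:

```python
# Hypothetical sketch: the five levels from the table above as data, with a
# helper that maps a numeric score back to its band.
SCORE_LEVELS = [  # (lower bound, label, description), ordered high to low
    (9, "Elite", "Exceptional; polished, high-stakes ready"),
    (7, "Strong", "Confident and effective; minor refinement needed"),
    (5, "Developing", "Message lands but gaps are noticeable"),
    (3, "Inconsistent", "Significant issues weaken the delivery"),
    (0, "Needs Work", "Fundamental barriers to effective communication"),
]

def level_for(score: float) -> str:
    """Return the level label for a score on the 0-10 scale."""
    for lower_bound, label, _description in SCORE_LEVELS:
        if score >= lower_bound:
            return label
    raise ValueError(f"score out of range: {score}")

print(level_for(9.7))  # "Elite"
```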
Require Timestamps for Every Claim
Add an explicit instruction: “Every piece of evidence MUST include a timestamp in MM:SS format.” This forces the analysis to be grounded in specific moments rather than vague general impressions. It also makes the feedback immediately actionable, because the speaker can jump to exactly the moment the AI is referencing.
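If you want to verify this programmatically, a few lines suffice. A hedged sketch (the pattern accepts M:SS as well as MM:SS, since timestamps like 4:23 drop the leading zero; the evidence strings are hypothetical):

```python
import re

# Check that an evidence string cites at least one M:SS or MM:SS timestamp,
# per the rubric instruction above.
TIMESTAMP_RE = re.compile(r"\b\d{1,2}:[0-5]\d\b")

def has_timestamp(evidence: str) -> bool:
    """True if the evidence string contains at least one timestamp."""
    return bool(TIMESTAMP_RE.search(evidence))

assert has_timestamp("Lost eye contact at 4:23 while reading from notes")
assert not has_timestamp("The speaker seemed generally nervous")
```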
Use Weights to Prioritize
Not every dimension matters equally for every use case. Assign percentage weights that reflect what matters most in the context you are evaluating. Our reference rubric weights physical presence and vocal authority at 15% each (because they are the most impactful delivery signals in live presentations) while visual storytelling is at 9% (important but secondary to mechanics). Adjust these for your context. A podcast rubric might weight vocal authority at 25% and physical presence at 0%.
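Under the hood this is just a weighted average. An illustrative sketch with hypothetical dimension names, scores, and podcast weights; only the 15%/15%/9% figures above come from the reference rubric:

```python
# A minimal sketch of weighted scoring. The dimension names, scores, and
# podcast weights below are hypothetical examples, not the reference rubric.
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; weights must sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must total 100%"
    return sum(scores[dim] * w for dim, w in weights.items())

# Hypothetical podcast rubric: vocal authority up-weighted to 25%,
# physical presence dropped to 0% (so it is simply omitted).
podcast_weights = {"vocal_authority": 0.25, "pacing": 0.35, "content": 0.40}
podcast_scores  = {"vocal_authority": 8.0,  "pacing": 6.5,  "content": 9.0}

print(overall_score(podcast_scores, podcast_weights))  # 7.875
```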
Where to Use Multimodal Analysis
Preparing for High-Stakes Presentations
You have an investor pitch next week. You have rehearsed the content. You know your numbers. But you have no idea how your delivery is landing.
Record yourself giving the pitch. Upload the video. Get a full multimodal breakdown that tells you: your vocal authority was strong through the problem statement but dropped during the pricing slide. Your pacing was excellent in the opening but you rushed through your differentiation section without pausing after your strongest data point. You lost eye contact at 4:23 when you started reading from your notes and did not recover it for 18 seconds.
Now you know exactly what to fix. Record again. Upload again. Compare scores. Repeat until the delivery matches the quality of the content.
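The comparison itself is simple arithmetic: a per-dimension diff between two reports. A minimal sketch, with hypothetical dimension names and scores:

```python
# A minimal sketch of the "compare scores" step: diff per-dimension scores
# between two uploads. Dimension names and numbers are hypothetical.
def score_delta(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-dimension change between two analysis runs."""
    return {dim: round(after[dim] - before[dim], 1) for dim in before}

run_1 = {"vocal_authority": 6.5, "pacing": 7.0, "eye_contact": 5.0}
run_2 = {"vocal_authority": 7.5, "pacing": 7.0, "eye_contact": 6.5}

print(score_delta(run_1, run_2))
# {'vocal_authority': 1.0, 'pacing': 0.0, 'eye_contact': 1.5}
```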
This practice loop is what separates adequate preparation from thorough preparation. Content gets you in the room. Delivery determines what happens once you are there.
Sales Coaching and Enablement
Sales teams need feedback on delivery, not just messaging. A rep might say all the right things on a discovery call but deliver them with hesitation that undermines credibility. Transcript analysis would give them a passing grade. Multimodal analysis catches the gap.
Build a rubric specific to sales delivery: vocal confidence during pricing conversations, composure when handling objections, pacing during discovery (slowing down to listen vs. rushing to the next question), and physical presence during video calls. Upload call recordings and get specific, timestamped feedback that a sales manager would take hours to produce manually.
Communication Coaching at Scale
If you run a coaching practice or a training organization, you know the bottleneck: your coaches’ time.
Watching full recordings, identifying issues, noting timestamps, formulating feedback. This is time-intensive work. And a lot of it is pattern recognition that happens before the real coaching even starts.
Multimodal analysis handles that first pass. It watches the entire recording, identifies the specific moments worth discussing, notes the timestamps, and provides the initial assessment across every dimension in your rubric.
Your coaches then start their sessions already knowing: “At 3:12, the client’s authority dropped during the pricing discussion. At 7:45, their pacing was excellent during the story but fell apart during the transition to the ask. At 11:20, they lost eye contact for 15 seconds during the objection response.”
The coach’s time goes into the deep work: understanding why these patterns happen, helping the client develop strategies to address them, building confidence in specific areas. Not into watching hours of video hunting for moments.
This changes the economics of coaching. More clients get detailed feedback. Coaches spend their limited hours on the work only humans can do.
Executive Communication Programs
Leaders are evaluated on presence as much as substance. How a CEO delivers a quarterly update, how a VP presents a strategy to the board, how a manager handles a difficult all-hands question. These moments are defined by delivery.
Build a rubric focused on executive presence: composure under pressure, vocal authority during difficult messages, physical grounding, audience engagement during Q&A, and the ability to create impact through strategic pausing. Upload recordings of rehearsals or actual presentations and get feedback calibrated to what “executive-ready” looks like.
Public Speaking and Keynote Preparation
We tested this on one of the most famous speeches of the last two decades: Barack Obama’s 2004 Democratic National Convention keynote. The analysis scored it 9.7 out of 10.
The value was not the score. It was the specifics. The AI identified exact timestamps where Obama’s vocal modulation created emphasis, where his pacing shifted to build tension, where pauses let key phrases resonate, and where his physical presence reinforced the authority of his words. It also found the rare moments where language was less precise or a transition could have been sharper.
See the full analysis: Obama Speech Analysis
This is the kind of feedback that used to require an experienced communication coach watching your performance frame by frame. Now it takes minutes.
Try It Yourself
We have built a speech analysis scenario with the full 8-dimension multimodal rubric described in this post, ready to go. Upload your own video and see the analysis in action.
Try the Speech Analyst scenario
Feel free to remix the scenario and customize the rubric for your specific use case. Adjust the dimensions, change the weights, add your own scoring criteria, and define the evidence cues that matter for your context. The rubric is what makes this yours.
If you want to build a rubric for a different use case (sales delivery, executive presence, classroom teaching, interview performance) and want help structuring it, reach out. We are happy to help.
About Tough Tongue AI:
We build AI agents for high-stakes communication practice and evaluation. Multimodal analysis is available today for all users uploading video or audio recordings.
Platform: app.toughtongueai.com
Contact: help@getarchieai.com