Frameworks

What to measure quarterly: a GEO scorecard

If you're not measuring AI visibility quarterly, you're flying blind in a channel that's compounding faster than any other. But "measure it" is easier said than done — most teams either over-engineer it (50 dashboards nobody reads) or under-engineer it (one screenshot from ChatGPT). This is the working scorecard we use with clients. Eight metrics. Quarterly cadence. Genuinely actionable.

By Gareth Hoyle · Published 25 April 2026 · Read time 12 min
TL;DR

A quarterly GEO scorecard tracks eight numbers across three categories: visibility (where you stand), velocity (where you're going), and competitive position (where everyone else is). The scorecard isn't a dashboard for the sake of having one — each metric maps to a specific decision the marketing leader needs to make next quarter. Measure what changes a budget, not what looks good on a slide.

Why quarterly

Daily measurement is wasteful. AI engine answers vary day to day for reasons that don't matter (model temperature, retrieval timing, weekday vs weekend traffic). Tracking those fluctuations is tracking noise, not signal.

Annual measurement is too slow. AI search is moving fast enough that you'll miss compounding losses or compounding wins for 12 months — too long to course-correct.

Quarterly is the cadence that matches actual decision-making in marketing functions. Quarterly is when you set budget, when you brief agencies, when you report to executives, when you adjust priorities. Match the measurement to the decision.

The other thing quarterly cadence forces: enough data accumulation to smooth out noise. A 90-day window of AI engine queries gives you statistical signal in a way that 7- or 30-day windows don't.

The 8 metrics

Three categories. Two or three metrics each. Each one is something the marketing leader can act on next quarter — not something that just looks good on a slide.

Category 1 — Visibility (where you stand)

Metric 01

Share of AI Voice (overall)

Across a representative prompt set (50–150 commercial queries about your category), what percentage of AI engine responses mention your brand? That percentage is your Share of AI Voice (SoAIV).

This is the headline number. Every other metric drills into it.

How to gather: Run your prompt set 3 times each across ChatGPT, Claude, Gemini, Perplexity, and AI Overviews. Parse responses for brand mentions. Compute (your brand mentions ÷ total responses) × 100.

What to compare against: Your prior quarter, your top 3 competitors, your aspirational benchmark (the category leader's number).

What it tells you: Whether the work you're doing is moving the needle at the highest level. Trending up? Keep going. Flat? Diagnose where the gap is. Trending down? Urgent.
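
If you're scripting this yourself, the core calculation is simple. A minimal sketch in Python, assuming each logged response is a dict with a "text" field and that BRAND_ALIASES holds your brand's names — the field name and aliases are illustrative, not a required schema:

```python
import re

BRAND_ALIASES = ["Acme", "Acme Analytics"]  # hypothetical brand names


def mentions_brand(text: str, aliases: list[str]) -> bool:
    """True if any brand alias appears as a whole word in the response text."""
    return any(re.search(rf"\b{re.escape(a)}\b", text, re.IGNORECASE) for a in aliases)


def share_of_ai_voice(responses: list[dict], aliases: list[str]) -> float:
    """(responses mentioning the brand / total responses) * 100."""
    if not responses:
        return 0.0
    hits = sum(mentions_brand(r["text"], aliases) for r in responses)
    return 100 * hits / len(responses)
```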

Metric 02

Share of AI Voice by funnel stage

SoAIV broken down across awareness, consideration, and decision queries. Most brands have wildly uneven performance across stages.

How to gather: Tag each prompt in your prompt set as awareness ("what is X?"), consideration ("how do I choose X?"), or decision ("best X for Y"). Compute SoAIV per tag.

What it tells you: Where the gap is concentrated. "Strong awareness, weak decision" is the most common pattern — the AI knows your category but doesn't recommend you specifically when buyers ask. The fix is different in each case (more category-defining content for awareness; more comparison and decision content for decision).
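
A small extension of the SoAIV sketch above, assuming each logged response carries the funnel-stage tag of the prompt that produced it (the "stage" field name is illustrative):

```python
from collections import defaultdict


def soaiv_by_stage(responses: list[dict], aliases: list[str]) -> dict[str, float]:
    """SoAIV computed separately for awareness, consideration, and decision prompts."""
    by_stage = defaultdict(list)
    for r in responses:
        by_stage[r["stage"]].append(r)  # "awareness" | "consideration" | "decision"
    return {stage: share_of_ai_voice(group, aliases) for stage, group in by_stage.items()}
```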

Metric 03

Citation rate

The percentage of AI engine responses that cite your domain as a source (with a clickable link).

Distinct from SoAIV: your brand can be mentioned without your URL being cited (the AI knows about you from training data, not from a retrieved page). Citation rate measures direct referral potential.

How to gather: When parsing responses, log not just brand mentions but also which URLs are cited as sources. Compute (responses citing your domain ÷ total responses) × 100.

What it tells you: How well your content performs in live retrieval. High SoAIV with low citation rate = strong training-data presence, weak retrieval game. Low SoAIV with high citation rate (rare, but it happens) = the opposite. Each calls for a different remedy.
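
A sketch of the calculation, assuming cited source URLs have already been extracted into a "citations" list on each logged response (the field name and domain are illustrative):

```python
from urllib.parse import urlparse


def citation_rate(responses: list[dict], domain: str = "example.com") -> float:
    """Percentage of responses that cite at least one URL on your domain."""
    if not responses:
        return 0.0

    def cites_domain(r: dict) -> bool:
        return any(urlparse(u).netloc.endswith(domain) for u in r.get("citations", []))

    return 100 * sum(cites_domain(r) for r in responses) / len(responses)
```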

Category 2 — Velocity (where you're going)

Metric 04

Quarter-over-quarter SoAIV change

The single most important velocity metric. Are you gaining, losing, or flat?

How to gather: Last quarter's SoAIV vs this quarter's, on the same prompt set (swapping in new prompts each quarter destroys comparability — keep a stable core and refresh selectively).

What it tells you: The direction of momentum. If you're flat or down despite spending budget, the budget is going to the wrong things. If you're up, you can keep doing more of what's working.

Common watch-out: Variance is real. A 1–2 point quarterly move can be noise. A 5+ point move is usually signal. Watch for sustained directional change across 2–3 quarters before declaring a trend.
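
If you want that watch-out baked into the reporting, a tiny sketch that applies the 1–2 point / 5-point rule of thumb above (the thresholds are the heuristic just described, not a statistical test):

```python
def qoq_read(prev_soaiv: float, curr_soaiv: float) -> tuple[float, str]:
    """Label a quarter-over-quarter SoAIV move as likely noise, likely signal, or 'watch'."""
    delta = curr_soaiv - prev_soaiv
    if abs(delta) <= 2:
        label = "likely noise"
    elif abs(delta) >= 5:
        label = "likely signal"
    else:
        label = "watch across 2-3 quarters before calling it a trend"
    return delta, label
```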

Metric 05

Editorial volume (high-authority mentions)

The number of mentions your brand earned in high-authority editorial publications during the quarter.

This is the leading indicator. Editorial coverage earned this quarter shows up in AI training data 6–12 months later (and in live retrieval immediately).

How to gather: Use a media monitoring tool (Mention, Meltwater, or even Google Alerts on a lower budget) to count brand mentions. Filter to "high-authority" using your own definition (mainstream press + reputable trade publications, typically). Exclude press release wires and republished content.

What it tells you: Whether your Digital PR motion is producing the volume needed to shift AI presence. The threshold for "enough" is roughly 30–60 high-authority mentions per year for sustained AI presence growth in most B2B categories.
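
A sketch of the counting step, assuming a media-monitoring export where each mention row carries the publishing domain; the allow-list and wire-exclusion list are placeholders for your own definition of "high authority":

```python
HIGH_AUTHORITY = {"ft.com", "theguardian.com", "techcrunch.com"}  # illustrative
EXCLUDED_WIRES = {"prnewswire.com", "businesswire.com"}           # illustrative


def editorial_volume(mentions: list[dict]) -> int:
    """Count of high-authority editorial mentions, excluding press-release wires."""
    return sum(
        1
        for m in mentions
        if m["domain"] in HIGH_AUTHORITY and m["domain"] not in EXCLUDED_WIRES
    )
```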

Category 3 — Competitive position (where everyone else is)

Metric 06

Competitive Share of AI Voice

Your top 3 competitors' SoAIV alongside your own.

You can be improving while still losing — if your competitors are improving faster.

How to gather: Same prompt set, but parse responses for your competitors' brands too. Track their SoAIV across the same period.

What it tells you: Whether you're winning or losing the relative game. Your absolute SoAIV could rise while your share of category mentions falls — which means competitors are taking even more share.

Read carefully: the shares won't sum to 100% (an AI response can mention multiple brands). But the ranking (you 4th, competitor A 1st, competitor B 2nd, etc.) is meaningful.
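
The same logged responses can simply be re-scored for each competitor. A sketch, reusing share_of_ai_voice() from the first metric; the alias map is hypothetical:

```python
ALIAS_MAP = {
    "us": ["Acme", "Acme Analytics"],  # hypothetical brands throughout
    "competitor_a": ["Globex"],
    "competitor_b": ["Initech"],
    "competitor_c": ["Umbrella"],
}


def competitive_soaiv(responses: list[dict], alias_map: dict[str, list[str]]) -> dict[str, float]:
    """SoAIV per brand across the same prompt set and quarter (shares won't sum to 100%)."""
    return {brand: share_of_ai_voice(responses, aliases) for brand, aliases in alias_map.items()}
```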

Metric 07

Sentiment frame distribution

When AI engines mention your brand, how do they describe you? Positive, neutral, negative? And what positioning frame do they apply?

How to gather: Categorise each response that mentions your brand by sentiment (positive/neutral/negative) and frame ("enterprise leader," "cheaper alternative," "for beginners," etc.). Compute distributions.

What it tells you: Whether being mentioned is helping or hurting. Negative-framed mentions are worse than no mention. A consistent positioning frame ("the X for Y") is a strong sign your brand narrative is unified across sources. Inconsistent or muddy frames mean your positioning is incoherent at the source level.

What to do with it: If sentiment is dominantly positive, keep doing what's working. If negative or mixed, audit what sources the AI is drawing from and address the underlying narrative — not the AI itself.
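
Once each brand-mentioning response has a sentiment and frame label (whether from manual review or a model-assisted pass; the labelling method is yours to choose), the distribution itself is trivial to compute. A sketch:

```python
from collections import Counter


def label_distribution(labelled_responses: list[dict], key: str) -> dict[str, float]:
    """Percentage breakdown of a label ('sentiment' or 'frame') across responses."""
    counts = Counter(r[key] for r in labelled_responses)
    total = sum(counts.values()) or 1
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# e.g. label_distribution(brand_mentions, "sentiment") -> {"positive": 64.0, "neutral": 31.0, "negative": 5.0}
```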

Metric 08

Engine-level variance

How does your SoAIV differ across ChatGPT vs Claude vs Gemini vs Perplexity?

How to gather: Already gathered if you're running the prompt set across multiple engines. Just compare the SoAIV-per-engine numbers.

What it tells you: Where to focus. Strong on ChatGPT but absent from Perplexity? Your training-data presence is good but your live-retrieval game is weak. Strong on Perplexity but absent from ChatGPT? Opposite — your retrieval-side content is strong but you're not in the AI's baseline knowledge yet.

Strategic context: Different engines matter differently for different audiences. Enterprise B2B tends to over-index on Claude. Consumer B2C over-indexes on ChatGPT and AI Overviews. Research-heavy buyers over-index on Perplexity. Weight which engines you optimise for based on which engines your buyers actually use.
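
The per-engine split falls out of the same data. A short sketch, again reusing share_of_ai_voice():

```python
def soaiv_by_engine(responses: list[dict], aliases: list[str]) -> dict[str, float]:
    """SoAIV computed separately for each engine present in the logged responses."""
    engines = {r["engine"] for r in responses}
    return {
        engine: share_of_ai_voice([r for r in responses if r["engine"] == engine], aliases)
        for engine in engines
    }
```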

Reading the scorecard

Eight metrics is a lot for a quarterly review. The discipline that makes the scorecard useful: each metric maps to a specific decision. If a number doesn't drive a decision, drop it.

Metric, and the decision it drives:

SoAIV (overall): Are we winning or losing the channel? Affects total GEO budget allocation.
SoAIV by funnel stage: Where to invest content — top, middle, or bottom of funnel?
Citation rate: How much to invest in your own site's content vs external editorial work.
QoQ SoAIV change: Continue current strategy or pivot? Direction of momentum.
Editorial volume: Whether to scale Digital PR investment up or down.
Competitive SoAIV: How much faster you need to move to overtake or hold position.
Sentiment frame: Whether to invest in PR offence (more mentions) or PR defence (fix narrative).
Engine-level variance: Which engine to prioritise next quarter, given your audience.

Setting up the scorecard the first time

A practical sequence for a team setting this up from scratch:

Step 1: Build the prompt set (week 1)

Aim for 50–150 commercial queries spanning your category. Tag each by funnel stage. Get the set agreed by enough stakeholders that you won't change it next quarter — comparability over time matters more than perfect prompt design.
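
A simple way to keep the set stable is to version it as data with fixed IDs, so new prompts can be appended later without disturbing the comparable core. A sketch (the prompts and category are invented):

```python
PROMPT_SET = [
    {"id": "P001", "stage": "awareness",     "prompt": "What is marketing attribution software?"},
    {"id": "P002", "stage": "consideration", "prompt": "How do I choose an attribution tool for B2B?"},
    {"id": "P003", "stage": "decision",      "prompt": "Best attribution software for a small SaaS team"},
    # ...50-150 prompts in total, agreed with stakeholders before the baseline run
]
```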

Step 2: Choose your engines (week 1)

ChatGPT and Perplexity are mandatory. Claude and Gemini are recommended unless you have strong reason to skip them. AI Overviews if your buyers Google heavily. Lock the engine list and don't change it.

Step 3: Run the baseline (week 2)

Execute the prompt set across all chosen engines. Three runs each. Log raw responses; don't parse yet.
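
A sketch of the run loop. query_engine() is a deliberate placeholder for whatever you actually use per engine (official APIs, a monitoring tool, or manual capture); it is not a real library call:

```python
import datetime
import json
import time

ENGINES = ["chatgpt", "claude", "gemini", "perplexity", "ai_overviews"]
RUNS_PER_PROMPT = 3


def query_engine(engine: str, prompt: str) -> str:
    raise NotImplementedError("plug in your own client or capture process per engine")


def run_baseline(prompt_set: list[dict], path: str = "baseline_q1.jsonl") -> None:
    """Run every prompt RUNS_PER_PROMPT times on every engine and log raw responses."""
    with open(path, "a") as f:
        for p in prompt_set:
            for engine in ENGINES:
                for run in range(RUNS_PER_PROMPT):
                    record = {
                        "prompt_id": p["id"],
                        "stage": p["stage"],
                        "engine": engine,
                        "run": run,
                        "text": query_engine(engine, p["prompt"]),
                        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                    }
                    f.write(json.dumps(record) + "\n")
                    time.sleep(1)  # crude rate limiting
```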

Step 4: Build the parser (week 3)

Build (or buy) a tool that takes raw responses and produces brand-mention counts, citation lists, sentiment classifications, and frame categorisations. This is the part most teams underestimate. Hand-parsing 1,500 responses is grim.
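
A first-pass sketch of that parser over the baseline log, reusing mentions_brand() from the SoAIV sketch; the URL extraction is a crude regex, and sentiment/frame labelling is left for a later pass:

```python
import json
import re

URL_RE = re.compile(r"https?://\S+")  # crude: trailing punctuation is not stripped


def parse_responses(path: str, aliases: list[str]) -> list[dict]:
    """Enrich each logged response with cited URLs and a brand-mention flag."""
    parsed = []
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            r["citations"] = URL_RE.findall(r["text"])
            r["mentions_us"] = mentions_brand(r["text"], aliases)
            parsed.append(r)
    return parsed
```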

Step 5: Compute the eight metrics (week 4)

Run the parser. Generate the scorecard. Share with stakeholders. Establish baseline.
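
Tying the earlier sketches together, the quarter's numbers can be assembled into one structure that maps directly onto the scorecard (function and field names are from the illustrative sketches above):

```python
parsed = parse_responses("baseline_q1.jsonl", BRAND_ALIASES)

scorecard = {
    "soaiv_overall":   share_of_ai_voice(parsed, BRAND_ALIASES),
    "soaiv_by_stage":  soaiv_by_stage(parsed, BRAND_ALIASES),
    "citation_rate":   citation_rate(parsed, "yourdomain.com"),   # your own domain here
    "soaiv_by_engine": soaiv_by_engine(parsed, BRAND_ALIASES),
    "competitive":     competitive_soaiv(parsed, ALIAS_MAP),
}
# Editorial volume, QoQ change, and sentiment/frame distributions come from the
# media-monitoring export, last quarter's scorecard, and the labelling pass respectively.
```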

Step 6: Lock the methodology (end of quarter 1)

Document everything: which prompts, which engines, how runs are scheduled, how mentions are counted, what counts as "high authority" for editorial volume, etc. Hand this to whoever inherits the work. Without methodology lock, quarterly comparisons become impossible.

Step 7: Re-run quarterly (every quarter after)

Same prompts, same engines, same parser. Generate the scorecard. Compare to prior quarter. Make decisions.

Common scorecard mistakes

Tracking too many metrics

If your scorecard has 20+ metrics, nobody will use it. Eight is enough to surface the important patterns without drowning the reader. Resist the temptation to add more.

Changing prompts every quarter

The single biggest enemy of quarterly comparability. If your prompt set this quarter is different from last quarter, you can't tell whether your SoAIV moved because of your work or because of the prompt change. Lock the core prompt set for at least 4 quarters before refreshing — and refresh by adding new prompts alongside, not replacing old ones.

Including only your own content's performance

Some teams measure GEO by "how many of our pages are cited by AI?" That's one metric, not the whole picture. The more important question is whether your brand appears in AI answers, regardless of whether your URLs are cited. The two are different.

Reading single-quarter numbers as trends

One quarter of data is a snapshot, not a trend. Wait for 2–3 quarters of consistent direction before declaring something is "working" or "not working." Variance in AI engine responses is real.

No competitive context

Tracking your own SoAIV in isolation can produce false confidence. "We grew from 12% to 14%" feels good. "Our top competitor grew from 30% to 40% in the same period" reframes it. Competitive context is non-negotiable.

Skipping sentiment

Volume metrics without sentiment metrics give you half the picture. Brands have been blindsided by SoAIV growth that turned out to be growth in negative mentions. Always check sentiment alongside volume.

The version that fits on one slide

Most CMOs want a one-slide quarterly summary. The version that works:

Q3 2026 GEO scorecard (template)

Visibility: SoAIV 18.4% (vs 14.2% Q2, +4.2 pts). Funnel-stage breakdown: awareness 28%, consideration 16%, decision 9%. Citation rate 11.2%.

Velocity: +4.2 pts QoQ. 47 high-authority editorial mentions (vs 32 Q2).

Competitive: Ranked 3rd in category SoAIV. Competitor A 31%, B 24%, us 18%, C 14%. Gap to leader narrowing (was 17 pts, now 13).

Sentiment: 64% positive, 31% neutral, 5% negative. Dominant frame: "specialist for X use case" (good — matches positioning).

Engine variance: Strong on ChatGPT (24%), weak on Perplexity (8%). Action: prioritise live-retrieval content next quarter.

That's the whole scorecard. Five lines. Drives a budget conversation. Drives content prioritisation. Drives PR investment decisions.

The discipline of measurement isn't producing impressive dashboards — it's producing the smallest set of numbers that change behaviour. The scorecard above does that. The fancier you make it, the less likely it is to actually drive decisions.

How often the scorecard surfaces something surprising

In our experience working with brands across categories, every quarterly scorecard surfaces at least one surprise.

This is the value of measurement: surfacing what intuition doesn't see. The brands that do this consistently end up with calibrated intuition over time. The brands that don't end up confidently allocating budget to things that aren't working.

Which is the entire point of having a scorecard in the first place.

Don't build it from scratch

Get a Search Visibility Audit.

We run this scorecard for you — comprehensive Share of AI Voice across ChatGPT, Claude, Gemini, Perplexity, and AI Overviews — with your competitive set benchmarked alongside. Quarterly cadence available via Managed retainer. From $997 for the first run.