Modern LLMs answer prompts in two distinct modes — from training data, or from live web retrieval. When end users open ChatGPT, Claude, Gemini, or Perplexity, they almost always get the second mode. When most GEO audit tools query the same engines via API, they almost always get the first. The audit then measures something different from what the user actually sees. Below: how to spot the issue, what each engine's grounding parameter actually does, and the five questions that separate methodology from marketing.
The single variable that produces 30-point swings
Run an experiment. Take a brand you know to be active in its category, and run a GEO audit through any consumer-facing AI visibility tool. Note the share of voice in ChatGPT.
Now open ChatGPT directly, type one of the same prompts, and watch what happens. The model will visibly browse — you'll see "Searching the web…" appear before the response comes back. The answer will reference current articles, recent reviews, today's pricing pages.
That visible search is the foundational accuracy variable in GEO measurement, and it has a clean technical name: web search grounding. To get an unbiased measurement of how a brand surfaces in AI answers as users actually experience them, every prompt must be sent to the API with grounding (or web search, depending on the engine) explicitly enabled. The moment a measurement is collected without grounding, it captures what the model remembers — which can be 6, 12, or 18 months out of date — instead of what the model actually shows users today.
This isn't a quirk of any particular tool. It's a property of the major LLM APIs themselves. Every one of them defaults to off. Web search must be opted into, every time, on every call.
What each engine does (and how to verify it)
Each major engine has its own way of enabling web search via API. If you're auditing the audit, these are the parameter names worth knowing.
ChatGPT (OpenAI)
OpenAI ships specific search-enabled model variants on the Chat Completions API: gpt-4o-search-preview, gpt-4o-mini-search-preview, and gpt-5-search-api. Enabling search requires both selecting one of these models and passing web_search_options: {} in the request. Without it, the model answers from training data only — which for niche or new categories means hallucination, evasion, or "the category is still emerging" boilerplate. We've seen audits where ChatGPT scored 0% for clearly active brands because the model was queried without search enabled.
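For reference, here's a minimal sketch of what a grounded ChatGPT call can look like with the official Python SDK, assuming the parameters named above; the model choice and prompt are illustrative placeholders.

```python
# Minimal sketch: a grounded ChatGPT query via the Chat Completions API.
# Assumes the official openai Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-search-preview",  # a search-capable variant, not the base model
    web_search_options={},          # per the above, both the model and this field are needed
    messages=[
        {"role": "user", "content": "What are the leading AI visibility audit tools?"}
    ],
)

print(response.choices[0].message.content)
```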
Claude (Anthropic)
Claude's web search is a server-side tool, GA on the Anthropic Messages API since April 2026. It's enabled by passing a tool definition ({"type": "web_search_20250305", "name": "web_search"}) to the request. Claude then decides whether to invoke it per-prompt, can run multiple searches in series, and returns the actual queries it generated as part of the response — useful for auditing the audit. Without the tool, Claude answers from training data with more humility than ChatGPT, but with the same fundamental limitation. Pricing is $10 per 1,000 searches on top of standard token costs.
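A comparable sketch for the Anthropic Messages API, again assuming the official Python SDK; the model ID is a placeholder, and the tool definition is the one quoted above.

```python
# Minimal sketch: enabling Anthropic's server-side web search tool on the Messages API.
# Assumes the official anthropic Python SDK and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search"}],  # the grounding toggle
    messages=[
        {"role": "user", "content": "What are the leading AI visibility audit tools?"}
    ],
)

# The response interleaves text blocks with server_tool_use blocks; the latter
# carry the actual search queries Claude generated, which is useful for auditing the audit.
for block in response.content:
    if block.type == "server_tool_use" and block.name == "web_search":
        print("Claude searched for:", block.input)
    elif block.type == "text":
        print(block.text)
```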
Gemini (Google)
Gemini calls it grounding with Google Search. Enabled by passing tools: [{googleSearch: {}}] to the generateContent endpoint. The response includes a groundingMetadata.webSearchQueries field showing the searches the model issued. Pricing is metered per grounded query (currently $35 per 1,000 grounded queries on the paid tier). Without grounding, Gemini answers from a training cutoff and will openly tell you "this is a rapidly evolving area" when the answer requires current data.
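The Gemini version, sketched with the google-genai Python SDK; the model name is a placeholder, and the fan-out queries come back in the grounding metadata field named above.

```python
# Minimal sketch: grounding a Gemini call with Google Search via the google-genai SDK.
# Assumes GEMINI_API_KEY in the environment; the model name is a placeholder.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What are the leading AI visibility audit tools?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # the grounding toggle
    ),
)

print(response.text)

# webSearchQueries in the grounding metadata lists the searches Gemini actually issued.
metadata = response.candidates[0].grounding_metadata
if metadata and metadata.web_search_queries:
    print("Fan-out queries:", metadata.web_search_queries)
```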
Perplexity
Perplexity has search built in by design — there's no toggle. Every API response includes citations natively. This is why Perplexity tends to score consistently across audit tools regardless of methodology. It's the only major engine where the grounding question doesn't apply.
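Because Perplexity's API is OpenAI-compatible, a grounded call needs no extra parameter at all; a minimal sketch, assuming the openai SDK pointed at Perplexity's endpoint and its general-purpose sonar model.

```python
# Minimal sketch: Perplexity searches on every call, so there is no grounding flag to pass.
# Assumes the openai SDK pointed at Perplexity's endpoint; replace the key placeholder.
from openai import OpenAI

client = OpenAI(
    api_key="<PERPLEXITY_API_KEY>",
    base_url="https://api.perplexity.ai",
)

response = client.chat.completions.create(
    model="sonar",
    messages=[
        {"role": "user", "content": "What are the leading AI visibility audit tools?"}
    ],
)

print(response.choices[0].message.content)
# Source citations come back alongside the message with no opt-in required.
```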
AI Overviews (Google)
AI Overviews don't have a direct API. They appear in Google SERPs and have to be scraped from a real search query. This means AIO results in any audit are necessarily "live" — there's no training-data fallback to worry about. The risk here is different: scrapes have to handle Google's asynchronously loaded AIO blocks, which not every provider does correctly.
A worked example: same brand, two methodologies
To make the magnitude of this concrete, here's a real audit run on an anonymised B2B SaaS brand in an emerging category. Same brand. Same prompts. Same competitors. The only variable changed is whether web search was enabled.
Each engine ran 45 prompts × 5 runs = 225 total queries. The first table is what an audit returns without web search enabled — the configuration most tools default to. The second is what it returns with web search enabled on every call.
Without web search enabled
| Engine | Brand SOV | What the audit captured |
|---|---|---|
| ChatGPT | 0% | Hallucinated unrelated tools (Datadog, Arize AI, etc.) |
| Claude | 12% | Knew some brands from training data, missed others |
| Gemini | 0% | "The category is still relatively new and evolving" |
| Perplexity | 97% | Search built in — unaffected |
| AI Overviews | 73% | Scraped live — unaffected |
| Headline SOV | 33.7% | The audit "verdict" |
With web search enabled on every engine
| Engine | Brand SOV | What the audit captured |
|---|---|---|
| ChatGPT | 78% | Cited brand by name, with sources |
| Claude | 76% | Searched the web, returned current data |
| Gemini | 72% | Grounded in real Google results |
| Perplexity | 97% | Unchanged — already grounded |
| AI Overviews | 78% | Unchanged — already live |
| Headline SOV | 78.3% | The audit "verdict" |
The brand didn't change. The prompts didn't change. The competitors didn't change. The methodology changed.
A 33.7% audit and a 78.3% audit lead to two completely different conversations with a client. One of those conversations is wrong. And in this category — emerging, niche, recently launched — the brand is genuinely visible to the AI engines today. It just isn't visible to a model that hasn't been retrained since before the brand existed.
Beyond grounding: the second methodology variable
Even when web search is properly enabled, there's a second methodology question that almost no GEO tool surfaces today. What did the AI actually search for?
When you type "what's the best AI visibility tool" into Gemini, Gemini doesn't just type that into Google. It silently decomposes the prompt into multiple sub-queries — "AI visibility tracker reviews 2026", "answer engine optimisation tools comparison", "GEO audit platform pricing" — runs each one separately, reads dozens of results, and synthesises an answer.
This is called query fan-out or query decomposition, and it's where most of the actionable signal in a GEO audit lives. Why? Because the fan-out queries reveal:
- The keywords the AI actually uses to find category leaders — which may not match the keywords you're targeting
- The competitive framings the AI assumes — "Brand A vs Brand B" decompositions tell you which competitors the AI clusters you with
- Specific content gaps — if the AI fan-outs to a query you're not ranking for, that's a discrete content recommendation
Most GEO audits show you the final AI answer. They don't show the reasoning trail that produced it. That's a missed opportunity, because the reasoning trail is where the action items live. Capturing fan-outs requires querying APIs that expose them — Gemini's groundingMetadata.webSearchQueries, Claude's server_tool_use blocks — and parsing them out. Not difficult, but not turned on by default in most pipelines.
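As a rough sketch of what that parsing can look like, assuming response objects obtained as in the earlier snippets (exact attribute shapes vary a little between SDK versions):

```python
# Minimal sketch: extracting fan-out queries from grounded Gemini and Claude responses.
# Assumes `gemini_response` and `claude_response` were obtained as in the earlier snippets.

def gemini_fan_out(gemini_response) -> list[str]:
    """Fan-out queries Gemini issued, read from groundingMetadata.webSearchQueries."""
    metadata = gemini_response.candidates[0].grounding_metadata
    if not metadata or not metadata.web_search_queries:
        return []
    return list(metadata.web_search_queries)


def claude_fan_out(claude_response) -> list[str]:
    """Fan-out queries Claude issued, read from its server_tool_use content blocks."""
    queries = []
    for block in claude_response.content:
        if block.type == "server_tool_use" and block.name == "web_search":
            # For the web search tool, the block's input carries the generated query.
            queries.append(block.input.get("query", ""))
    return queries
```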
Why this matters for content strategy
If Gemini fan-outs to "AI brand monitoring tools 2026" while researching your category, and your client isn't ranking for that exact query, that's a content gap you can close this week. Without the fan-out data, you'd never know it was a query worth targeting. Fan-out capture turns a static visibility report into a content-roadmap input.
Five questions to ask any GEO audit provider
If you're considering a GEO audit of any kind (SaaS dashboard, one-off audit, monitoring service), these are the practical evaluation criteria that separate substantive methodology from marketing copy.
- For each engine, is web search or grounding explicitly enabled in the API call? Ask for the parameter names: web_search_options, googleSearch, web_search_20250305. If the provider can't tell you, the answer is probably no.
- Which models are being queried, and are those models web-search-capable? Some models, such as gpt-4.1-mini on the standard Chat Completions API, don't support web search at all. Querying them is no different from querying without search.
- Are LLM responses retrieved live, or pulled from a cached database? Some tools query an LLM once, store the response, and serve cached results to multiple customers. That's faster and cheaper for the provider, but the data is stale by the time you see it.
- Are fan-out queries surfaced anywhere in the report? If not, you're getting the conclusion without the reasoning. Useful, but incomplete.
- Is every prompt in the audit logged and traceable? Reproducibility is the bar. If a client questions a finding, you need to be able to point at the exact prompt, exact engine, exact response.
If any of those answers are "we don't know," "we don't expose that," or "we use a third-party data source" — keep asking. Methodology transparency is a fair thing to expect from anyone selling decisions to your clients.
What "Am I Visible?" does differently
The audit tool we built — Am I Visible?, run by Marketing Signals — was designed around exactly these questions, after we ran into discrepancies in our own client work that we couldn't explain.
The methodology in plain English:
- All five engines are queried live. ChatGPT, Claude, Gemini, and Perplexity go through their native APIs with web search and grounding enabled on every prompt. AI Overviews are scraped from a fresh Google SERP fetched at audit time.
- No middleware between us and the model. Four of five engines are queried via direct API. No third-party rerouting, no shared caches, no rate-limit bottlenecks introducing artefacts.
- Fan-out queries are captured per prompt and surfaced as a Query Decomposition section in every report, including a side-by-side worked example comparing how each engine decomposed the same query, and a Keyword Opportunities table cross-referencing fan-outs against the brand's current organic rankings.
- Every prompt is logged. The audit JSON contains the full request and response for every prompt × engine × run combination, so any number in the report is traceable to its source data.
- Methodology is documented. The audit report includes a methodology appendix that names the model versions, parameter values, and pricing per engine. No "secret sauce."
That's not the only way to build a GEO audit. But it's the methodology we'd want a vendor to use on us.
The takeaway
GEO is a young enough discipline that "industry standard methodology" doesn't really exist yet. Different tools are making different choices, and the choices materially affect the numbers your clients see. The 30-point gaps between audits aren't measurement noise — they're methodology gaps that anyone can spot once they know to look.
If you only take one thing from this piece, take this: ask the methodology question. Of yourself, of your vendors, of anyone selling you "AI visibility" data. The right question isn't "what's my share of voice?" It's "what's the methodology behind the share-of-voice number?"
Get that one right and the rest follows.
See your real numbers.
Five engines queried live. Web search enabled on every prompt. Query Decomposition surfaced per engine. Full methodology disclosed in the report. From $997, delivered on demand.