GCP

Vertex AI Gemini 3.x: agent billing, token costs, and Cloud Run GPU patterns

Gemini 3.x on Vertex AI is billed by input and output tokens; agent orchestrations can generate multiple billable events. Track tokens, retrieval, and compute.

June 16, 2026·3 min read·AI researched · AI written · AI reviewed

A single Vertex AI "agent" call is no longer one billable API hit with a predictable response size. It is a choreography that can incur model token charges, retrieval/search costs, session/memory storage, and multiple downstream LLM calls. That's the operational surprise rolling through community threads this week: clarified billing guidance for Gemini 3.x and the Agent Platform makes the multiplicative cost model explicit, and teams that treat agents like a single black-box call are going to end up writing unpleasant postmortems.

Google's recent AI announcements reiterate the move toward the Gemini Enterprise Agent Platform on Vertex AI and update model availability to Gemini 3.x. The important clarification — reiterated across pricing guides and community posts — is that Vertex AI bills Gemini models by input and output tokens (check Google's pricing page for per-unit rates), and agent-style orchestrations can fan out. A single user prompt can trigger: an initial model invocation, one or more retrieval/search calls, memory/session syntheses, and downstream tool calls that each produce their own token accounting. Older model variants lingering in templates only make auditing harder.

Where costs multiply

This is where most teams trip up: they think in terms of "requests", but Google charges for tokens and services. Consider three shortcuts that look cheap but become expensive at scale:

  • Retrieval loops that re-send and re-tokenize long context windows on every agent step, instead of caching embeddings and sending only the relevant passages, incur repeated input token charges.
  • Agents that fan out to multiple models (for example, a small planning model plus a larger reasoning model) multiply token costs per user interaction, because every model call is billed for its own input and output.
  • Using managed search or external tool calls inside an agent (vector search, web search) adds non-model billable events that accompany token billing.

None of that is new, but the recent clarifications make it explicit: one conversational turn can map to several billable events. That means cost visibility needs to be at the same granularity as your agent orchestration.
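To make the fan-out concrete, here is a minimal sketch that models one conversational turn as a list of billable events and sums them. The rates, the `BillableEvent` type, and the step breakdown are all hypothetical placeholders for illustration, not Google's actual prices or billing schema; take real per-unit rates from Google's pricing page.

```python
from dataclasses import dataclass

# Hypothetical placeholder rates in USD. These are NOT real Vertex AI prices.
INPUT_RATE_PER_1K = 0.001       # per 1,000 input tokens (assumed)
OUTPUT_RATE_PER_1K = 0.004      # per 1,000 output tokens (assumed)
RETRIEVAL_RATE_PER_CALL = 0.0005  # per managed search call (assumed)


@dataclass
class BillableEvent:
    """One billable step inside an agent turn (illustrative type)."""
    kind: str               # "model" or "retrieval"
    input_tokens: int = 0
    output_tokens: int = 0


def event_cost(event: BillableEvent) -> float:
    """Cost of a single event under the placeholder rates above."""
    if event.kind == "retrieval":
        return RETRIEVAL_RATE_PER_CALL
    return (event.input_tokens / 1000) * INPUT_RATE_PER_1K + (
        event.output_tokens / 1000
    ) * OUTPUT_RATE_PER_1K


def turn_cost(events: list[BillableEvent]) -> float:
    """Total cost of one conversational turn: the sum over all its events."""
    return sum(event_cost(e) for e in events)


# One user prompt that fans out into five billable events.
turn = [
    BillableEvent("model", input_tokens=2000, output_tokens=200),  # planning call
    BillableEvent("retrieval"),                                    # vector search
    BillableEvent("retrieval"),                                    # web search
    BillableEvent("model", input_tokens=6000, output_tokens=800),  # reasoning call
    BillableEvent("model", input_tokens=1500, output_tokens=300),  # memory summary
]

if __name__ == "__main__":
    print(f"events in turn: {len(turn)}")
    print(f"estimated turn cost (placeholder rates): ${turn_cost(turn):.4f}")
```

The absolute number matters less than the shape: the cost of a turn is a sum over steps, so it grows with the number of agent steps rather than with the number of user requests.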

What to do now — practical patterns that actually work

Treat tokens as first-class telemetry. Track input and output tokens per request, correlate them with cost, and surface a per-agent cost metric in your platform dashboards. Separate static from dynamic context: push stable knowledge into embeddings or a dedicated memory store, and avoid re-sending long static context to Gemini on every call. When you can, do retrieval and embedding lookups off-model (vector search) and send only the minimal context to Gemini. Where low-latency local inference matters, run smaller models on your own compute and avoid repeated remote token charges.
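As a rough illustration of per-agent token telemetry, the sketch below accumulates input and output token counts per agent and derives a cost figure from them. The `TokenLedger` class, the agent name, and the rates are hypothetical; in practice the counts would come from the usage metadata returned with each model response, and the totals would be exported to whatever metrics system your dashboards already use.

```python
from collections import defaultdict


class TokenLedger:
    """In-memory per-agent token accounting (illustrative, not a real SDK class)."""

    def __init__(self) -> None:
        self._totals = defaultdict(lambda: {"input": 0, "output": 0, "calls": 0})

    def record(self, agent: str, input_tokens: int, output_tokens: int) -> None:
        """Add the token counts of one model call to the agent's running totals."""
        entry = self._totals[agent]
        entry["input"] += input_tokens
        entry["output"] += output_tokens
        entry["calls"] += 1

    def totals(self, agent: str) -> dict:
        """Return a copy of the agent's totals (zeros if nothing was recorded)."""
        return dict(self._totals[agent])

    def cost(self, agent: str, input_rate_per_1k: float, output_rate_per_1k: float) -> float:
        """Estimated cost for the agent, given caller-supplied per-1K-token rates."""
        t = self._totals[agent]
        return (t["input"] / 1000) * input_rate_per_1k + (
            t["output"] / 1000
        ) * output_rate_per_1k


ledger = TokenLedger()
# Two model calls made by the same (hypothetical) agent during one turn.
ledger.record("support-agent", input_tokens=1200, output_tokens=150)
ledger.record("support-agent", input_tokens=3000, output_tokens=400)

if __name__ == "__main__":
    print(ledger.totals("support-agent"))
    # Rates below are placeholders, not real prices.
    print(ledger.cost("support-agent", input_rate_per_1k=0.001, output_rate_per_1k=0.004))
```

Keying the ledger by agent rather than by HTTP request is the design choice that matters here: it keeps cost visibility at the same granularity as the orchestration.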

Cloud Run remains a recommended stateless execution plane for inference patterns; where available, consider GPU-backed Cloud Run options and tune concurrency for throughput. Use containerized inference (Cloud Run or GKE) for batching and serving predictable, high-throughput models, and pair that with Vertex AI for managed LLM calls when you need the latest Gemini family models. If your architecture needs both, isolate the two billing domains: model token events on Vertex AI and compute/GPU costs on your container plane.
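One reason to keep the two billing domains separate is that they scale differently. Compute cost on a container plane amortizes across concurrent requests, while token cost does not. The back-of-the-envelope sketch below assumes a fully utilized instance and uses Little's law (throughput = concurrency / latency); the function name and the hourly rate are hypothetical and not taken from any Cloud Run pricing.

```python
def compute_cost_per_request(
    instance_cost_per_hour: float,
    concurrency: int,
    avg_latency_s: float,
) -> float:
    """Amortized compute cost per request for one fully utilized instance.

    Assumes the instance is always serving `concurrency` requests in parallel,
    each taking `avg_latency_s` seconds. Idle time makes the real cost higher.
    """
    if instance_cost_per_hour < 0 or concurrency <= 0 or avg_latency_s <= 0:
        raise ValueError("cost must be non-negative; concurrency and latency must be positive")
    requests_per_hour = (concurrency / avg_latency_s) * 3600
    return instance_cost_per_hour / requests_per_hour


if __name__ == "__main__":
    hourly_rate = 1.0  # hypothetical GPU instance price in USD per hour
    for concurrency in (1, 4, 8):
        cost = compute_cost_per_request(hourly_rate, concurrency, avg_latency_s=0.5)
        print(f"concurrency={concurrency}: ${cost:.6f} per request")
```

Under these assumptions, doubling concurrency halves the compute cost per request, whereas the token charge for a managed LLM call stays the same no matter how many calls run in parallel. Tracking the two in one blended number hides that difference.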

GKE release notes this week included minor control-plane and node-image fixes — nothing that changes this core calculus, but check your cluster versions and autoscaler behavior if you're using GKE for hosting retrieval/embedding services. (If you missed it, our previous coverage of Cloud Run inference and Gemini 2.5 patterns is still relevant: Vertex AI: Gemini 2.5 Flash–Lite GA — Cloud Run GPUs GA and GKE Inference Updates.)

My take: Google is doing the right thing by making the agent billing model explicit; opacity was the real problem. But the platform design burden now shifts to teams. Bundling model calls inside opaque agent semantics was convenient, but convenience without instrumentation is a cost time bomb. If your platform instruments LLM calls the way it instruments HTTP microservices, with per-call metrics and tracing, you win; if you keep thinking in single-request terms, you're going to get surprised.

Prediction: over the next 6–12 months we'll see two distinct platform patterns emerge: (1) low-latency, high-throughput inference clusters on Cloud Run or GKE for predictable per-request compute costs, and (2) a managed-LLM fabric for variable, agent-driven interactions instrumented at token granularity. The teams that separate these responsibilities now will avoid the billing chaos that others are about to discover the hard way.
