We deployed two agents to Google's Agent Engine this sprint. The surprising part was how little code changed. Locally you run the agent in-process. In the cloud you resolve it by name and stream from it. That is the entire code diff. Everything that is actually hard about production is not in that line. It is in what now has to surround it.
This is the second post in our series on Google's Agent Platform. The first covers building the agent; the third covers trusting it once it is live.
TL;DR
- The local-to-cloud code change is one line: a local runner becomes agent_engines.get(resource).stream_query(...). We deployed two agents and the calling code was otherwise identical.
- The runtime bills per request with no idle charge, cold-starts in under a second, and runs operations up to seven days. The gotchas that bit us were not in the code: a region mismatch fails silently, and the resource is permanent until you delete it.
- State splits in two: Sessions for this chat, Memory Bank for every chat. We costed Memory Bank at roughly $13 a month for 200 active users.
- The governance floor (registry, policies, gateway, identity, audit) is configured, not coded. It is the wire-level half of safety.
- Cost control is a set of levers you pull only when traffic forces it.
The one line
Here is the whole code difference between local and production. Locally, you invoke the agent in-process. In the cloud, the only thing that changes in your calling code is the line that resolves the agent:
    from vertexai import agent_engines

    # the one line that changes: resolve the deployed agent by name
    eng = agent_engines.get("projects/YOUR_PROJECT/locations/us-central1/reasoningEngines/AGENT_ID")

    # open a session for the user, then stream events from the deployed agent
    sess = eng.create_session(user_id="alice")
    for ev in eng.stream_query(user_id="alice", session_id=sess["id"], message="..."):
        print(ev)
Deploying is one CLI call (adk deploy agent_engine), which packages the code, builds a container, and registers it as a Reasoning Engine. After that the agent is addressable from anywhere with credentials. The same code path runs locally with adk run; only the resource name changes.
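Since the resource name is the only thing that varies between environments, it helps to build and validate it in one place. A minimal sketch, assuming the name always follows the shape shown above; the helper names are ours and hypothetical, not part of the SDK:

```python
import re

# Shape inferred from the resource name shown above; treat it as an assumption.
_RESOURCE_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/reasoningEngines/(?P<agent_id>[^/]+)$"
)


def engine_resource_name(project: str, location: str, agent_id: str) -> str:
    """Build the full resource name that gets passed to agent_engines.get()."""
    return f"projects/{project}/locations/{location}/reasoningEngines/{agent_id}"


def parse_engine_resource_name(name: str) -> dict:
    """Split a resource name into its parts; raise ValueError if it is malformed."""
    match = _RESOURCE_RE.match(name)
    if match is None:
        raise ValueError(f"not a Reasoning Engine resource name: {name!r}")
    return match.groupdict()
```

Keeping the project, location, and agent ID in configuration and assembling the name at startup means a typo fails immediately instead of at the first query.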
So if the code barely changes, what does? Four things the agent now has to do that it never had to do on your laptop: it has to run, remember, behave, and stay affordable.
It has to run
On your machine the agent runs because your terminal is open. In production something has to host it, survive restarts, scale, and not cost money while idle. That is the Agent Engine runtime.
It bills per request past a monthly free tier, with no idle GPU or CPU charge, cold-starts in under a second, and supports long-running operations up to seven days. A small agent handling a hundred user-minutes a day fits inside the free tier. That part is smooth. The parts that bit us were operational, not in the code:
- The staging bucket region must match the engine region. An engine in us-central1 with a bucket somewhere else fails silently. No error, just no deploy.
- The resource is permanent. A Reasoning Engine stays registered, and billable-capable, until you explicitly delete it. Track them in infrastructure-as-code or you will lose one.
- The first deploy takes three to five minutes; later ones reuse the base image and drop to about a minute.
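Because the region mismatch fails without an error, a pre-deploy check that fails loudly is cheap insurance. A minimal sketch, assuming you already have both locations as strings; the function name is hypothetical, and how you look up the bucket's location is left out on purpose:

```python
def check_staging_region(engine_region: str, bucket_location: str) -> None:
    """Raise before deploying if the staging bucket is not in the engine's region.

    Both arguments are plain strings. The comparison is case-insensitive
    because the two services may report the same region in different cases.
    """
    engine = engine_region.strip().lower()
    bucket = bucket_location.strip().lower()
    if engine != bucket:
        raise ValueError(
            f"staging bucket is in {bucket!r} but the engine targets {engine!r}; "
            "the deploy would fail silently"
        )


# Matching regions pass without raising.
check_staging_region("us-central1", "US-CENTRAL1")
```

Running this as the first step of the deploy script turns a silent non-deploy into a one-line error message.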
If your agent serves a live interface, streaming lives here too: bidirectional streaming over WebSocket is the substrate, and the Live API adds parallel ("async") function calling so multiple tool calls fire at once during a response. Google positions that as roughly halving latency on multi-tool queries. (We have not measured that ourselves; it is a platform capability in preview, not our number.)
It has to remember
Locally, conversation state lives in a variable. In production it has to persist across restarts and, often, across sessions. The platform splits memory in two.
- Sessions are short-term: the events and a scratchpad for one conversation, persisted by the runtime with no external Redis. Every user turn, model reply, tool call, and sub-agent transfer is one event in the log.
- Memory Bank is long-term, the cross-session memory. It uses an LLM to extract durable facts from session events ("this user prefers metric units"), versions them per user, and injects the relevant ones back into the prompt at request time.
Memory Bank bills at $0.25 per thousand events extracted, and the back-of-envelope is reassuring. Two hundred active users, five chats a week, a dozen events a chat is about 52,000 events a month, which is roughly $13. Session storage is bundled into the runtime's compute, and retrieval at query time is free.
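The back-of-envelope above can be written out so you can plug in your own numbers. A minimal sketch, assuming every event counts as an extracted event and a month is one twelfth of a 52-week year; the function name is ours:

```python
def memory_bank_monthly_cost(
    users: int,
    chats_per_week: int,
    events_per_chat: int,
    price_per_thousand_events: float = 0.25,  # the rate quoted in this post
) -> tuple[float, float]:
    """Return (events per month, dollars per month) for Memory Bank extraction."""
    events_per_year = users * chats_per_week * events_per_chat * 52
    events_per_month = events_per_year / 12
    cost = events_per_month / 1000 * price_per_thousand_events
    return events_per_month, cost


events, cost = memory_bank_monthly_cost(users=200, chats_per_week=5, events_per_chat=12)
print(f"{events:,.0f} events/month, about ${cost:.2f}/month")
```

With the figures from this post it reproduces the 52,000 events and roughly $13 a month.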
It has to behave
A demo agent answers to you. A production agent answers to your security team, your auditors, and every other agent in the org. The Govern pillar is the enterprise floor, and the important thing about it is that it is mostly configured, not coded.
- Agent Registry is the catalog of approved agents and tools, the source of truth for what is deployable.
- Policies are declarative rules for IAM, compliance, and allow/deny per data class.
- Agent Gateway is a front-door proxy with Private Service Connect for private connectivity and auth offload.
- Agent Identity gives each agent its own service account, which makes every action auditable end to end. That is what lets you answer "what did this agent touch?" from the audit log alone.
- The security stack underneath is the familiar GCP perimeter: VPC Service Controls, customer-managed encryption keys, audit logs, and data-loss prevention.
Concretely, that audit question is one Cloud Logging filter on the agent's service-account principal:
    resource.type="aiplatform.googleapis.com/ReasoningEngine"
    protoPayload.authenticationInfo.principalEmail="triage-bot@YOUR_PROJECT.iam.gserviceaccount.com"
    severity>=NOTICE
and it returns every tool call, query, and read that agent made, with caller and timestamp. The payoff shows up under pressure. Picture fifty agents in one project and an exfiltration alert at 2am. Without per-agent identity the audit log says agent-platform-sa did X, which is any of the fifty; with it the log says triage-bot did X, for a specific user, in a specific session, and you are root-causing in minutes instead of days. (That incident is illustrative; the filter is real.)
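With fifty agents you will not want to hand-edit that filter at 2am. A minimal sketch of a helper that fills in the principal for any agent; the function name is hypothetical, and it only builds the filter string shown above rather than calling any logging API:

```python
def agent_audit_filter(service_account_email: str, min_severity: str = "NOTICE") -> str:
    """Build the Cloud Logging filter above for one agent's service account.

    Clauses on separate lines are combined with an implicit AND.
    """
    return "\n".join(
        [
            'resource.type="aiplatform.googleapis.com/ReasoningEngine"',
            f'protoPayload.authenticationInfo.principalEmail="{service_account_email}"',
            f"severity>={min_severity}",
        ]
    )


print(agent_audit_filter("triage-bot@YOUR_PROJECT.iam.gserviceaccount.com"))
```

The resulting string can be pasted into the Logs Explorer or passed as the filter argument to gcloud logging read.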
This is org-level infrastructure: the wire-level, identity-level half of safety. The code-level half, the guardrails that live inside the agent, is the subject of the trust post, and you want both. Stand up the registry and per-agent identity first, the moment more than one agent is deployable, because the catalog is what every other control reads from.
It has to stay affordable
The last thing production adds is a bill that can surprise you, because an agent's cost is variable in a way a server's is not: variable input length times variable output times variable tool fan-out. The mistake that actually bites early is not a missing lever, though. It is a self-hosted endpoint left running: deploy a model to an endpoint, forget it, and you pay hourly per replica whether or not a single request arrives. That is the one we watch for. The rest is a set of levers, and the discipline is to pull them only when traffic forces it.
- Revisions and traffic splitting let you canary a new version with a single API call that sets the split (say 95/5) on the live engine, instead of standing up a second service behind a load balancer. (Public preview; we have not wired it into our pipeline yet.)
- Context Caching caches a stable prompt prefix so you stop paying to re-send the same long system instruction every call.
- Tuning (supervised, LoRA, or distillation) is a last resort. Base gemini-2.5-flash usually wins; tune only when style or schema cannot be specified in a prompt.
- Endpoints for self-hosted models bill hourly per replica even when idle. The default trap is deploy, forget, get billed. Set scale-to-zero on day one.
- Batch Inference is about half the price of online and runs asynchronously, right for evals, backfills, and classification where latency does not matter.
- Provisioned Throughput reserves capacity with a tokens-per-second guarantee, and pays off past roughly 70% sustained utilization of a unit.
Leave these untouched until traffic forces them, and the math is why. A small deployment sits inside the runtime's monthly free tier, memory costs land in the low tens of dollars at a couple hundred users, and per-request billing beats any reservation until you have sustained load. Reaching for cost levers before you have a cost problem is its own kind of waste. The one setting to apply on day one is scale-to-zero on any endpoint, so the idle-replica trap can never spring.
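Two of those rules of thumb reduce to arithmetic. A minimal sketch, assuming a flat hourly rate per replica and treating the 70% figure as a hard threshold; the function names are ours and the rate in the example is a placeholder, not a real price:

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def idle_endpoint_monthly_cost(hourly_rate_per_replica: float, min_replicas: int) -> float:
    """Cost of an endpoint that receives no traffic for a whole month."""
    return hourly_rate_per_replica * min_replicas * HOURS_PER_MONTH


def provisioned_throughput_pays_off(sustained_utilization: float, threshold: float = 0.70) -> bool:
    """Apply the rough 70% sustained-utilization rule of thumb from the list above."""
    return sustained_utilization >= threshold


# Placeholder rate of 1.00 per replica-hour, purely for illustration.
print(idle_endpoint_monthly_cost(1.00, min_replicas=1))  # one forgotten replica
print(idle_endpoint_monthly_cost(1.00, min_replicas=0))  # scale-to-zero
print(provisioned_throughput_pays_off(0.45))
```

Setting the minimum replica count to zero makes the idle cost zero whatever the hourly rate, which is why scale-to-zero is the one day-one setting.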
When you do not build it yourself
One shortcut is worth naming. If your problem is retail or restaurant customer experience, you may not need to build any of the above. Gemini Enterprise for CX is a 2026 wrapper SKU that ships prebuilt Shopping, Food Ordering, and Agent Assist agents, a visual builder called CX Agent Studio, and an omnichannel gateway, with a deploy-in-days pitch rather than build-in-weeks. Kroger, Lowe's, Papa Johns, and Woolworths are named launch deployments. (Those are external references from the launch, not our deployments.) The trade is the usual one: less control, faster time to value.
The take
Going to production is not a rewrite. It is one line of code and a set of responsibilities that line quietly takes on: somewhere to run, a memory, a rulebook, and a budget. The platform has a managed answer for each, and the discipline is to turn them on in that order, only as far as your traffic and your auditors actually require. The agent was the easy part. What surrounds it is the job.




