On Building an Agent

April 21, 2025

During one of my first few weeks at work, I remember talking to a senior engineer about the surging hype around AI products. I brought the derisive outlook characteristic of inexperience, brushing away the concerted efforts of some very smart engineers and entrepreneurs with "they're just making OpenAI API calls". I have since developed a dislike for people who speak with the self-assurance of my old self from a few months ago, and I've aligned myself more closely with this post I saw on TikTok.

The engineer explained in a not so unserious tone that the term "wrapper" had dubious utility in the software engineering domain. Any metal shell could be fashioned to house a supercar engine, but that shell does not assume supercar status so easily. That engine would have to be placed in an equally imposing body, and only out of the harmony of their strengths would a supercar be born. To bring about such a union, you would need equally competent builders, a defiant vision, and so on. The LLM APIs, then, can only be said to be the engine. What we do with them is a whole other engineering pursuit.

A few weeks after this conversation, I was handed an engine, the Vertex AI API, and tasked with building an AI agent for internal use at my company. I had little knowledge of what defined an agent, let alone how to build one. My first spike landed me on this article from Anthropic, which, foundationally, defined "agent" for me. It drew the important distinction between workflows and agents, and I quote, "agents [...] are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks." I had grown fond of Sonnet 3.5, so Anthropic's words were good enough to be gospel to me.

The next few spikes got me through the foundational knowledge needed to start building, although that wouldn't start until a month later. I had a good deal of context already from a few relevant classes I had taken in university (I had just taken Applied Machine Learning in my final semester), so none of this was exactly new to me, but my course content was too low-level to be of much help for the task at hand. I wasn't being asked to build a convolutional neural network from scratch anymore but rather being asked to solve a business problem. So I got myself up to speed with the basics of chunking text, working with embeddings and vector databases, and LLM function calling — the engine had been taken care of; all that was left to do was to build a worthy shell.

We looked into a few different agent-building frameworks that are already quite popular in this space, including LangGraph, but also CrewAI, Flow, and Agno, to name a few others. They promised to abstract away some of the major heavy lifting involved in building an agent, but on the other hand, each of them seemed to impose its own philosophy of agent building. At a minimum, CrewAI required us to name our modules in accordance with its naming convention, while frameworks like Controlflow did not natively support parallel agent calls. Not that nothing like the internal tool we were building had been done before, but we had certain constraints that would have required us to jump through extra hoops had we implemented it with the help of a framework.

Eventually we realized that frameworks offered us little value. We still had to define the functions (the LLM's tools) ourselves, and an off-the-shelf framework would only be wrapping our code while largely closing off customization options. At this stage, we decided to build our own agentic framework, one that would be modular and extensible by default and allow future developers to easily add more agents as they wished. We built the autonomous agents our functional requirements called for, but the way our system has evolved, adding another tool involves only three steps: creating your tool module, writing prompts, and placing it in a folder where the orchestrator can access it.

Framework

We built an AI agent that sits in the Zendesk ticket sidebar and has access to the support ticket that a customer support agent is viewing. Internally, it can access enterprise knowledge necessary to help customer support answer customers' queries more efficiently and alleviate much of the research work involved in resolving a ticket. To achieve this functionality with our custom framework, we came up with our own nomenclature, and in the following sections, I'll describe a few basic concepts.

Key Concepts

  1. Orchestrator: The Orchestrator is the central component of our system that presides over the agentic architecture. It is responsible for receiving user queries, invoking appropriate agents, and returning the final, synthesized response to the user. It maintains state for the entire duration of a chat between a user and the system in the context of a single ticket by using the ChatSession class from the Vertex AI SDK.

    The Orchestrator uses a Router to determine which agents to invoke based on the user's query, ticket context, and conversation history, and it invokes the agents in parallel, allowing for faster response times and more efficient processing of user queries. It generates an enhanced query, a plain-text version of the user's query augmented with further context, to send to the selected agents.

    Each chat session has its own Orchestrator instance, which helps demarcate concurrent chat sessions. When a connection is first established, the backend's entry point sets up a chat session and hands it off to the Orchestrator to handle subsequent requests.

  2. Agent: An agent, inheriting the BaseAgent class, is a semi-autonomous retrieval module capable of generating results based on plain text queries sent to it. Since agents are independent of one another, some are more complex than others, depending on the data they work with. A minimal sketch of this contract, and of how the Orchestrator fans out to agents in parallel, follows this list.
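To make the Orchestrator-Agent contract concrete, here is a minimal sketch rather than our production code: BaseAgent and guides_agent are real names from our system, but the run method, the asyncio.gather fan-out, and the stubbed return value are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Common contract for retrieval agents: plain text query in, plain text results out."""

    name: str = "base"

    @abstractmethod
    async def run(self, enhanced_query: str) -> str:
        """Retrieve and format results for the given enhanced query."""


class GuidesAgent(BaseAgent):
    name = "guides_agent"

    async def run(self, enhanced_query: str) -> str:
        # In our system this would: enhance the query for vector and token search,
        # run the hybrid Spanner search, format the hits, and let the agent's own
        # LLM decide which results to pass back to the Orchestrator.
        return "formatted guide excerpts and links"


async def invoke_agents(agents: list[BaseAgent], enhanced_query: str) -> dict[str, str]:
    """Fan out to the selected agents in parallel and collect their results by name."""
    results = await asyncio.gather(*(agent.run(enhanced_query) for agent in agents))
    return dict(zip((agent.name for agent in agents), results))
```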

Utility Classes

  1. Router: When the Orchestrator receives a message from the user, it decides whether to respond directly to the user or invoke retrieval agents. The Router handles this decision-making, and if agents do need to be invoked, the Router will also generate a query enhanced with context to send to the agents. As mentioned earlier, the agents do not maintain state and cannot see the base chat history, so the enhanced query gives them enough context to retrieve the required information.

    Agents will also often need to augment the query that they receive with additional context. This is because different agents work with different data sources and retrieval methods, and the context of the query may change based on the agent being invoked. Currently, each agent implements its own enhancement strategy, which yields queries tailored for vector search or full-text search. For vector search, enhancers will generate a semantically rich query, while for full-text search, they will generally generate a string of tokens.

  2. SocketIO Connection Handler: The connection protocol between the frontend and backend services was a point of contention in the early days of the project, and we considered various options, including REST, SSE, gRPC, and WebSocket. We eventually settled on WebSockets, since they are flexible and relatively easy to set up. For the first few development sprints, the WebSocket connection was implemented simply using FastAPI's built-in WebSocket module, but this proved to be more cumbersome than we had expected: we had to implement a complex queue system for emitting messages and a singleton connection manager. Libraries like SocketIO, on the other hand, already take care of much of the mechanics of establishing and keeping a WebSocket connection alive, such as retries, heartbeats, and polling as a fallback. These are all particularly important for our use case, since it is a chat-based application, and a long-running synchronous process could easily cause the connection to time out.

    Eventually, we migrated to SocketIO, writing a SocketIOConnectionHandler class wrapping SocketIO sessions that allowed us to simplify the WebSocket implementation (a sketch of this handler appears after this list). To reduce cognitive load, we named the class methods after FastAPI's WebSocket module, such as send_text and send_json. It also implements a special send_thought method, which may be used by agents or other modules to send "thinking" messages that are shown to the user in the chat interface while the system is processing their request. These are short, informative messages that hint at what the system is up to, which keeps the user engaged while giving them the opportunity to better engineer their prompts.

  3. Response Handler: The ResponseHandler is a simple utility class that the Orchestrator uses to interface with the user. A lot of the time, the user's message can be answered using the context already available to the Orchestrator's driver LLM, and invoking agents is not necessary. If the user is being conversational, or if they're asking about the current ticket, the LLM can generate a response using its context window. On the other hand, if the user is asking for external information, for example, "can you find relevant documentation?", the Orchestrator will invoke the appropriate agents to retrieve the necessary information. In either case, the ResponseHandler is used to send the LLM's response to the user.

  4. LLM Manager: VertexAI is at the core of our project, and we made use of several APIs from VertexAI's Python SDK to generate responses with the gemini-2.0-flash-001 model. The LLMManager class is a wrapper around the VertexAI SDK that streamlines the code for interacting with the LLM. It mimics VertexAI's methods, which helps avoid confusion when referring to the VertexAI documentation, and it cuts down on the boilerplate we would otherwise write for each LLM call, such as converting the response to a Python dictionary or streaming the response to the frontend.

    In our first iteration, we were storing the chat history in memory as one string, appending new messages from both the user and the LLM to it as they came. This approach was a little rudimentary, and we had to be very careful not to append the wrong message to the chat history. The system makes several calls to LLMs outside the context of the base chat, for example, when agents make decisions, and the responses from these requests do not need to be added to the main chat history. Later, we abstracted this away using the ChatSession class from VertexAI, which makes chat session handling much easier. The ChatSession class has methods for sending messages to the LLM, and we can simply pass around an instance of ChatSession to reference a particular chat context.

    We also heavily leveraged the controlled generation API to get structured responses from the LLM. As an agent progresses through its workflow, at each juncture it might need to make a decision and generate input for the next step. Instead of making two separate calls to the LLM and incurring the latency of both, we can define a JSON schema for the response and easily convert the LLM's output to a Python dictionary. For example, when the Orchestrator receives a message from the user, it can choose to either respond directly or invoke retrieval agents to help ground its response in relevant information. Instead of making one call to make a decision, a second to generate a routing pathway or a direct response, and possibly a third to generate a query for the retrieval agents, we can condense all of this into a single call. The LLM will then generate either a direct response or a routing pathway with a query for the retrieval agents (a simplified sketch follows this list).
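For a rough picture of the connection handler from item 2, here is a minimal sketch assuming the python-socketio AsyncServer; the event names and payload shapes are illustrative, not our actual protocol.

```python
import socketio

# AsyncServer mounted on the FastAPI app as an ASGI sub-application.
sio = socketio.AsyncServer(async_mode="asgi", cors_allowed_origins="*")


class SocketIOConnectionHandler:
    """Wraps one Socket.IO session, mimicking FastAPI's WebSocket method names."""

    def __init__(self, sid: str):
        self.sid = sid

    async def send_text(self, text: str) -> None:
        await sio.emit("message", text, to=self.sid)

    async def send_json(self, payload: dict) -> None:
        await sio.emit("message", payload, to=self.sid)

    async def send_thought(self, thought: str) -> None:
        # Short, ephemeral "thinking" updates rendered in the chat UI.
        await sio.emit("thought", {"text": thought}, to=self.sid)
```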
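And to show what the single routing call from item 4 might look like with controlled generation, here is a simplified sketch using the Vertex AI SDK directly; our LLMManager and ChatSession wrapping are omitted, and the schema fields, project ID, and prompt are made up for illustration.

```python
import json

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project
model = GenerativeModel("gemini-2.0-flash-001")

# Illustrative schema: the model either answers directly or names agents to invoke,
# along with an enhanced query for them.
routing_schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["respond", "invoke_agents"]},
        "direct_response": {"type": "string"},
        "agents_to_invoke": {"type": "array", "items": {"type": "string"}},
        "enhanced_query": {"type": "string"},
    },
    "required": ["action"],
}

response = model.generate_content(
    "The user asked: 'can you find relevant documentation?' Decide the next step.",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=routing_schema,
    ),
)
decision = json.loads(response.text)  # e.g. {"action": "invoke_agents", ...}
```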

A Word on Databases

Spanner

We evaluated CloudSQL, Pinecone, Weaviate, and a few other databases popular in AI applications. Eventually, we landed on Google Cloud Spanner. Although Spanner is much more powerful than what our app currently leverages, it shines in several areas where other vector databases fell short. For example, Spanner let us store vector embeddings and tokens in one table, and we could perform hybrid searches and reciprocal rank fusion (RRF) on Spanner using a single query. We ran several tests on our data, and performance gains seemed to level off after 400 processing units, which meant that we would not have to scale beyond 0.4 Spanner nodes, could create as many separate databases as we wanted, and would only pay for compute and storage.
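For readers unfamiliar with RRF: it simply sums reciprocal ranks across the result lists being fused. We compute it inside the Spanner query itself, but the Python sketch below illustrates the scoring (k=60 is just the conventional constant, and the document IDs are made up).

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists (e.g. vector hits and full-text hits) into one.

    Each document scores sum(1 / (k + rank)) over every list it appears in.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical usage: guide-7 ranks well in both lists, so it comes out on top.
print(reciprocal_rank_fusion([["guide-12", "guide-7"], ["guide-7", "guide-3"]]))
```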

On a lower level, it might be interesting to note that creating connections to Spanner is expensive, and google-cloud-spanner is already designed to handle connection pooling. Our SpannerConnectionHandler class implements a singleton factory pattern that keeps a single connection to a Spanner instance alive for the entire lifetime of the application while maintaining a dictionary of references to the separate databases in the Spanner instance. For example, you can "ask" for a reference to the cache database by writing SpannerConnectionHandler("cache").get_database().
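A minimal sketch of that handler, assuming the google-cloud-spanner client (the instance ID below is a placeholder):

```python
from google.cloud import spanner


class SpannerConnectionHandler:
    """Singleton factory: one Spanner client per process, cached database handles."""

    _client = None
    _databases: dict = {}

    def __init__(self, database_id: str):
        self.database_id = database_id
        if SpannerConnectionHandler._client is None:
            # Creating the client (and its session pool) is expensive; do it once.
            SpannerConnectionHandler._client = spanner.Client()

    def get_database(self):
        if self.database_id not in self._databases:
            instance = self._client.instance("support-agent")  # placeholder instance ID
            self._databases[self.database_id] = instance.database(self.database_id)
        return self._databases[self.database_id]


# Ask for a handle to the cache database, exactly as described above.
cache_db = SpannerConnectionHandler("cache").get_database()
```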

Firebase

To persist the chat history, since we didn't have relational data and the data itself wasn't really structured, we decided to go the NoSQL route. We also didn't need to show our users their past chats; their chat data would only be visible to us developers, so that we could improve the app (which also lets users submit feedback). For our use case, then, the Firebase Firestore free tier was good enough.

When a client connects to the backend, the main module will create a ChatHistory object for the chat session. The ChatHistory object stores messages in memory until the client disconnects, which is when the socket disconnect event handler calls the save_chat_history method from FirestoreHandler to insert the chat history into Firestore. FirestoreHandler is a singleton class that handles the connection to the Firestore database and allows the SDK to handle connection pooling.
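A minimal version of FirestoreHandler and save_chat_history, assuming the google-cloud-firestore client (the collection name and document shape here are made up):

```python
from google.cloud import firestore


class FirestoreHandler:
    """Singleton wrapper around the Firestore client; the SDK handles connection pooling."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.client = firestore.Client()
        return cls._instance

    def save_chat_history(self, session_id: str, messages: list[dict]) -> None:
        # Called from the socket disconnect handler once the chat session ends.
        self.client.collection("chat_histories").document(session_id).set(
            {"messages": messages}
        )
```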

Application Flow

To understand the application flow, it might help to imagine a sample conversation with the agent. From this point on, we will refer to the LLM driving the Orchestrator as "Agent".

When you first open a ticket on Zendesk, the frontend will establish a connection with the backend using the socketio-client library and send the current ticket ID and your username as connection parameters. This connection will stay open for as long as the ticket is open as a tab on Zendesk. The backend accepts this connection, and creates a chat session state dictionary, which stores an Orchestrator instance, an LLMManager instance, and an is_ready flag that indicates whether the system is ready to accept messages.
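Sketched with the python-socketio server, that connection setup might look like the following; the auth field names are assumptions, while the state keys mirror what is described above.

```python
import socketio

sio = socketio.AsyncServer(async_mode="asgi")
chat_sessions: dict[str, dict] = {}  # per-connection state, keyed by Socket.IO session ID


@sio.event
async def connect(sid, environ, auth):
    # The frontend sends the ticket ID and username as connection parameters.
    chat_sessions[sid] = {
        "ticket_id": auth["ticket_id"],
        "username": auth["username"],
        "orchestrator": None,  # an Orchestrator instance in the real system
        "llm_manager": None,   # an LLMManager instance in the real system
        "is_ready": False,     # set to True once the augmented summary is cached
    }
```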

Upon connection establishment, the frontend sends the Zendesk ticket object to the backend. The main module processes it, extracting key information such as metadata and conversation history, and invokes multiple agents in parallel to help generate a summary of the ticket augmented with relevant information from other sources. This "augmented summary" is cached and used to initialize Agent's base chat session. When this is done, the is_ready flag is set to True, the Orchestrator sends a greeting message to the user, and the frontend allows the user to send messages to Agent.

If you write "hi" in the chat window, the frontend emits a message event over the WebSocket connection. The prompts for Agent are written in a way that positions the Orchestrator as a facilitator between the user and Agent. When Agent was initialized, it was already informed of its role, the context of the conversation, and some other relevant system instructions that help personify it. The Orchestrator will now tell Agent, using its LLMManager, that the user has sent a message and ask it for the next step. Agent's Router will most likely decide to respond to the user directly, and the Orchestrator will use ResponseHandler to send Agent's response back to the user.

Next, if you ask "can you find relevant documentation?", the Orchestrator will again forward your message to Agent. Agent's Router will now decide which retrieval agents to invoke. Let us imagine that the Router decides to invoke the guides_agent, and so it generates an enhanced query for it.

The guides_agent receives the enhanced query, and it will generate its own pair of enhanced queries for vector and token search. The vector search query is vectorized, and the vector array and token string are used to construct a Google Spanner query. This query will return a list of relevant user guides, their contents, and links, and guides_agent will take these results and format them into a string. This string is then sent back to the driver LLM of the guides_agent, which decides which results to include in its response to the Orchestrator.

The Orchestrator receives this response from the guides_agent, and it will inform Agent that the retrieval agents that it had just invoked have returned their results. Agent will now synthesize the information and generate a final response for the user, which is sent back to the frontend and displayed to the user.

At each LLM call, the LLMs are instructed to generate "thinking messages" to keep the user updated on what Agent is doing. These are short, playful, and ephemeral messages that tell the user if the Agent is looking up information, synthesizing a response, or deciding on the next step.

A Word From Barthes

In an internal document, I wrote the following, and it seems fitting to quote it here to end this already long post.

Paying homage to the Citroën DS, the French essayist Roland Barthes wrote, "I think that cars today are almost the exact equivalent of the great Gothic cathedrals: I mean the supreme creation of an era, conceived with passion by unknown artists, and consumed in image if not in usage by a whole population which appropriates them as a purely magical object" (Barthes). It would be a fair estimation to say that had Barthes experienced artificially intelligent multi-agent systems, he would have dignified the technology with a similar, if not more profound, sentiment.