Can Google's Agent Development Kit Replace Data Pipelines?

by Vishal Chaurasia, May 22nd, 2025

How I rebuilt a fragile data pipeline using modular agents and a chatbot framework — yes, seriously.

From Messy to Modular

If you’ve ever nervously deployed a tiny change in a data pipeline, hoping nothing explodes downstream, you’re not alone.
That was me. The pipeline worked technically, but it became harder to control with every new edge case. It was like playing Jenga with production code.

Around the time LLMs and agent-based workflows started getting serious attention, I was exploring them out of curiosity — mostly to see if they could help with the kind of brittle pipeline logic I was dealing with. This is when I stumbled across Google’s newly released Agent Development Kit (ADK). I wasn’t really looking for a chatbot toolkit, but something about how it handled tasks with modular agents caught my eye.

A toolkit for chatbots? Sure. But it turned out to be much more than that.

What if I could use modular, memory-aware agents to take over parts of my pipeline logic — and let the system handle coordination? That became the starting point for a small but surprisingly useful experiment.

In this post, I’ll walk you through how I redesigned part of my data pipeline using Google’s ADK. I’ll share what worked, what didn’t, and whether this modular, LLM-powered approach is actually ready for real-world use.

What I Was Working With

Here’s the original setup — a classic three-step job:

  • Scheduler: Triggers the pipeline daily
  • Validator: Filters out malformed events
  • Deduplicator: Cleans up redundant records
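Condensed into code, that job looked roughly like this (a hypothetical sketch, not the real pipeline): validation and deduplication were tangled together, so every new edge case meant another branch inside the same function.

```python
# Hypothetical sketch of the original hard-wired daily job.
# Every new edge case meant another branch inside this one function.
def daily_job(raw_events):
    seen = set()
    cleaned = []
    for event in raw_events:
        # Validation: drop malformed events.
        if "user_id" not in event or "timestamp" not in event:
            continue
        # Deduplication: drop exact repeats.
        key = (event["user_id"], event.get("event_type"), event["timestamp"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(event)
    return cleaned
```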

As with any growing system, the codebase got bloated. Every time I patched one edge case, another cropped up, and debugging felt like whack-a-mole.

Eventually, I thought, what if each part of the pipeline had just enough intelligence to adapt independently?

Discovering Google’s ADK

At first glance, ADK looked like a framework for chatbots built on large language models (LLMs). But once I skimmed the docs and examples, it clicked — agents weren’t limited to chat.

With ADK, I could:

  • Define modular agents with memory.
  • Use simple tools to validate, transform, or format data.
  • Chain them together in a declarative flow.
  • Pass structured state between agents using keys.

It felt like infrastructure-as-code for pipeline logic — but powered by LLMs.

Rebuilding a Slice with Agents

I didn’t go for a complete rewrite. Instead, I picked one small part of the pipeline: the transformation module that handles session data, where every new requirement had made the code harder to extend. I split its flow into three steps: validation, deduplication, and sessionization.

Each step became its own agent, coordinated by the root agent (data_pipeline_agent).

Not all agents in ADK are the same. ADK offers several types of agents:

  • LlmAgent — Uses a large language model to interpret instructions and decide how to act.
  • WorkflowAgent — A deterministic, rule-based agent with no LLM involved. It comes in three variants: Sequential, Parallel, and Loop.
  • ToolOnlyAgent — Also rule-based; its sole purpose is to run its associated tool when invoked.
  • SubAgents — Used to define a hierarchy among agents.
  • CustomAgent — Lets you write your own agent implementation for more fine-grained programmatic control.

For my use case, I chose LlmAgent for each component, with each one acting as a sub-agent of data_pipeline_agent. Why?

  • My use case needed to handle changing and sometimes unpredictable input (e.g., evolving event schemas and slightly changing file formats).
  • I wanted the agents to determine their next course of action autonomously.
  • Using LlmAgent allowed me to move the logic outside the code into editable instructions, reducing code churn.

I would have chosen the Sequential workflow agent if I had wanted complete control and reproducibility. But in this case, the flexibility and adaptability of LLMs made more sense.

Each LlmAgent reads a prompt file (like validation_instruct.txt) and uses it to drive decision-making. That’s how I kept logic dynamic without rewriting the code every time business rules changed.

It felt clean. Declarative. And honestly, fun.

Code Snippet — Agent Setup

from google.adk.agents import LlmAgent

# validate_events (the tool function) and load_instruction_from_file
# (a small prompt-loading helper) are defined elsewhere in the project.
validation_agent = LlmAgent(
    name="validation_agent",
    model="gemini-2.0-flash",
    tools=[validate_events],
    instruction=load_instruction_from_file("instructions/validation_instruct.txt"),
    description="You are a data validation agent. Use the tool to validate event logs.",
    output_key="validated_state",
)

All the validation logic lives in a text file. Want to change the rules? Just update the prompt — no need to rewrite code.

(Image in the original post: contents of validation_instruct.txt, used to guide the validation agent’s behavior. Source: author)
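The original post shows the prompt only as an image. As an illustration (not the author’s actual file), validation_instruct.txt might read something like:

```text
You are a data validation agent.
When you receive raw event records, call the validate_events tool.
An event is valid only if it contains user_id, event_type, and timestamp.
Store the tool's output under the key "validated_state", and summarize
how many events were valid and how many were rejected.
```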

Tools Behind the Agents

Each agent had a simple tool associated with it: a lightweight Python function implementing the core logic. The interesting part was not the tools themselves, but that each agent could decide when (or whether) to use them based on the instructions provided:

  • validation_tool: Enforces the expected schema by checking for the required fields (user_id, event_type, and timestamp) and returns a dictionary with lists of valid and invalid events.
  • dedupe_tool: Uses hashing to remove exact duplicates while preserving the original order of events.
  • session_tool: Groups events into sessions based on gaps in user activity. The default threshold is 30 minutes, but if the user supplies a different threshold, the agent can override it via the session_gap_minutes parameter when calling the tool.

The beauty of this setup is that the agents weren’t tightly coupled to their tools; they had context awareness. An agent can trigger a tool or skip it based on the incoming state. This kind of input-driven decision-making is hard to achieve in traditional pipelines without chaining a bunch of if-else logic.
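As a concrete illustration, the validation tool could be as small as this (a sketch based on the schema described above; the real validate_events may differ):

```python
# Hypothetical sketch of the validation tool: checks required fields
# and splits events into valid and invalid lists.
REQUIRED_FIELDS = ("user_id", "event_type", "timestamp")

def validate_events(events):
    """Return a dict with lists of valid and invalid events."""
    valid, invalid = [], []
    for event in events:
        if all(event.get(field) is not None for field in REQUIRED_FIELDS):
            valid.append(event)
        else:
            invalid.append(event)
    return {"valid": valid, "invalid": invalid}
```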

Example Workflow

Suppose we start with a raw JSON file:

  1. load_event_file — A utility tool on the root agent that loads the input file provided by the user. This, too, is selective behaviour the agent may or may not use, depending on the data provided.
  2. validation_agent — Validates records against the data schema and separates them into valid and invalid sets.
  3. dedup_agent — Removes duplicate records.
  4. session_agent — Groups events into sessions by looking at inactivity gaps, typically 30 minutes between user actions.

The final output includes JSON or markdown summaries: count of invalid records, deduplicated items, and session totals. Debugging? A dream.

Modular pipeline flow powered by ADK. The data_pipeline_agent manages state across each specialized agent, source: author


What Worked Well

  • Modularity: Each agent could be tested in isolation.
  • No Glue Hell: ADK handled state transitions between agents.
  • Debugging: Markdown output made it easy to inspect.
  • Extendability: Want to add a step? Just add another agent.

What Still Hurts

  • Sub-Agent Tooling: You can’t use tools inside sub-agents yet.
  • Silent Failures: Mismatched state keys cause quiet errors.
  • No Tracing: ADK lacks built-in logs or failure tracebacks.
  • Missing File Upload: Web UI doesn’t support file uploads natively.
  • Docs Are Early-Stage: Good for basics, light on advanced patterns.

So while it’s promising, you’ll still be doing some trial-and-error — especially for complex orchestration.
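For the silent-failure problem in particular, a cheap workaround is to check the expected state keys between steps. A hypothetical helper like this (not part of ADK) turns a quiet key mismatch into a loud error:

```python
# Hypothetical guard: fail fast when the shared state is missing the
# keys the next agent expects (e.g. "validated_state").
def require_keys(state, keys):
    missing = [key for key in keys if key not in state]
    if missing:
        raise KeyError(f"pipeline state is missing keys: {missing}")
    return state
```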

Try It Yourself

🔗 GitHub Repo: https://github.com/vchaurasia95/adk-data-pipeline

Final Thoughts

I didn’t expect this to work so smoothly. But rethinking my pipeline with ADK turned out to be a pleasant surprise. Instead of layers of brittle logic, I had small, declarative pieces that played nicely together.

No, an LLM agent won’t replace your entire pipeline. But it can definitely make parts of it think, especially those repetitive tasks like validation or formatting.

It’s not magic. But it’s modular, testable, and maybe even a little fun.

I’ll be building on this more. Curious to see what else it can do.
