Available for opportunities

Ashok Tak

Building resilient data platforms and intelligent AI systems. Currently at Lumenalta as Senior Data & GenAI Engineer — architecting data lakehouses and RAG-powered applications across finance, agriculture, energy and e-commerce. 2× Databricks Certified.


How AI Agents Will Disrupt Classical ETL and Data Engineering


For two decades, data engineering has followed roughly the same playbook: extract data from sources, transform it to fit your schema, load it into a warehouse. Rinse, repeat, debug at 2am when it breaks.

The rise of autonomous AI agents doesn't just optimize this workflow — it makes much of it structurally obsolete.

The problem with classical ETL

Classical ETL pipelines are brittle by design. They're built on assumptions: that your source schema stays fixed, that your transformation logic is knowable in advance, that the orchestration DAG accurately reflects reality. Every assumption is a future incident.

The result is a data engineering culture dominated by maintenance. By some estimates, data engineers spend 60–70% of their time on pipeline upkeep rather than building new value. Schema changes break ingestion. New data sources require weeks of integration work. Data quality failures surface quietly — sometimes only when a dashboard shows a CEO a wrong number.

"The most expensive bug in data engineering isn't the one that crashes your pipeline. It's the one that silently passes wrong data downstream for six weeks."

What agents change

AI agents — systems that can perceive state, reason over it, and take autonomous actions — break the core assumption of classical pipelines: that transformation logic must be written by humans in advance.

Instead of a hardcoded pipeline, imagine a system that:

  • Observes a new source's schema and infers transformation rules automatically
  • Detects when upstream data drifts from expectations and heals or quarantines it
  • Generates and tests transformation code from a natural language specification
  • Replans its execution graph dynamically when upstream conditions change
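The loop behind such a system can be sketched in a few lines. This is a minimal, hypothetical skeleton — the state keys and action names are illustrative placeholders, not any real platform's API — showing the perceive-reason-act cycle the bullets above describe:

```python
# Hypothetical agent step: inspect simplified pipeline state and pick an action.
# All keys and action names are illustrative, not tied to a real tool.
def agent_step(state: dict) -> str:
    """One perceive-reason-act cycle over observed pipeline state."""
    if state.get("schema_drift"):
        return "propose_migration"   # infer and stage new transformation rules
    if state.get("quality_anomaly"):
        return "quarantine_batch"    # heal or hold drifting upstream data
    if state.get("stale_outputs"):
        return "replan_dag"          # rebuild the execution graph dynamically
    return "idle"

print(agent_step({"schema_drift": True}))  # propose_migration
```

A production version would replace each branch with an LLM-backed planner, but the control flow — observe, decide, act, repeat — is the same.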

This isn't speculative. These capabilities exist today in nascent form across tools like dbt, Fivetran, Anomalo, and the emerging wave of AI-native data platforms. The trend is clearly toward composing them into autonomous, self-managing data workflows.

Where disruption hits hardest

Schema management. Classical pipelines break when schemas change. Agents can monitor schema drift in real time, propose and apply migrations, and validate that downstream consumers still receive what they expect — without human intervention for routine changes.
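A rough sketch of that triage logic, assuming a simple column-name-to-type schema representation (the field names and the additive-is-safe policy are illustrative assumptions, not a standard):

```python
# Illustrative drift triage: additive changes are routine and auto-applied;
# removals or type changes can break downstream consumers, so they escalate.
EXPECTED = {"order_id": "bigint", "amount": "decimal", "created_at": "timestamp"}

def classify_drift(observed: dict) -> dict:
    """Compare an observed source schema against the expected one."""
    added = {c: t for c, t in observed.items() if c not in EXPECTED}
    removed = {c: t for c, t in EXPECTED.items() if c not in observed}
    retyped = {c: (EXPECTED[c], t) for c, t in observed.items()
               if c in EXPECTED and EXPECTED[c] != t}
    if removed or retyped:
        action = "escalate"      # breaking for downstream consumers
    elif added:
        action = "auto_migrate"  # purely additive: routine, safe to apply
    else:
        action = "no_op"
    return {"added": added, "removed": removed, "retyped": retyped, "action": action}

# A new nullable column appears upstream -> routine migration, no human needed.
print(classify_drift({"order_id": "bigint", "amount": "decimal",
                      "created_at": "timestamp", "channel": "string"}))
```

The agent's real work is everything around this function — generating the migration DDL and re-validating consumers — but the routine/breaking split is what keeps humans out of the loop for safe changes.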

Data quality. Rule-based quality checks — nulls, ranges, referential integrity — are table stakes. Agents can apply semantic reasoning, flagging that "revenue for January looks inconsistent with last year's trend" and routing anomalies for review rather than silently passing bad data downstream.
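The simplest version of that semantic check is a year-over-year comparison with a tolerance band — a deliberately minimal sketch (the 30% threshold and metric names are assumptions), where the key design choice is that anomalies are routed for review rather than passed through:

```python
# Minimal semantic check: compare a metric against last year's baseline and
# quarantine outliers instead of silently passing them downstream.
# The 30% tolerance is an illustrative assumption.
def review_metric(current: float, prior_year: float, tolerance: float = 0.30) -> str:
    """Return 'pass' or 'route_for_review' for a year-over-year comparison."""
    if prior_year == 0:
        return "route_for_review"  # no baseline to reason against
    drift = abs(current - prior_year) / prior_year
    return "pass" if drift <= tolerance else "route_for_review"

# January revenue 40% below last January's -> flagged for a human.
print(review_metric(current=600_000, prior_year=1_000_000))  # route_for_review
```

An agent generalizes this from one hardcoded rule to reasoning over seasonality and correlated metrics, but the routing discipline — never silently pass an anomaly — stays the same.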

Transformation logic. dbt models, Spark jobs, SQL transforms — these are code artifacts that encode business logic. Agents can generate first-draft transformations from plain-language specifications, iterate with tests, and explain what they do in plain English. The human role shifts from writing SQL to reviewing it.
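What "iterate with tests" looks like in practice is a harness that runs the candidate SQL against a fixture before anything is promoted. A minimal sketch using Python's built-in sqlite3 — the SQL string here stands in for model output (in a real system it would come from an LLM call), and the table and column names are illustrative:

```python
import sqlite3

# Stand-in for agent-generated SQL; in practice this is model output.
CANDIDATE_SQL = ("SELECT customer_id, SUM(amount) AS total "
                 "FROM orders GROUP BY customer_id")

def passes_review(sql: str) -> bool:
    """Run candidate SQL against an in-memory fixture and check the result."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [(1, 10.0), (1, 15.0), (2, 7.5)])
    rows = dict(con.execute(sql).fetchall())
    # The business rule the spec demands: per-customer totals must match.
    return rows == {1: 25.0, 2: 7.5}

print(passes_review(CANDIDATE_SQL))  # True
```

Failures loop back to the agent with the failing case attached; only passing candidates reach a human reviewer, which is the shift from writing SQL to reviewing it.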

Orchestration. Static DAGs assume a fixed execution graph. Agents can determine dynamically what needs to run based on data availability, freshness requirements, and downstream dependencies — closer to intent-based orchestration than schedule-based cron jobs.
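The difference from cron is easy to show in miniature. In this hedged sketch (task names, freshness targets in hours, and the single-pass planner are all simplifying assumptions), nothing runs on a schedule — a task runs only when its output is staler than its declared target and its inputs are, or are about to be, fresh:

```python
# Intent-based scheduling sketch: declare freshness targets, derive the plan.
# Task names and SLA hours are illustrative assumptions.
FRESHNESS_SLA_H = {"raw_orders": 1, "orders_clean": 2, "revenue_daily": 24}
DEPS = {"orders_clean": ["raw_orders"], "revenue_daily": ["orders_clean"]}

def plan(age_h: dict) -> list:
    """Return tasks to run, in dependency order, based on staleness alone."""
    runnable = []
    for task in ("raw_orders", "orders_clean", "revenue_daily"):
        # An input counts as fresh if it meets its SLA or is already planned.
        inputs_fresh = all(d in runnable or age_h[d] <= FRESHNESS_SLA_H[d]
                           for d in DEPS.get(task, []))
        if age_h[task] > FRESHNESS_SLA_H[task] and inputs_fresh:
            runnable.append(task)
    return runnable

# Everything is stale -> the whole chain runs, in order.
print(plan({"raw_orders": 3, "orders_clean": 5, "revenue_daily": 30}))
```

Change the ages so only the daily rollup is stale and the plan shrinks to that one task — the execution graph is derived from intent, not fixed in advance.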

What doesn't change

Not everything is disrupted. Data contracts — formal agreements between producers and consumers about what data looks like — become more important, not less. Agents need clear interfaces to reason against. Without well-defined contracts, autonomous agents will make reasonable-sounding mistakes with confidence.
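Concretely, a contract is just a machine-checkable interface the agent validates against before acting. A minimal sketch — the field names and the dict-based contract shape are illustrative, not any particular contract standard:

```python
# Minimal data contract sketch: required fields plus expected types.
# Field names and the contract shape are illustrative assumptions.
CONTRACT = {
    "fields": {"order_id": int, "amount": float, "currency": str},
    "required": {"order_id", "amount"},
}

def violations(record: dict, contract: dict = CONTRACT) -> list:
    """Return contract violations; an empty list means the record conforms."""
    errs = [f"missing:{f}" for f in contract["required"] if f not in record]
    errs += [f"type:{f}" for f, v in record.items()
             if f in contract["fields"]
             and not isinstance(v, contract["fields"][f])]
    return errs

# amount arrives as a string -> the agent gets a hard violation to act on,
# instead of a plausible guess to make with confidence.
print(violations({"order_id": 42, "amount": "12.50"}))  # ['type:amount']
```

Without this explicit boundary, an agent would happily coerce the string and move on — the reasonable-sounding mistake the paragraph above warns about.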

Data governance and lineage remain deeply human concerns. Knowing where data came from, who can access it, and what decisions it influences is a compliance and trust question, not a technical one. Agents can surface lineage information, but humans must own the accountability.

The new data engineer

The data engineer's role doesn't disappear — it elevates. The work shifts from pipeline plumbing to defining the systems and contracts that agents operate within. Writing transformation logic gives way to reviewing agent-generated transformations. Debugging broken pipelines gives way to auditing agent decisions.

The engineers who thrive in this shift understand both the data domain and the AI systems operating on it — a combination that's rarer and more valuable than either skill alone. The data engineer becomes part systems architect, part AI auditor.

Classical ETL isn't dying today. But the economic pressure is real: when an agent can onboard a new data source in minutes instead of weeks, the question isn't whether to adopt this approach — it's how fast your organization can.

2024 — Now

Lumenalta

formerly Clevertech

Senior Data Engineer · Mar 2024 — Present

Architected and developed data lakehouses for companies in finance, agriculture, energy and e-commerce using Databricks and the open-source data stack across cloud platforms.

Generative AI Engineer · May 2024 — Apr 2025

Built RAG-enabled GenAI applications automating knowledge-based enterprise workflows. Developed context retrieval systems, skills and tool use in an enterprise chatbot.

Python Spark Kafka Delta Lake Iceberg LangChain LangGraph pgVector AWS GCP Databricks
2023 — 2024

Learn AI Daily

Toronto, ON

Machine Learning Engineer · Oct 2023 — Mar 2024

Created a Generative AI application for a personalized tutoring start-up. Built the backend using open-source GenAI stack.

LangChain llama-index Python GenAI
2021 — 2024

Scotiabank

Toronto, Canada

Data & Machine Learning Engineer · May 2022 — Feb 2024

Built Data Profiling and Data Quality Engines as part of Enterprise Governance Initiatives. Developed ML-based approaches to detect data issues in production pipelines.

Software Engineer, Data Engineering · Jun 2021 — May 2022

Worked on the Enterprise Data Lake (EDL) enhancement team in Global Technology Solutions.

PySpark Hadoop / Hive Python SQL Machine Learning
2018 — 2021

University of Toronto

Research & Teaching

Machine Learning Researcher · Sep 2018 — Dec 2020

Cyber-Physical Security and Smart Grid Security under Prof. Deepa Kundur. Developed ML models for fault detection and explored adversarial attacks on LSTM-based classifiers.

Teaching Assistant · Jan 2019 — Apr 2021

Computer Networks (ECE 361), Programming Fundamentals (ECE 244), Probability & Statistics (ECE 286), Computer Fundamentals (APS 105).

Deep Learning TensorFlow Keras LSTM Adversarial ML Python
2016 — 2018

Tata Steel

Jamshedpur, India

Project Manager, Utilities · Jul 2017 — Jul 2018

Led a team of 22 electrical engineers and technicians at the power station to achieve reliability through digital improvements and automation.

Management Trainee · Jul 2016 — Jun 2017

Load Dispatch Center — designed and evaluated under-frequency (U/F) settings for load-shedding schemes to maintain power network stability (grid islanding contingencies).

Power Systems Automation Team Leadership
2015

Carnegie Mellon University

Kigali, Rwanda

Summer Research Assistant · May — Jul 2015

Designed a smart microgrid testbed integrating renewable energy sources for developing countries. Modelled IEC 61850 communication for Solar Home Systems and developed advanced DC protection schemes.

MASc

University of Toronto

Master of Applied Science · Electrical and Computer Engineering

Research focused on applying machine learning to power infrastructure protection and security. Thesis on adversarial machine learning applied to LSTM-based smart grid classifiers.

BTech

IIT Roorkee

Bachelor of Technology · Electrical Engineering · Indian Institute of Technology

Gold Medalist · Merit-Cum-Means Scholarship · Summer Undergraduate Research Award · Best Paper Certificate

Data Engineering

Databricks Apache Spark Kafka Delta Lake Apache Iceberg Apache Airflow Hadoop / Hive dbt SQL ETL / ELT

GenAI & ML

LangChain LangGraph LangSmith RAG pgVector LLMs TensorFlow Keras Deep Learning

Cloud & Platforms

AWS GCP Azure Databricks Python PySpark Docker

Certifications

Databricks Certified Data Engineer Professional
Deep Learning Specialization — Sequence Models · deeplearning.ai
Deep Learning Specialization — Neural Networks & Deep Learning
Deep Learning Specialization — Structuring Machine Learning Projects

Design of a generic microgrid testbed with novel control and smart technologies

Carnegie Mellon University · Smart Grid Research

Design and simulation of communication architecture for differential protection in IEC 61850 based substations

Substation Automation · IEC 61850 Protection

A Review of New Trends in Power Systems: Microgrids for Rural Electrification and DC Microgrids

Power Systems Review · Renewable Energy

Wireless power grid: Leapfrogging in power infrastructure of developing countries

Energy Access · Developing Countries

MASc Thesis

Adversarial Machine Learning on LSTM-based Smart Grid Fault Classifiers

View on UofT TSpace