Professional Profile

ALLU AKHILESHWARI

Data Engineer

📧 akhileshwariallu@gmail.com 📱 +918008505795

Professional Summary

  • Databricks Engineer with 3+ years of experience in building and supporting cloud-based data platforms.
  • Strong hands-on expertise in Azure Databricks, PySpark, and Spark SQL for large-scale data processing.
  • Experienced in designing and implementing ETL/ELT pipelines using Azure Databricks and Azure Data Factory.
  • Proven ability to transform raw, structured, and semi-structured data into analytics-ready datasets.
  • Hands-on experience with Delta Lake concepts, including optimized storage and reliable data handling.
  • Skilled in developing metadata-driven data pipelines for full, incremental, and snapshot data loads.
  • Experience working with Azure Data Lake Storage Gen2 and Azure Blob Storage as primary data sources.
  • Strong understanding of data warehousing concepts, including fact and dimension modeling.
  • Proficient in SQL and PySpark for complex data transformations and aggregations.
  • Experienced in performance tuning through partitioning, efficient joins, and Spark optimizations.
  • Comfortable working in Agile delivery environments, collaborating with cross-functional teams.
  • Strong analytical and problem-solving skills with a focus on data quality and reliability.

Work Experience

Software Engineer
DC Sand Staffing Private Ltd, Hyderabad
11 Nov 2023 - 20 Feb 2025

Associate
Wipro Limited, Hyderabad
13 Jun 2021 - 03 Mar 2023

Technical Skills

Azure Technologies: ADF, ADLS Gen2, Azure Blob Storage, Azure Databricks (ADB)
Cloud Data Warehouses: Azure Synapse Analytics, Snowflake
Database: SQL Server
Visualization Tools: Power BI

Projects

Project #3: AzureMX
Internal Project
Technologies: Azure Databricks, PySpark, Azure Data Factory, ADLS Gen2, SQL Server
Role: Databricks Engineer
Project Description:

AzureMX (Maximum Acceleration) is a 100% Azure-native, metadata-driven data migration and modernization framework designed to accelerate enterprise data estate transformation. The framework automates full, incremental, and snapshot data loads from multiple source systems into Azure storage and analytics platforms using Databricks and Azure Data Factory.

Responsibilities:
  • Designed Databricks transformation logic using PySpark for full, incremental, and snapshot loads.
  • Processed structured and semi-structured data (CSV, Parquet, TXT) in Azure Databricks.
  • Integrated Databricks pipelines with Azure Data Factory for orchestration and dependency handling.
  • Implemented metadata-driven processing to avoid creating multiple pipelines.
  • Supported backdated data processing and auditing requirements.
  • Optimized Databricks jobs for performance and scalability.
  • Participated in Agile ceremonies and provided regular delivery updates.
Project #2: HPI_TeraRun_Fy22
HEWLETT PACKARD INC — Manufacturing
Technologies: Azure Databricks, PySpark, Azure Data Factory, ADLS Gen2, Azure SQL Server
Cloud Data Warehouse: Azure SQL Server
Role: Databricks Engineer
Project Description:

HPI TeraRun is a large-scale enterprise analytics transformation program aimed at building a unified big data and analytics platform. The solution processes manufacturing, finance, and supply chain data to enable enterprise-wide reporting and analytics using Azure Databricks and cloud-native data services.

Responsibilities:
  • Supported Databricks-based data pipelines processing manufacturing and enterprise datasets.
  • Performed data transformations using PySpark and Spark SQL in Azure Databricks.
  • Ingested data from Azure Data Factory into ADLS Gen2 for downstream Databricks processing.
  • Debugged and resolved pipeline failures by analyzing Databricks jobs and logs.
  • Optimized Spark transformations to improve batch processing performance.
  • Collaborated with business and support teams to resolve incidents and data issues.
Project #1: HedgeAct
Role: Azure Data Engineer
Technologies: SQL Server, Azure Data Factory, Power BI
Cloud Data Warehousing: Azure Blob Storage
Project Description:
Responsibilities:
  • Developed Azure Data Factory pipelines to load data from SFTP sources into Azure Blob Storage.
  • Performed data migration from on-premises systems to Azure SQL Database.
  • Loaded dimension and fact tables into Azure Synapse Data Warehouse.
  • Used Power Query for data transformation based on reporting requirements.
  • Created and deployed tabular models and analytical cubes.
  • Designed Power BI dashboards using charts, KPIs, slicers, maps, and tree maps.
  • Implemented scheduled refresh using on-premises data gateway.
  • Built incident management dashboards with drill-down capabilities.
  • Managed report access, security roles, and user-level permissions.
End-to-end flow: Sources (SQL, legacy databases, flat files) → Metadata Repo (configuration drives behavior) → Azure Data Factory (orchestration) → ADLS Gen2 raw landing zone → Databricks curated layer (merge, snapshot, audit) → Synapse → Power BI.

Tip: In an interview, keep this pipeline in view and be ready to go deep on "metadata-driven", "watermark", "audit", and "backdated replays".

Sources

AzureMX supports multiple enterprise source systems without code changes.

Two-Minute Project Explanation for Interviews

"So, AzureMX—which stands for Maximum Acceleration—was a metadata-driven data migration framework I built entirely on Azure. The main problem we were solving was that every time we had a new source system or a new table to migrate, we were writing custom pipelines from scratch. That was taking too long and wasn't scalable.

What I did instead was create a central metadata repository—basically configuration tables in SQL Server—that would drive all the ingestion, transformation, and orchestration behavior. So now, whether it's SQL Server, legacy databases, or flat files, the same framework handles it. Same for load types—full loads, incremental loads, snapshot loads—all controlled through metadata. And we can send data to ADLS curated zones, Synapse Analytics, Power BI—all without changing any code. To onboard a new table? Just add a row to the metadata table. That's it.

The impact was significant—we reduced development effort by about 70% and were able to support hundreds of tables across multiple enterprise systems. Azure Data Factory reads the metadata and orchestrates everything, Databricks does the PySpark transformations based on the config, and we built a robust auditing framework that tracks everything. It really demonstrates how you can build reusable, scalable frameworks that eliminate repetitive work while maintaining production-grade quality."

Troubleshooting Issues: Common Problems and Solutions
Issue 1: Incremental loads missing records
Problem:

During production runs, we noticed that some records with recent update timestamps weren't being captured in incremental loads. Business users reported missing data in analytics dashboards, particularly for tables with high update frequency. The issue was intermittent—some runs would capture all records, while others would miss 5-15% of expected updates. This created data quality concerns and required manual data reconciliation.

Root Cause Analysis:

After investigating audit logs and comparing source vs. target row counts, we identified three root causes:

1. Watermark Column Inconsistency: The metadata table was configured to use `last_updated_ts` as the watermark column, but some source systems had triggers that updated this column inconsistently. In some cases, bulk updates would modify the data but not update the timestamp column, causing those records to be skipped.

2. Timezone Mismatches: Source systems were in different timezones (EST, PST, UTC), while our watermark comparison logic was using local server time. When comparing timestamps, records updated just before midnight in one timezone would be missed if the comparison happened in another timezone.

3. Clock Skew: Some legacy source systems had clock drift, causing timestamps to be slightly behind actual update times. Our strict "greater than" comparison (`watermark > last_watermark`) missed records that were updated milliseconds before the watermark timestamp.

Solution Implementation:

We implemented a multi-layered solution:

1. Watermark Validation Logic: Added a pre-processing step in Databricks that compares the source system's maximum timestamp with the last successful watermark stored in control tables. If the difference exceeds a threshold (configurable in metadata, default 1 hour), the job logs a warning and can optionally trigger a full reload instead of incremental.

2. Timezone Normalization: Implemented a timezone conversion function that normalizes all timestamps to UTC before comparison. The function reads the source system's timezone from metadata and converts accordingly. This ensures consistent comparisons regardless of source system location.

3. Buffer Window for Clock Skew: Modified the watermark query to use a small buffer window (configurable, default 1 minute). Instead of `watermark > last_watermark`, we use `watermark >= (last_watermark - buffer_minutes)`. This captures records that might have been updated just before the watermark due to clock differences. We handle potential duplicates in the merge logic using primary keys. A simplified sketch of this, together with the UTC normalization from item 2, follows this list.

4. Backdated Processing Feature: Added a metadata flag `allow_watermark_override` and a manual override column. When reprocessing is needed, administrators can update the watermark value in metadata, and the next run will extract all data from that date forward. This is tracked in audit tables with a "reprocess" flag to distinguish from normal runs.
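
Below is a minimal PySpark sketch of the watermark handling from items 2 and 3. The metadata dictionary fields (`watermark_column`, `source_timezone`, `buffer_minutes`) and the helper name are illustrative assumptions, not the exact production code.

```python
from datetime import timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # in a Databricks notebook, `spark` already exists

def filter_incremental(source_df, meta, last_watermark_utc):
    """Keep only rows changed since the last watermark (illustrative helper).

    meta example: {"watermark_column": "last_updated_ts",
                   "source_timezone": "America/New_York",
                   "buffer_minutes": 1}
    """
    # Item 2: normalize the source timestamp column to UTC before comparing.
    wm_utc = F.to_utc_timestamp(F.col(meta["watermark_column"]), meta["source_timezone"])

    # Item 3: subtract a small buffer so clock skew cannot drop records;
    # duplicates introduced by ">=" are resolved later by the keyed merge.
    cutoff = last_watermark_utc - timedelta(minutes=meta["buffer_minutes"])

    return source_df.filter(wm_utc >= F.lit(cutoff))
```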

Impact & Results:

After implementing these fixes, data completeness improved from 85-90% to 99.8%. The validation logic caught 12 instances where source systems had timestamp inconsistencies, allowing us to proactively fix those issues. The backdated processing feature reduced manual data reconciliation time by 80%, as we could now reprocess specific date ranges without affecting current production loads.

Issue 2: Databricks job performance degradation
Problem:

Over a 3-month period, Databricks transformation jobs for large tables (10M+ rows) were taking progressively longer to complete. A job that initially took 15 minutes started taking 45-60 minutes, causing SLA violations and increased compute costs. Some jobs were timing out after 2 hours, requiring manual intervention and reprocessing. The performance degradation was most noticeable for tables with frequent incremental updates.

Root Cause Analysis:

Performance profiling using Databricks Spark UI revealed several bottlenecks:

1. Small File Problem: ADLS Gen2 contained thousands of small files (1-5MB each) instead of optimally sized files (128-256MB). Each incremental run was writing new small files, and over time, the number of files grew exponentially. Reading these files required excessive I/O operations—a single read operation was opening 500+ files instead of 10-20 optimally sized files. This increased read time by 10x.

2. Partitioning Issues: Some tables weren't partitioned at all, while others were partitioned by columns that didn't align with query patterns. For example, a table partitioned by `customer_id` was being queried by `order_date`, causing full table scans. Additionally, incremental merge operations were reading entire partitions even when only a small subset of data changed.

3. Shuffle Operations: Merge operations were causing massive shuffles across the cluster. When merging incremental data with existing data, Spark was redistributing data across all partitions, causing network I/O bottlenecks. For a 50GB table, this resulted in 200GB+ of network traffic per run.

4. Cluster Underutilization: Despite having 8-worker clusters, CPU utilization was only 30-40% because tasks were waiting on I/O and shuffle operations. Workers were idle while waiting for data to be read or shuffled.

Solution Implementation:

We implemented a comprehensive performance optimization strategy:

1. File Compaction Logic: Created a reusable PySpark function that runs after each incremental write. The function identifies files smaller than 128MB, groups them by partition, and uses `coalesce()` or `repartition()` to merge them into optimal sizes (128-256MB). This runs as a separate, low-priority job after the main transformation completes. For tables with 1000+ small files, this reduced file count to 50-100 files, cutting read time by 80%. A simplified compaction sketch follows this list.

2. Metadata-Driven Partitioning Strategy: Enhanced metadata to include `partition_columns` and `partition_strategy` fields. Partition columns are typically date-based (e.g., `load_date`, `order_date`) or business keys (e.g., `region_id`). During writes, we partition data using `partitionBy()` with these columns. For incremental loads, we added partition pruning logic that only reads partitions where `partition_column >= last_watermark`, dramatically reducing I/O for date-partitioned tables.

3. Pre-partitioning for Merges: Before merge operations, we pre-partition both source and target DataFrames by the same key (primary key or business key) using `repartition()`. This ensures that when Spark performs the join, matching records are on the same partition, eliminating cross-partition shuffles. For a 50GB merge operation, this reduced network traffic from 200GB to 5GB.

4. Broadcast Joins for Lookups: For small lookup tables (< 100MB), we use broadcast joins. We identify these in metadata with an `is_lookup` flag and automatically apply `broadcast()` hints. This avoids shuffles entirely for dimension table joins.

5. Dynamic Cluster Sizing: Metadata now includes `estimated_table_size` and `complexity_score`. Based on these, we dynamically size Databricks clusters—small tables (1-5GB) use 2-worker clusters, medium (5-50GB) use 4-6 workers, and large (50GB+) use 8+ workers. This optimizes cost while ensuring adequate resources.
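
A hedged PySpark sketch of the compaction and partitioned-write ideas from items 1 and 2. The paths and parameters are illustrative; in the framework they would come from the metadata repository.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # ambient `spark` in a Databricks notebook

def compact_partitioned_table(input_path, output_path, partition_columns):
    """Rewrite many small Parquet files into larger ones, laid out for partition pruning."""
    df = spark.read.parquet(input_path)

    # Hash-repartition on the partition columns so each partition folder is
    # written by a single task, i.e. one larger file per folder instead of
    # hundreds of small incremental files.
    compacted = df.repartition(*partition_columns)

    # partitionBy() enables partition pruning for incremental reads
    # (e.g. only folders with load_date >= last watermark are scanned).
    (compacted.write
        .mode("overwrite")
        .partitionBy(*partition_columns)
        .parquet(output_path))
```

On Delta tables, a similar effect can often be achieved with Delta Lake's built-in `OPTIMIZE` command; the sketch above shows a plain-Parquet variant.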

Impact & Results:

Job execution time reduced by 70% on average. A 60-minute job now completes in 18 minutes. Compute costs decreased by 45% due to reduced cluster time and better resource utilization. File compaction reduced storage costs by 20% (fewer files = less metadata overhead). The optimization framework is now applied to all 200+ tables, ensuring consistent performance as data volumes grow.

Issue 3: ADF pipeline failures due to dependency conflicts
Problem:

ADF pipelines were experiencing intermittent failures, particularly for tables with foreign key relationships. For example, when processing the `order_items` table, the pipeline would sometimes fail because it tried to merge data before the parent `orders` table was fully processed. This caused referential integrity violations and partial data loads. The failures were unpredictable—sometimes dependent tables would succeed even when parent tables failed, leading to inconsistent data states that required manual cleanup.

Root Cause Analysis:

After analyzing ADF pipeline execution logs and dependency graphs, we identified the core issue:

1. Missing Dependency Metadata: The original metadata repository didn't capture table dependencies. ADF was reading all tables for a source system and triggering Databricks jobs in parallel without understanding parent-child relationships. For example, `order_items` depends on `orders`, but both jobs would start simultaneously, and `order_items` could complete before `orders`, causing foreign key constraint violations.

2. No Dependency Validation: There was no mechanism to check if a parent table's job completed successfully before starting a dependent table's job. ADF's built-in dependency features weren't being utilized because we were dynamically generating job triggers based on metadata.

3. Retry Logic Issues: When a parent table job failed, dependent tables would still execute (or retry) without waiting for the parent to succeed. This created cascading failures and data inconsistency. The retry mechanism was table-specific and didn't consider dependency chains.

4. Parallel Execution Limits: ADF was attempting to run 50+ jobs in parallel, overwhelming the Databricks workspace and causing some jobs to fail due to resource contention. There was no throttling mechanism.

Solution Implementation:

We redesigned the orchestration layer to handle dependencies properly:

1. Dependency Metadata Table: Created a new `table_dependencies` table in the metadata repository with columns: `table_name`, `depends_on_table`, `dependency_type` (FK, business logic, etc.), and `is_required` (boolean). This captures all parent-child relationships. For example: `order_items` depends on `orders`, `order_items` depends on `products`, etc.

2. Dependency Graph Construction: Modified the ADF pipeline to build a dependency graph using a recursive CTE query against the dependency table. The graph is topologically sorted to determine execution order. Tables with no dependencies are level 0, tables depending only on level 0 are level 1, and so on. A simplified sketch of this level assignment follows this list.

3. Sequential Execution by Level: ADF now executes tables level by level. All level 0 tables run in parallel (with throttling—max 10 concurrent jobs). Once all level 0 jobs complete successfully, level 1 jobs start. This ensures parent tables are always processed before children. We use ADF's `Wait` activity and conditional logic to check job status before proceeding.

4. Status Checking Before Triggering: Before triggering a dependent table's Databricks job, ADF queries the audit table to verify the parent table's status is "SUCCESS". If the parent failed or is still running, the dependent job waits (with a configurable timeout). This prevents orphaned records and referential integrity issues.

5. Retry with Dependency Awareness: Enhanced retry logic to be dependency-aware. If a parent table fails after retries, all dependent tables are marked as "BLOCKED" in the audit table with a reason code. Administrators can manually fix the parent issue and then trigger a "reprocess blocked" pipeline that only runs the blocked dependent tables. This prevents cascading failures.

6. Parallel Execution Throttling: Added a configurable `max_parallel_jobs` parameter (default: 10) to prevent overwhelming Databricks. ADF uses a queue mechanism—jobs wait in queue if the limit is reached, ensuring stable execution even with 100+ tables.
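
A minimal Python sketch of the level-based ordering from points 2 and 3. The `(table, parent)` pairs stand in for rows of the `table_dependencies` metadata table; the function name is hypothetical.

```python
from collections import defaultdict

def build_execution_levels(dependencies, all_tables):
    """Group tables into levels so that every table runs after its parents.

    `dependencies` is a list of (table_name, depends_on_table) pairs;
    `all_tables` lists every table for the source system, including
    tables with no dependencies.
    """
    parents = defaultdict(set)          # table -> set of tables it depends on
    for table, parent in dependencies:
        parents[table].add(parent)

    levels, placed, remaining = [], set(), set(all_tables)
    while remaining:
        # A table is ready when all of its parents have already been placed.
        ready = {t for t in remaining if parents[t] <= placed}
        if not ready:
            raise ValueError(f"Circular dependency among: {sorted(remaining)}")
        levels.append(sorted(ready))    # level 0, level 1, ...
        placed |= ready
        remaining -= ready
    return levels

# Example: order_items depends on orders and products.
deps = [("order_items", "orders"), ("order_items", "products")]
print(build_execution_levels(deps, ["orders", "products", "order_items"]))
# [['orders', 'products'], ['order_items']]
```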

Impact & Results:

Pipeline failure rate dropped from 15% to <1%. Data consistency improved dramatically—no more orphaned records or referential integrity violations. The dependency framework now handles 200+ tables with complex dependency chains (some tables have 5+ dependencies). The "blocked table" feature reduced manual intervention time by 90%, as administrators can quickly identify and fix root cause issues without manually cleaning up dependent table data.

Project-Related Interview Q&A: Interview-Ready Answers
Q1: Walk me through AzureMX from a high-level architecture perspective.

AzureMX is a metadata-driven data migration framework built entirely on Azure. The architecture has four main layers:

1. Metadata Layer: Central SQL Server repository storing source definitions, load types, watermark columns, primary keys, target mappings, and audit configurations. This is the single source of truth. (An illustrative metadata row appears after this list.)

2. Orchestration Layer: Azure Data Factory reads metadata, determines execution order based on dependencies, and dynamically triggers Databricks notebooks with appropriate parameters. ADF handles retries, error handling, and parallel execution where possible.

3. Processing Layer: Azure Databricks with PySpark performs all transformations—schema alignment, data type standardization, deduplication, and merge logic. The same reusable functions work for all tables, parameterized by metadata.

4. Storage Layer: ADLS Gen2 organized in a metadata-driven folder structure (source/schema/table/load_date). Raw data lands here first, then transformed data goes to curated zones or Synapse Analytics for analytics workloads.
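
To make the metadata layer concrete, here is an illustrative (not actual) shape for one configuration row; the field names are assumptions based on the attributes listed above.

```python
# Hypothetical shape of one row in the metadata repository (illustrative only).
table_config = {
    "source_system": "SQL_SERVER_ERP",
    "source_table": "dbo.orders",
    "load_type": "incremental",             # full | incremental | snapshot
    "watermark_column": "last_updated_ts",
    "primary_keys": ["order_id"],
    "target_path": "abfss://curated@datalake.dfs.core.windows.net/erp/orders",
    "partition_columns": ["order_date"],
    "audit_enabled": True,
}
```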

Q2: Why did you choose a metadata-driven approach instead of traditional ETL pipelines?

Traditional ETL meant writing separate pipelines for each table or source system. With hundreds of tables across multiple enterprise systems, this wasn't scalable. Every new table required code changes, testing, and deployment cycles taking weeks.

The metadata-driven approach eliminates this. Behavior is controlled through configuration, not code. To onboard a new table, we insert a row in the metadata table—no code deployment needed. This reduced our development effort by 70% and cut onboarding time from weeks to hours.

More importantly, it ensures consistency. All tables follow the same transformation patterns, error handling, and audit mechanisms. When we need to update logic (like adding a new validation rule), we change it once in the reusable functions, and all tables benefit automatically. This is the power of metadata-driven engineering.

Q3: Explain how you handle different load types (full, incremental, snapshot) in a single framework.

The load type is defined in metadata, and the framework adapts its behavior accordingly:

Full Load: ADF extracts all data from source, Databricks overwrites the target dataset completely. Used for small reference tables or when complete refresh is needed. Simple and fast for small datasets.

Incremental Load: Most common pattern. Metadata defines a watermark column (like last_updated_ts). ADF reads the last successful watermark from control tables, Databricks queries source for records where watermark > last_watermark, then merges new/updated records based on primary keys. After success, watermark is updated. This handles high-volume, frequently changing tables efficiently.

Snapshot Load: Extracts point-in-time data and appends with effective dates. Metadata defines snapshot frequency and effective date columns. Useful for audit trails or historical state preservation. Each snapshot is preserved, allowing time-travel queries.

The key is that the same PySpark transformation functions handle all three types—they just receive different parameters from metadata (load_type, watermark_column, merge_strategy, etc.).
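
A hedged sketch of how a single parameterized entry point can branch on the metadata-driven load type. The function name and simplified branch logic are assumptions; in particular, the incremental branch would merge on primary keys rather than simply appending.

```python
from pyspark.sql import functions as F

def run_load(source_df, target_path, meta, last_watermark=None):
    """Branch on the metadata-driven load type (simplified, illustrative logic)."""
    load_type = meta["load_type"]

    if load_type == "full":
        # Overwrite the target completely (small reference tables).
        source_df.write.mode("overwrite").parquet(target_path)

    elif load_type == "incremental":
        # Keep only rows changed since the last watermark; the real framework
        # then merges these into the target on the configured primary keys.
        changed = source_df.filter(F.col(meta["watermark_column"]) > F.lit(last_watermark))
        changed.write.mode("append").parquet(target_path)

    elif load_type == "snapshot":
        # Append a point-in-time copy tagged with an effective date.
        (source_df.withColumn("effective_date", F.current_date())
                  .write.mode("append")
                  .partitionBy("effective_date")
                  .parquet(target_path))

    else:
        raise ValueError(f"Unknown load_type: {load_type}")
```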

Q4: How does Azure Data Factory orchestrate hundreds of tables? Walk me through the execution flow.

ADF uses a metadata-driven pipeline pattern. Here's the execution flow:

1. Metadata Read: ADF pipeline starts by querying the metadata repository to get all tables for a source system, along with their load types, dependencies, and configurations.

2. Dependency Resolution: ADF builds a dependency graph from metadata. If Table B depends on Table A, ADF ensures Table A completes successfully before triggering Table B. This is critical for maintaining data integrity.

3. Parallel Execution: Tables without dependencies run in parallel. ADF triggers multiple Databricks notebooks simultaneously, each with table-specific parameters (source path, target path, watermark value, merge strategy, etc.).

4. Parameter Passing: Each Databricks notebook receives parameters like source_table, target_path, load_type, watermark_column, primary_keys. The notebook reads these and executes the appropriate transformation logic. (A sketch of this parameter handoff appears after the list.)

5. Status Tracking: After each job, ADF updates audit tables with status, row counts, and timestamps. If a job fails, ADF can retry based on retry policies, and dependent jobs wait until parent succeeds.
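
A small sketch of the parameter handoff in step 4, as it would typically look inside a Databricks notebook (where `dbutils` is available implicitly). The widget names mirror the parameters mentioned above and are assumptions about the exact implementation.

```python
# ADF passes these as base parameters when it triggers the notebook activity.
dbutils.widgets.text("source_table", "")
dbutils.widgets.text("target_path", "")
dbutils.widgets.text("load_type", "incremental")
dbutils.widgets.text("watermark_column", "last_updated_ts")
dbutils.widgets.text("primary_keys", "")        # e.g. "order_id,region_id"

params = {
    "source_table": dbutils.widgets.get("source_table"),
    "target_path": dbutils.widgets.get("target_path"),
    "load_type": dbutils.widgets.get("load_type"),
    "watermark_column": dbutils.widgets.get("watermark_column"),
    "primary_keys": [k for k in dbutils.widgets.get("primary_keys").split(",") if k],
}
# The notebook then routes `params` into the shared transformation functions.
```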

Q5: Describe the PySpark transformation logic. How do you handle schema differences and data quality?

The PySpark transformations are reusable functions parameterized by metadata:

Schema Alignment: Metadata defines source column names and their mappings to target columns. PySpark dynamically renames columns, handles case sensitivity differences, and maps source data types to target data types (e.g., VARCHAR to STRING, DATETIME to TIMESTAMP).

Data Type Standardization: We have transformation functions that convert data types consistently. For example, all dates are normalized to UTC timestamps, numeric types are standardized (INT vs BIGINT based on metadata), and string types are trimmed and validated.

Deduplication: Based on primary keys defined in metadata, we identify duplicates using window functions partitioned by business keys. For incremental loads, we keep the latest record; for full loads, we remove duplicates before writing.
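
A hedged sketch of that window-function deduplication, keeping the most recent row per key; the helper name and default column are illustrative.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def deduplicate_latest(df, primary_keys, watermark_column="last_updated_ts"):
    """Keep only the most recent row per business key (illustrative)."""
    w = Window.partitionBy(*primary_keys).orderBy(F.col(watermark_column).desc())
    return (df.withColumn("_rn", F.row_number().over(w))
              .filter(F.col("_rn") == 1)
              .drop("_rn"))
```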

Data Quality Checks: Validation rules are defined in metadata (nullable columns, value ranges, format checks). PySpark applies these validations and flags records that fail. Failed records are written to a separate error table with error details, while valid records proceed to the target. This ensures data quality without stopping the entire load.

Q6: How did you optimize Databricks performance for large-scale data processing?

We implemented multiple optimization strategies:

File Compaction: Small files in ADLS cause excessive I/O. We implemented a compaction step that merges files into optimal sizes (128MB-256MB) using coalesce() or repartition(). This reduced read operations by 80% for some tables.

Partitioning Strategy: Metadata defines partition columns (typically date or business keys). We partition data during writes, which dramatically improves query performance. For incremental loads, we only process partitions that changed, not the entire dataset.

Broadcast Joins: For small lookup tables, we use broadcast joins to avoid shuffles. We identify these in metadata and apply broadcast hints automatically.
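
A short sketch of the broadcast-join pattern; the tiny DataFrames below are placeholders standing in for a large fact table and a small lookup.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data: a "large" fact table and a small region lookup.
facts = spark.createDataFrame([(1, "R1", 250.0), (2, "R2", 90.0)],
                              ["order_id", "region_id", "amount"])
dim_region = spark.createDataFrame([("R1", "East"), ("R2", "West")],
                                   ["region_id", "region_name"])

# broadcast() ships the small lookup to every executor, so the join avoids a
# shuffle; in the framework, tables flagged `is_lookup` in metadata get this hint.
enriched = facts.join(F.broadcast(dim_region), on="region_id", how="left")
```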

Pre-partitioning for Merges: Before merge operations, we pre-partition both source and target data by the same key (primary key or business key). This ensures joins happen locally without shuffles, reducing network I/O significantly.

Cluster Sizing: Metadata includes table size estimates. We dynamically size Databricks clusters based on this—small tables use smaller clusters, large tables get more workers. This optimizes cost and performance.

Q7: Explain your error handling and retry strategy. How do you ensure data consistency?

Error handling operates at multiple levels:

Orchestration Level (ADF): ADF has configurable retry policies with exponential backoff. If a Databricks job fails, ADF retries up to 3 times with increasing delays. We also implemented dependency checking—if a parent table fails, dependent tables don't run, preventing partial loads that could cause data inconsistency.

Transformation Level (Databricks): Before processing, we validate watermark consistency (ensuring source max timestamp > last watermark), schema alignment, and data quality. If validation fails, we log detailed errors to audit tables but don't fail the entire job—we write valid records and flag invalid ones separately.

Atomic Operations: For incremental merges, we use transactional writes. We read the target, merge with new data, validate row counts, and only commit if validation passes. If anything fails, we roll back to the previous state. This ensures data consistency even if a job fails mid-execution.
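
One common way to get the transactional merge-and-rollback behavior described above is a Delta Lake `MERGE`; this is a hedged sketch assuming the curated tables are Delta tables (Delta Lake is mentioned in the profile), not a claim about the exact production mechanism.

```python
from delta.tables import DeltaTable  # available on Databricks runtimes with Delta Lake

def merge_incremental(spark, updates_df, target_path, key_cols):
    """Transactionally upsert incremental rows into a Delta target (illustrative)."""
    condition = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
    target = DeltaTable.forPath(spark, target_path)

    # The MERGE is atomic: it either fully commits or leaves the previous table
    # version intact, which provides the rollback behavior described above.
    (target.alias("t")
           .merge(updates_df.alias("s"), condition)
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
```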

Audit Trail: Every error is logged with context—which table, which date range, error message, row counts before/after. This allows us to identify patterns, fix root causes, and reprocess specific date ranges without affecting other loads.

Q8: How do you handle backdated processing and historical data reprocessing?

Backdated processing was a critical requirement. Here's how we handle it:

Watermark Override: Metadata allows manual watermark overrides. If we need to reprocess data from a specific date, we update the watermark value in metadata to that date. The next run will extract all data from that date forward, effectively reprocessing historical data.

Audit Tracking: When backdated processing runs, audit tables clearly mark it as a "reprocess" run with the original date range and the reprocess timestamp. This prevents confusion about which data is current vs. reprocessed.

Isolation: Backdated processing doesn't affect current production loads. Current loads continue using their normal watermark values, while backdated runs use overridden values. Both can run in parallel if needed, writing to the same target but with different effective dates or partitions.

Data Integrity: For incremental loads, we handle backdated updates carefully. If a record was updated in the past, we merge it correctly based on effective dates, ensuring the final state reflects all changes chronologically. This is especially important for snapshot loads where we preserve historical states.

Q9: Walk me through your audit framework. What metrics do you track and why?

The audit framework is comprehensive and mandatory for every run:

Row Count Metrics: Source row count (how many records extracted), target row count before (existing records), target row count after (final state), and difference (inserts/updates/deletes). This validates data completeness and identifies data loss or duplication issues.

Timing Metrics: Load start timestamp, end timestamp, and duration. This helps identify performance degradation, bottlenecks, and plan capacity. We also track time per million rows for normalization.

Status Tracking: SUCCESS, FAILURE, or PARTIAL_SUCCESS with detailed error messages. For partial success, we track which partitions or date ranges succeeded vs. failed, allowing targeted reprocessing.

Data Quality Metrics: Number of records that failed validation, number of duplicates found, schema mismatches detected. This helps identify data quality issues at the source.

Reprocessing Flags: Clear indicators for backdated processing—original date range, reprocess timestamp, and reason for reprocessing. This maintains a clear audit trail for compliance and troubleshooting.

All metrics are stored in SQL Server audit tables, queryable for reporting, monitoring dashboards, and compliance requirements.
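
For illustration, one audit record covering the metrics above might look like the following; the field names and values are assumptions, and the row would be written to the SQL Server audit table (for example via JDBC).

```python
from datetime import datetime, timezone

# Hypothetical audit record for a single run (all names and values illustrative).
audit_record = {
    "table_name": "dbo.orders",
    "status": "SUCCESS",                  # SUCCESS | FAILURE | PARTIAL_SUCCESS
    "source_row_count": 125_000,
    "target_rows_before": 8_400_000,
    "target_rows_after": 8_525_000,
    "failed_validation_count": 0,
    "is_reprocess": False,
    "load_start_utc": datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
    "load_end_utc": datetime.now(timezone.utc),
}
```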

Q10: How does AzureMX handle schema evolution and changing source structures?

Schema evolution is handled through metadata updates and flexible PySpark logic:

Metadata-Driven Schema Mapping: When source schema changes, we update the metadata table with new column mappings. The PySpark logic reads these mappings dynamically, so no code changes are needed. New columns are added to the target, removed columns are handled gracefully (either dropped or preserved based on metadata flags).

Schema Validation: Before processing, we validate that source schema matches metadata expectations. If there's a mismatch (new column not in metadata, or expected column missing), we log a warning but don't fail the job. This allows us to detect schema changes and update metadata accordingly.

Backward Compatibility: For incremental loads, we handle schema changes carefully. If a new column appears, we add it to existing records with null values. If a column is removed, we preserve it in target with nulls or default values, ensuring downstream systems aren't broken.
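
A hedged PySpark sketch of that backward-compatible alignment: missing target columns are added as nulls and the output is projected onto the expected column list. The helper name and the shape of `expected_columns` are assumptions.

```python
from pyspark.sql import functions as F

def align_to_target_schema(df, expected_columns):
    """Project `df` onto the expected target schema (illustrative).

    `expected_columns` maps column name -> Spark SQL type string, e.g.
    {"order_id": "bigint", "order_date": "date", "discount_pct": "double"}.
    """
    for name, dtype in expected_columns.items():
        if name in df.columns:
            # Cast existing columns to the target type from metadata.
            df = df.withColumn(name, F.col(name).cast(dtype))
        else:
            # New target column not yet present in the source: fill with nulls.
            df = df.withColumn(name, F.lit(None).cast(dtype))

    # Keep only (and order by) the expected columns; source-only columns are
    # dropped or preserved elsewhere depending on metadata flags.
    return df.select(*expected_columns.keys())
```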

Versioning: Metadata includes schema version numbers. When schema changes, we increment the version, allowing us to track which runs used which schema version. This is critical for troubleshooting and understanding data lineage.

Q11: Explain how incremental loads work with watermarks. What edge cases did you handle?

Incremental loads use watermark columns (typically last_updated_ts) defined in metadata. The process is:

Standard Flow: ADF reads last successful watermark from control tables, Databricks queries source for records where watermark > last_watermark, merges new/updated records, then updates watermark to source max value.

Edge Cases Handled:
Timezone Differences: Source and target may be in different timezones. We normalize all timestamps to UTC before comparison, ensuring no records are missed due to timezone mismatches.

Clock Skew: Source system clocks may be slightly off. We add a small buffer (e.g., 1 minute) when querying, ensuring we don't miss records that were updated just before the watermark timestamp.

Null Watermarks: Some records may have null watermark values. We handle these separately—either exclude them, use a default date, or flag them for manual review based on metadata configuration.

Watermark Column Updates: If the watermark column itself is updated (e.g., a record is backdated), we ensure we still capture it. We use ">=" instead of ">" and handle duplicates in merge logic.

Large Gaps: If there's a large gap between last watermark and current time (e.g., system was down for days), we can process in chunks to avoid memory issues, or metadata can flag it for full reload instead.

Q12: What was the business impact and ROI of AzureMX? How do you measure success?

The business impact was significant and measurable:

Development Efficiency: Reduced development effort by 70%. What used to take 2-3 weeks to onboard a new table (requirements, design, coding, testing, deployment) now takes 2-3 hours (just adding metadata rows). We onboarded 200+ tables in 6 months vs. the estimated 2+ years with a traditional approach.

Operational Cost Reduction: Single framework means single set of operational procedures, monitoring, and support. Instead of maintaining hundreds of individual pipelines, we maintain one framework. This reduced operational overhead by 60% and freed up team capacity for new projects.

Time to Market: New data sources can be onboarded in hours, enabling faster analytics and business insights. Business teams can request new data sources and have them available within days, not months.

Data Quality: Consistent transformation logic and validation rules across all tables improved data quality. Audit framework provides transparency, building trust with business users.

Scalability: Framework handles hundreds of tables without performance degradation. As business grows, we can add more tables without proportional increase in development or operational costs.

Success Metrics: We track tables onboarded per month, average onboarding time, job success rate (target: >99%), data freshness (time from source update to analytics availability), and developer productivity (tables per developer per month). All metrics showed dramatic improvements.
