Professional Summary
- Databricks Engineer with 3+ years of experience in building and supporting cloud-based data platforms.
- Strong hands-on expertise in Azure Databricks, PySpark, and Spark SQL for large-scale data processing.
- Experienced in designing and implementing ETL/ELT pipelines using Azure Databricks and Azure Data Factory.
- Proven ability to transform raw, structured, and semi-structured data into analytics-ready datasets.
- Hands-on experience with Delta Lake concepts, including optimized storage and reliable data handling.
- Skilled in developing metadata-driven data pipelines for full, incremental, and snapshot data loads.
- Experience working with Azure Data Lake Storage Gen2 and Azure Blob Storage as primary data sources.
- Strong understanding of data warehousing concepts, including fact and dimension modeling.
- Proficient in SQL and PySpark for complex data transformations and aggregations.
- Experienced in performance tuning through partitioning, efficient joins, and Spark optimizations.
- Comfortable working in Agile delivery environments, collaborating with cross-functional teams.
- Strong analytical and problem-solving skills with a focus on data quality and reliability.
Work Experience
Technical Skills
Projects
AzureMX (Maximum Acceleration) is a 100% Azure-native, metadata-driven data migration and modernization framework designed to accelerate enterprise data estate transformation. The framework automates full, incremental, and snapshot data loads from multiple source systems into Azure storage and analytics platforms using Databricks and Azure Data Factory.
- Designed Databricks transformation logic using PySpark for full, incremental, and snapshot loads.
- Processed structured and semi-structured data (CSV, Parquet, TXT) in Azure Databricks.
- Integrated Databricks pipelines with Azure Data Factory for orchestration and dependency handling.
- Implemented metadata-driven processing to avoid building a separate pipeline for every table and source.
- Supported backdated data processing and auditing requirements.
- Optimized Databricks jobs for performance and scalability.
- Participated in Agile ceremonies and provided regular delivery updates.
HPI TeraRun is a large-scale enterprise analytics transformation program aimed at building a unified big data and analytics platform. The solution processes manufacturing, finance, and supply chain data to enable enterprise-wide reporting and analytics using Azure Databricks and cloud-native data services.
- Supported Databricks-based data pipelines processing manufacturing and enterprise datasets.
- Performed data transformations using PySpark and Spark SQL in Azure Databricks.
- Ingested data into ADLS Gen2 using Azure Data Factory for downstream Databricks processing.
- Debugged and resolved pipeline failures by analyzing Databricks jobs and logs.
- Optimized Spark transformations to improve batch processing performance.
- Collaborated with business and support teams to resolve incidents and data issues.
- Developed Azure Data Factory pipelines to load data from SFTP sources into Azure Blob Storage.
- Performed data migration from on-premises systems to Azure SQL Database.
- Loaded dimension and fact tables into Azure Synapse Data Warehouse.
- Used Power Query for data transformation based on reporting requirements.
- Created and deployed tabular models and analytical cubes.
- Designed Power BI dashboards using charts, KPIs, slicers, maps, and tree maps.
- Implemented scheduled refresh using on-premises data gateway.
- Built incident management dashboards with drill-down capabilities.
- Managed report access, security roles, and user-level permissions.
Sources
AzureMX supports multiple enterprise source systems without code changes.
"So, AzureMX—which stands for Maximum Acceleration—was a metadata-driven data migration framework I built entirely on Azure. The main problem we were solving was that every time we had a new source system or a new table to migrate, we were writing custom pipelines from scratch. That was taking too long and wasn't scalable.
What I did instead was create a central metadata repository—basically configuration tables in SQL Server—that would drive all the ingestion, transformation, and orchestration behavior. So now, whether it's SQL Server, legacy databases, or flat files, the same framework handles it. Same for load types—full loads, incremental loads, snapshot loads—all controlled through metadata. And we can send data to ADLS curated zones, Synapse Analytics, Power BI—all without changing any code. To onboard a new table? Just add a row to the metadata table. That's it.
The impact was significant—we reduced development effort by about 70% and were able to support hundreds of tables across multiple enterprise systems. Azure Data Factory reads the metadata and orchestrates everything, Databricks does the PySpark transformations based on the config, and we built a robust auditing framework that tracks everything. It really demonstrates how you can build reusable, scalable frameworks that eliminate repetitive work while maintaining production-grade quality."
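To make the metadata-driven flow above concrete, here is a minimal PySpark sketch of the dispatch pattern. The configuration rows, column names, and paths are illustrative assumptions rather than the framework's actual schema, and the metadata is hard-coded instead of being read from the SQL Server control tables so the example stays self-contained.

```python
# Minimal sketch of the metadata-driven dispatch. Names and paths are
# illustrative; the real framework reads this configuration from SQL Server
# control tables via ADF and Databricks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # supplied automatically on Databricks

# Hypothetical configuration rows standing in for the metadata repository.
metadata_rows = [
    {"table_name": "products", "load_type": "full",
     "source_path": "abfss://raw@datalake.dfs.core.windows.net/products/",
     "target_path": "abfss://curated@datalake.dfs.core.windows.net/products/",
     "watermark_column": None},
    {"table_name": "orders", "load_type": "incremental",
     "source_path": "abfss://raw@datalake.dfs.core.windows.net/orders/",
     "target_path": "abfss://curated@datalake.dfs.core.windows.net/orders/",
     "watermark_column": "last_updated_ts"},
]

def run_full_load(cfg):
    """Overwrite the curated copy with the full source extract."""
    df = spark.read.parquet(cfg["source_path"])
    df.write.mode("overwrite").parquet(cfg["target_path"])

def run_incremental_load(cfg, last_watermark):
    """Append only rows newer than the last successful watermark."""
    df = spark.read.parquet(cfg["source_path"])
    delta = df.filter(df[cfg["watermark_column"]] > last_watermark)
    delta.write.mode("append").parquet(cfg["target_path"])

# One generic loop handles every table; behavior is driven purely by config.
for cfg in metadata_rows:
    if cfg["load_type"] == "full":
        run_full_load(cfg)
    elif cfg["load_type"] == "incremental":
        run_incremental_load(cfg, last_watermark="2024-01-01 00:00:00")
```

Onboarding a new table then amounts to adding one more configuration row; the dispatch loop and load functions stay untouched.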
Issue 1: Incremental loads missing records
During production runs, we noticed that some records with recent update timestamps weren't being captured in incremental loads. Business users reported missing data in analytics dashboards, particularly for tables with high update frequency. The issue was intermittent—some runs would capture all records, while others would miss 5-15% of expected updates. This created data quality concerns and required manual data reconciliation.
After investigating audit logs and comparing source vs. target row counts, we identified three root causes:
1. Watermark Column Inconsistency: The metadata table was configured to use `last_updated_ts` as the watermark column, but some source systems had triggers that updated this column inconsistently. In some cases, bulk updates would modify the data but not update the timestamp column, causing those records to be skipped.
2. Timezone Mismatches: Source systems were in different timezones (EST, PST, UTC), while our watermark comparison logic was using local server time. When comparing timestamps, records updated just before midnight in one timezone would be missed if the comparison happened in another timezone.
3. Clock Skew: Some legacy source systems had clock drift, causing timestamps to be slightly behind actual update times. Our strict "greater than" comparison (`watermark > last_watermark`) missed records that were updated milliseconds before the watermark timestamp.
We implemented a multi-layered solution:
1. Watermark Validation Logic: Added a pre-processing step in Databricks that compares the source system's maximum timestamp with the last successful watermark stored in control tables. If the difference exceeds a threshold (configurable in metadata, default 1 hour), the job logs a warning and can optionally trigger a full reload instead of incremental.
2. Timezone Normalization: Implemented a timezone conversion function that normalizes all timestamps to UTC before comparison. The function reads the source system's timezone from metadata and converts accordingly. This ensures consistent comparisons regardless of source system location.
3. Buffer Window for Clock Skew: Modified the watermark query to use a small buffer window (configurable, default 1 minute). Instead of `watermark > last_watermark`, we use `watermark >= (last_watermark - buffer_minutes)`. This captures records that might have been updated just before the watermark due to clock differences. We handle potential duplicates in the merge logic using primary keys; this watermark handling (items 1-3) is sketched after the list.
4. Backdated Processing Feature: Added a metadata flag `allow_watermark_override` and a manual override column. When reprocessing is needed, administrators can update the watermark value in metadata, and the next run will extract all data from that date forward. This is tracked in audit tables with a "reprocess" flag to distinguish from normal runs.
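Below is a minimal PySpark sketch of items 1-3: UTC normalization, a drift check against the stored watermark, and the buffered cutoff. The column names, default thresholds, and function signature are illustrative assumptions, not the framework's actual interfaces.

```python
# Sketch of the watermark handling (items 1-3). Names and defaults are
# illustrative; in the real framework they come from the metadata tables.
from datetime import timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def read_incremental(source_df, watermark_col, source_tz, last_watermark,
                     buffer_minutes=1, drift_threshold_hours=1):
    """Return rows newer than the last watermark, with timezone normalization,
    a clock-skew buffer, and a basic drift warning."""
    # Item 2: normalize source timestamps to UTC before any comparison.
    df = source_df.withColumn(
        "_wm_utc", F.to_utc_timestamp(F.col(watermark_col), source_tz))

    # Item 1: warn (or optionally fall back to a full reload) when the source's
    # maximum timestamp lags the stored watermark by more than the threshold.
    max_source_ts = df.agg(F.max("_wm_utc")).first()[0]
    if max_source_ts is not None and \
            last_watermark - max_source_ts > timedelta(hours=drift_threshold_hours):
        print(f"WARNING: source max timestamp {max_source_ts} lags watermark "
              f"{last_watermark}; consider triggering a full reload")

    # Item 3: use >= (last_watermark - buffer) instead of a strict > so rows
    # written just before the watermark (clock skew) are still captured.
    # Duplicates are resolved later by the primary-key merge.
    cutoff = last_watermark - timedelta(minutes=buffer_minutes)
    return df.filter(F.col("_wm_utc") >= F.lit(cutoff)).drop("_wm_utc")
```

In the framework itself, `last_watermark`, the buffer, and the drift threshold are read from the control and metadata tables rather than passed in by hand.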
After implementing these fixes, data completeness improved from 85-90% to 99.8%. The validation logic caught 12 instances where source systems had timestamp inconsistencies, allowing us to proactively fix those issues. The backdated processing feature reduced manual data reconciliation time by 80%, as we could now reprocess specific date ranges without affecting current production loads.
Issue 2: Databricks job performance degradation
Over a 3-month period, Databricks transformation jobs for large tables (10M+ rows) were taking progressively longer to complete. A job that initially took 15 minutes started taking 45-60 minutes, causing SLA violations and increased compute costs. Some jobs were timing out after 2 hours, requiring manual intervention and reprocessing. The performance degradation was most noticeable for tables with frequent incremental updates.
Performance profiling using Databricks Spark UI revealed several bottlenecks:
1. Small File Problem: ADLS Gen2 contained thousands of small files (1-5MB each) instead of optimally sized files (128-256MB). Each incremental run wrote new small files, and the file count kept growing run after run. Reading these files required excessive I/O operations: a single read was opening 500+ files instead of 10-20 optimally sized ones, increasing read time by 10x (a quick way to quantify this is sketched after this list).
2. Partitioning Issues: Some tables weren't partitioned at all, while others were partitioned by columns that didn't align with query patterns. For example, a table partitioned by `customer_id` was being queried by `order_date`, causing full table scans. Additionally, incremental merge operations were reading entire partitions even when only a small subset of data changed.
3. Shuffle Operations: Merge operations were causing massive shuffles across the cluster. When merging incremental data with existing data, Spark was redistributing data across all partitions, causing network I/O bottlenecks. For a 50GB table, this resulted in 200GB+ of network traffic per run.
4. Cluster Underutilization: Despite having 8-worker clusters, CPU utilization was only 30-40% because tasks were waiting on I/O and shuffle operations. Workers were idle while waiting for data to be read or shuffled.
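As a quick, illustrative diagnostic for the small-file problem (not part of the framework itself), file counts and sizes can be inspected from a Databricks notebook. The path is a placeholder and `dbutils` is only available inside Databricks.

```python
# Rough small-file check for one table (or one partition folder); dbutils.fs.ls
# is not recursive, so point it at the folder that actually holds the files.
path = "abfss://curated@datalake.dfs.core.windows.net/orders/"  # placeholder

files = [f for f in dbutils.fs.ls(path) if f.name.endswith(".parquet")]
total_mb = sum(f.size for f in files) / (1024 * 1024)
print(f"{len(files)} files, {total_mb:.0f} MB total, "
      f"{total_mb / max(len(files), 1):.1f} MB average per file")
```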
We implemented a comprehensive performance optimization strategy:
1. File Compaction Logic: Created a reusable PySpark function (sketched after this list) that runs after each incremental write. The function identifies files smaller than 128MB, groups them by partition, and uses `coalesce()` or `repartition()` to merge them into optimal sizes (128-256MB). This runs as a separate, low-priority job after the main transformation completes. For tables with 1000+ small files, this reduced file count to 50-100 files, cutting read time by 80%.
2. Metadata-Driven Partitioning Strategy: Enhanced metadata to include `partition_columns` and `partition_strategy` fields. Partition columns are typically date-based (e.g., `load_date`, `order_date`) or business keys (e.g., `region_id`). During writes, we partition data using `partitionBy()` with these columns. For incremental loads, we added partition pruning logic that only reads partitions where `partition_column >= last_watermark`, dramatically reducing I/O for date-partitioned tables.
3. Pre-partitioning for Merges: Before merge operations, we pre-partition both source and target DataFrames by the same key (primary key or business key) using `repartition()`. This ensures that when Spark performs the join, matching records are on the same partition, eliminating cross-partition shuffles. For a 50GB merge operation, this reduced network traffic from 200GB to 5GB.
4. Broadcast Joins for Lookups: For small lookup tables (< 100MB), we use broadcast joins. We identify these in metadata with an `is_lookup` flag and automatically apply `broadcast()` hints. This avoids shuffles entirely for dimension table joins.
5. Dynamic Cluster Sizing: Metadata now includes `estimated_table_size` and `complexity_score`. Based on these, we dynamically size Databricks clusters—small tables (1-5GB) use 2-worker clusters, medium (5-50GB) use 4-6 workers, and large (50GB+) use 8+ workers. This optimizes cost while ensuring adequate resources.
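A condensed PySpark sketch of items 1-4, shown as stand-alone helpers. Paths, thresholds, partition counts, and function names are illustrative assumptions rather than the production framework's code.

```python
# Condensed sketch of the optimizations in items 1-4. Names and values are
# illustrative; the real framework drives them from metadata.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()  # supplied automatically on Databricks

TARGET_FILE_MB = 128  # aim for 128-256MB output files

def compact_table(path, approx_size_mb):
    """Item 1: rewrite a folder of small files into a few right-sized files."""
    df = spark.read.parquet(path)
    num_files = max(1, int(approx_size_mb // TARGET_FILE_MB))
    # Write to a staging location, then swap in after validation in the real job.
    df.coalesce(num_files).write.mode("overwrite").parquet(path.rstrip("/") + "_compacted")

def read_changed_partitions(path, partition_col, last_watermark):
    """Item 2: partition pruning, only touch partitions at or after the watermark."""
    return spark.read.parquet(path).filter(F.col(partition_col) >= F.lit(last_watermark))

def align_for_merge(source_df, target_df, key, num_partitions=200):
    """Item 3: pre-partition both sides on the merge key so matching rows
    land on the same partition and the join avoids a full shuffle."""
    return (source_df.repartition(num_partitions, key),
            target_df.repartition(num_partitions, key))

def join_with_lookup(fact_df, lookup_df, key):
    """Item 4: broadcast small lookup tables to eliminate the shuffle entirely."""
    return fact_df.join(broadcast(lookup_df), on=key, how="left")
```

Item 5 (dynamic cluster sizing) lives in the ADF and metadata layer rather than in PySpark, so it is omitted here.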
Average job execution time dropped by 70%; a 60-minute job now completes in 18 minutes. Compute costs decreased by 45% due to reduced cluster time and better resource utilization. File compaction reduced storage costs by 20% (fewer files means less metadata overhead). The optimization framework is now applied to all 200+ tables, ensuring consistent performance as data volumes grow.
Issue 3: ADF pipeline failures due to dependency conflicts
ADF pipelines were experiencing intermittent failures, particularly for tables with foreign key relationships. For example, when processing the `order_items` table, the pipeline would sometimes fail because it tried to merge data before the parent `orders` table was fully processed. This caused referential integrity violations and partial data loads. The failures were unpredictable: sometimes dependent tables would succeed even when parent tables failed, leading to inconsistent data states that required manual cleanup.
After analyzing ADF pipeline execution logs and dependency graphs, we identified the following root causes:
1. Missing Dependency Metadata: The original metadata repository didn't capture table dependencies. ADF was reading all tables for a source system and triggering Databricks jobs in parallel without understanding parent-child relationships. For example, `order_items` depends on `orders`, but both jobs would start simultaneously, and `order_items` could complete before `orders`, causing foreign key constraint violations.
2. No Dependency Validation: There was no mechanism to check if a parent table's job completed successfully before starting a dependent table's job. ADF's built-in dependency features weren't being utilized because we were dynamically generating job triggers based on metadata.
3. Retry Logic Issues: When a parent table job failed, dependent tables would still execute (or retry) without waiting for the parent to succeed. This created cascading failures and data inconsistency. The retry mechanism was table-specific and didn't consider dependency chains.
4. No Parallel Execution Limits: ADF was attempting to run 50+ jobs in parallel, overwhelming the Databricks workspace and causing some jobs to fail due to resource contention. There was no throttling mechanism.
We redesigned the orchestration layer to handle dependencies properly:
1. Dependency Metadata Table: Created a new `table_dependencies` table in the metadata repository with columns: `table_name`, `depends_on_table`, `dependency_type` (FK, business logic, etc.), and `is_required` (boolean). This captures all parent-child relationships. For example: `order_items` depends on `orders`, `order_items` depends on `products`, etc.
2. Dependency Graph Construction: Modified the ADF pipeline to build a dependency graph using a recursive CTE query against the dependency table. The graph is topologically sorted to determine execution order. Tables with no dependencies are level 0, tables depending only on level 0 are level 1, and so on (this leveling is sketched after the list).
3. Sequential Execution by Level: ADF now executes tables level by level. All level 0 tables run in parallel (with throttling—max 10 concurrent jobs). Once all level 0 jobs complete successfully, level 1 jobs start. This ensures parent tables are always processed before children. We use ADF's `Wait` activity and conditional logic to check job status before proceeding.
4. Status Checking Before Triggering: Before triggering a dependent table's Databricks job, ADF queries the audit table to verify the parent table's status is "SUCCESS". If the parent failed or is still running, the dependent job waits (with a configurable timeout). This prevents orphaned records and referential integrity issues.
5. Retry with Dependency Awareness: Enhanced retry logic to be dependency-aware. If a parent table fails after retries, all dependent tables are marked as "BLOCKED" in the audit table with a reason code. Administrators can manually fix the parent issue and then trigger a "reprocess blocked" pipeline that only runs the blocked dependent tables. This prevents cascading failures.
6. Parallel Execution Throttling: Added a configurable `max_parallel_jobs` parameter (default: 10) to prevent overwhelming Databricks. ADF uses a queue mechanism—jobs wait in queue if the limit is reached, ensuring stable execution even with 100+ tables.
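The level-based ordering from items 2, 3, and 6 can be sketched in plain Python. In production this logic is implemented with a recursive CTE over `table_dependencies` plus ADF activities, so the dictionary input and `run_job` callback below are illustrative stand-ins.

```python
# Sketch of level-based dependency ordering (items 2, 3, and 6). The input
# dictionary mirrors the table_dependencies metadata; run_job is a stand-in
# for triggering a Databricks job from ADF and checking its audit status.
from collections import defaultdict

dependencies = {          # child -> set of parent tables (illustrative)
    "orders": set(),
    "products": set(),
    "order_items": {"orders", "products"},
}

def assign_levels(dependencies):
    """Topologically group tables: level 0 has no parents, level N depends
    only on tables in earlier levels. Raises on circular dependencies."""
    levels, remaining, level = {}, dict(dependencies), 0
    while remaining:
        ready = [t for t, parents in remaining.items()
                 if all(p in levels for p in parents)]
        if not ready:
            raise ValueError(f"Circular dependency among: {sorted(remaining)}")
        for t in ready:
            levels[t] = level
            del remaining[t]
        level += 1
    return levels

def run_by_level(levels, run_job, max_parallel_jobs=10):
    """Process one level at a time in chunks of max_parallel_jobs. The sketch
    calls run_job sequentially; in ADF each chunk runs concurrently and the
    audit table is checked before the next level starts."""
    by_level = defaultdict(list)
    for table, lvl in levels.items():
        by_level[lvl].append(table)
    for lvl in sorted(by_level):
        tables = by_level[lvl]
        for i in range(0, len(tables), max_parallel_jobs):
            for table in tables[i:i + max_parallel_jobs]:
                run_job(table)

run_by_level(assign_levels(dependencies), run_job=lambda t: print(f"running {t}"))
```

The BLOCKED marking and dependency-aware retries from items 4 and 5 would hang off `run_job`'s status check and are omitted here for brevity.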
Pipeline failure rate dropped from 15% to <1%. Data consistency improved dramatically—no more orphaned records or referential integrity violations. The dependency framework now handles 200+ tables with complex dependency chains (some tables have 5+ dependencies). The "blocked table" feature reduced manual intervention time by 90%, as administrators can quickly identify and fix root cause issues without manually cleaning up dependent table data.