Data Engineer interview at NielsenIQ with 2 YOE
Anonymous User
319
Sep 15, 2025
Oct 05, 2025

Data Engineer - Airflow, Python,Pyspark, Databricks position

  • Asked about different file formats (Delta file, ORC)

  • Difference between SCD2 and Delta files

  • Airflow functions, how to write sequential and parallel jobs

  • Advanced Python function (other than lambda like Decorators, Generators),
    inheritance, static function

SQL Question

TripID VehicleID StartTime EndTime DistanceKM
101 V1 2025-09-01 08:00:00 2025-09-01 09:00:00 15
103 V2 2025-09-01 08:30:00 2025-09-01 09:15:00 10
104 V2 2025-09-01 09:30:00 2025-09-01 10:30:00 20
105 V1 2025-09-01 12:00:00 2025-09-01 13:00:00 30
102 V1 2025-09-01 10:00:00 2025-09-01 11:30:00 25

1.Write a query to identify who completed the longest trip per vehicle in terms of time spent on the road.

2.Write a query to compare each trip’s distance with the previous trip’s distance for the same vehicle.
select vehicleID,unbounded preceding 1 row

  1. drop duplicates from table
    most optimize solution (not convinced by distinct, insert overwrite)

PySpark

file.csv
pickup_datetime,dropoff_datetime,passenger_count,trip_distance,trip_duration
2023-08-01 08:00:00,2023-08-01 08:20:00,1,2.5,20.0
2023-08-01 09:15:00,2023-08-01 09:50:00,2,5.0,35.0
2023-08-01 10:30:00,2023-08-01 10:40:00,1,1.2,10.0
2023-08-01 11:45:00,2023-08-01 12:10:00,3,7.8,25.0

1.Load the dataset into a Spark DataFrame and infer schema from the data.

2.Convert pickup_datetime and dropoff_datetime columns to timestamp type.

3.Implement a UDF to classify trips as 'short', 'medium', or 'long' based on trip_duration
Short: duration ≤ 15 minutes
Medium: 15 < duration ≤ 30 minutes
Long: duration > 30 minutes

Thank you !!

Comments (3)