Data Engineer - Airflow, Python,Pyspark, Databricks position
Asked about different file formats (Delta file, ORC)
Difference between SCD2 and Delta files
Airflow functions, how to write sequential and parallel jobs
Advanced Python function (other than lambda like Decorators, Generators),
inheritance, static function
SQL Question
TripID VehicleID StartTime EndTime DistanceKM
101 V1 2025-09-01 08:00:00 2025-09-01 09:00:00 15
103 V2 2025-09-01 08:30:00 2025-09-01 09:15:00 10
104 V2 2025-09-01 09:30:00 2025-09-01 10:30:00 20
105 V1 2025-09-01 12:00:00 2025-09-01 13:00:00 30
102 V1 2025-09-01 10:00:00 2025-09-01 11:30:00 25
1.Write a query to identify who completed the longest trip per vehicle in terms of time spent on the road.
2.Write a query to compare each trip’s distance with the previous trip’s distance for the same vehicle.
select vehicleID,unbounded preceding 1 row
PySpark
file.csv
pickup_datetime,dropoff_datetime,passenger_count,trip_distance,trip_duration
2023-08-01 08:00:00,2023-08-01 08:20:00,1,2.5,20.0
2023-08-01 09:15:00,2023-08-01 09:50:00,2,5.0,35.0
2023-08-01 10:30:00,2023-08-01 10:40:00,1,1.2,10.0
2023-08-01 11:45:00,2023-08-01 12:10:00,3,7.8,25.0
1.Load the dataset into a Spark DataFrame and infer schema from the data.
2.Convert pickup_datetime and dropoff_datetime columns to timestamp type.
3.Implement a UDF to classify trips as 'short', 'medium', or 'long' based on trip_duration
Short: duration ≤ 15 minutes
Medium: 15 < duration ≤ 30 minutes
Long: duration > 30 minutes
Thank you !!