**Data Engineer** - Airflow, Python,Pyspark, Databricks position

- Asked about different file formats (Delta file, ORC)
- Difference between SCD2 and Delta files

- Airflow functions, how to write sequential and parallel jobs

- Advanced Python function (other than lambda like Decorators, Generators),
inheritance, static function


**SQL Question**

TripID	VehicleID	StartTime	        EndTime	            DistanceKM
101	      V1	    2025-09-01 08:00:00	2025-09-01 09:00:00	15 
103	      V2	    2025-09-01 08:30:00	2025-09-01 09:15:00	10 
104	      V2	    2025-09-01 09:30:00	2025-09-01 10:30:00	20 
105	      V1	    2025-09-01 12:00:00	2025-09-01 13:00:00	30 
102	      V1	    2025-09-01 10:00:00	2025-09-01 11:30:00	25 
 
1.Write a query to identify who completed the longest trip per vehicle in terms of time spent on the road.


2.Write a query to compare each trip’s distance with the previous trip’s distance for the same vehicle.
select vehicleID,unbounded preceding 1 row

3. drop duplicates from table
	most optimize solution (not convinced by distinct, insert overwrite)


**PySpark**

file.csv
pickup_datetime,dropoff_datetime,passenger_count,trip_distance,trip_duration
2023-08-01 08:00:00,2023-08-01 08:20:00,1,2.5,20.0
2023-08-01 09:15:00,2023-08-01 09:50:00,2,5.0,35.0
2023-08-01 10:30:00,2023-08-01 10:40:00,1,1.2,10.0
2023-08-01 11:45:00,2023-08-01 12:10:00,3,7.8,25.0
 
1.Load the dataset into a Spark DataFrame and infer schema from the data.


2.Convert pickup_datetime and dropoff_datetime columns to timestamp type.


3.Implement a UDF to classify trips as 'short', 'medium', or 'long' based on trip_duration 
Short: duration ≤ 15 minutes
Medium: 15 < duration ≤ 30 minutes
Long: duration > 30 minutes

Thank you !!


Data Engineer - Airflow, Python,Pyspark, Databricks position

Asked about different file formats (Delta file, ORC)


Difference between SCD2 and Delta files


Airflow functions, how to write sequential and parallel jobs


Advanced Python function (ot