PySpark Interview Questions

🟢 Basic Level (1–20)

  1. What is PySpark?
  2. What is Apache Spark?
  3. Why do we use PySpark?
  4. What is the difference between Hadoop and Spark?
  5. What is an RDD in PySpark?
  6. What is a DataFrame in PySpark?
  7. What is the difference between an RDD and a DataFrame?
  8. What is a SparkSession?
  9. What is a SparkContext?
  10. What are transformations in PySpark?
  11. What are actions in PySpark?
  12. What is lazy evaluation?
  13. What does immutability mean in Spark?
  14. What is a partition in Spark?
  15. What is cluster computing?
  16. What is the driver program?
  17. What is an executor?
  18. What is a DAG in Spark?
  19. What is the PySpark API?
  20. What languages does Spark support?

⚙️ Core Concepts (21–40)

  21. What is a narrow transformation?
  22. What is a wide transformation?
  23. What is the difference between narrow and wide transformations?
  24. What is a shuffle in Spark?
  25. What is caching in PySpark?
  26. What is persistence?
  27. What is a broadcast variable?
  28. What is an accumulator?
  29. What is the map() function?
  30. What is flatMap()?
  31. What is the filter() function?
  32. What is reduceByKey()?
  33. What is groupByKey()?
  34. What is the difference between reduceByKey() and groupByKey()?
  35. What is a join operation?
  36. What types of joins does PySpark support?
  37. What is the select() function?
  38. What is withColumn()?
  39. What is the drop() function?
  40. What is a schema in a DataFrame?

📊 DataFrame & SQL (41–60)

  41. What is Spark SQL?
  42. What is a temp view?
  43. What is a global temp view?
  44. What is the show() function?
  45. What is the collect() function?
  46. What is the count() function?
  47. What is distinct()?
  48. What is groupBy()?
  49. What is aggregation?
  50. What is orderBy()?
  51. What is sort()?
  52. What is alias() in PySpark?
  53. What is the difference between SQL and the DataFrame API?
  54. How do you read a CSV file in PySpark?
  55. How do you read a JSON file?
  56. How do you write a DataFrame to a file?
  57. What is inferSchema?
  58. What is null handling?
  59. What is dropna()?
  60. What is fillna()?

⚡ Advanced Level (61–80)

  61. What is the Spark architecture?
  62. How does Spark execute a job?
  63. What is a task in Spark?
  64. What is a stage in Spark?
  65. What is a job in Spark?
  66. What is a partitioning strategy?
  67. What is bucketing?
  68. What is checkpointing?
  69. What is a lineage graph?
  70. What is fault tolerance?
  71. How does Spark handle memory?
  72. What is the Tungsten engine?
  73. What is the Catalyst optimizer?
  74. What is data serialization?
  75. What is the Parquet file format?
  76. Why is Parquet faster?
  77. What is the ORC format?
  78. What is the Avro format?
  79. What is the difference between Parquet and CSV?
  80. What is Spark Streaming?

🚀 Scenario-Based (81–100)

  81. How do you handle big data in PySpark?
  82. How do you optimize PySpark performance?
  83. How do you reduce shuffle in Spark?
  84. How do you handle skewed data?
  85. How do you improve join performance?
  86. How do you debug Spark jobs?
  87. How do you handle missing data?
  88. How do you process real-time data?
  89. How do you scale a Spark application?
  90. How do you tune Spark memory?
  91. How do you handle large joins?
  92. How do you use caching effectively?
  93. How do you use broadcast joins?
  94. How do you write efficient Spark code?
  95. How do you handle failures in Spark?
  96. How do you monitor Spark jobs?
  97. How do you optimize DataFrame operations?
  98. How do you design an ETL pipeline in Spark?
  99. Why do companies use PySpark?
  100. What is the difference between batch and streaming processing?

PySpark Interview Answers (1–100)

🟢 Basic (1–20)

  1. Python API for Apache Spark
  2. Big data processing engine
  3. For large-scale data processing
  4. Hadoop is disk-based, Spark is in-memory
  5. Resilient Distributed Dataset
  6. Distributed table-like data structure
  7. RDD = low-level, DataFrame = optimized
  8. Entry point for Spark applications
  9. Older low-level entry point (RDD API)
  10. Operations that create new RDD/DataFrame
  11. Operations that trigger execution
  12. Delayed execution until action is called
  13. Data cannot be changed once created
  14. Data split across cluster nodes
  15. Computing across multiple machines
  16. Main process that coordinates the Spark job
  17. Worker node executing tasks
  18. Execution plan of Spark job
  19. Python interface for Spark
  20. Python, Java, Scala, R

⚙️ Core (21–40)

  21. Transformation with no shuffle (e.g., map, filter)
  22. Transformation requiring data movement across the cluster (e.g., join)
  23. Narrow = fast; wide = slow due to shuffle
  24. Data movement between nodes
  25. Storing data in memory/disk for reuse
  26. Storing data at a chosen storage level (memory, disk, or both)
  27. Read-only variable cached on every executor
  28. Write-only shared variable for counters and sums
  29. Applies a function to each element
  30. Applies a function, then flattens the nested result
  31. Filters data based on a condition
  32. Aggregates values by key with map-side combining (faster)
  33. Groups all values by key (full shuffle)
  34. reduceByKey combines before the shuffle, so it moves less data
  35. Combines two datasets on a key
  36. Inner, left, right, full outer, cross, semi, anti
  37. Selects columns
  38. Adds or modifies a column
  39. Removes a column
  40. Column names and types of a DataFrame

📊 DataFrame & SQL (41–60)

  41. SQL queries on Spark data
  42. Temporary view scoped to one SparkSession
  43. View shared across sessions (in the global_temp database)
  44. Displays DataFrame rows
  45. Returns all rows to the driver (avoid on big data)
  46. Counts rows
  47. Removes duplicate rows
  48. Groups data by column(s)
  49. Summary computations (sum, avg, count) over groups
  50. Sorts in ascending/descending order
  51. Sorting function (alias of orderBy)
  52. Renames a column
  53. SQL is query-based; the DataFrame API is programmatic
  54. spark.read.csv(path)
  55. spark.read.json(path)
  56. df.write.csv()/parquet()/json()
  57. Automatically detects column types
  58. Dealing with missing (null) values
  59. Removes rows containing nulls
  60. Fills null values with defaults

⚡ Advanced (61–80)

  61. Driver + executors, coordinated by a cluster manager
  62. Job → stages → tasks execution
  63. Smallest unit of execution (one per partition)
  64. Group of tasks between shuffle boundaries
  65. All work triggered by one action
  66. How data is split across nodes
  67. Hash-partitioning data into a fixed number of buckets to speed joins
  68. Saves intermediate results and truncates lineage
  69. Graph of transformations used to recompute lost partitions
  70. Ability to recover from failures (via lineage)
  71. Unified memory for execution and storage, spilling to disk
  72. Memory/CPU optimization engine (off-heap data, code generation)
  73. Query optimization engine for DataFrames and SQL
  74. Converting objects to bytes for transfer and storage
  75. Columnar storage format
  76. Column pruning, predicate pushdown, and compression
  77. Columnar format like Parquet
  78. Row-based format with schema evolution
  79. Parquet is columnar and compressed; CSV is plain text
  80. Near-real-time (micro-batch) data processing

🚀 Scenario (81–100)

  81. Distributed processing on a Spark cluster
  82. Caching, partitioning, avoiding shuffles
  83. Map-side aggregation (reduceByKey) and broadcast joins
  84. Salting keys or repartitioning
  85. Broadcast join or bucketing
  86. Logs and the Spark UI
  87. dropna() or fillna()
  88. Spark Streaming or Structured Streaming
  89. Add more executors/nodes
  90. Tune executor memory settings
  91. Broadcast joins or bucketed, pre-partitioned joins
  92. Cache frequently reused DataFrames (and unpersist when done)
  93. Broadcast the smaller dataset
  94. Filter early; avoid unnecessary shuffles
  95. Retry mechanisms + checkpointing
  96. Spark UI and event logs
  97. Select needed columns and filter early
  98. Extract → Transform → Load pipeline
  99. Fast, scalable, distributed processing from Python
  100. Batch = bounded/static data; streaming = continuous real-time data