PySpark Interview Questions
🟢 Basic Level (1–20)
- What is PySpark?
- What is Apache Spark?
- Why do we use PySpark?
- Difference between Hadoop and Spark?
- What is RDD in PySpark?
- What is DataFrame in PySpark?
- Difference between RDD and DataFrame?
- What is SparkSession?
- What is SparkContext?
- What are transformations in PySpark?
- What are actions in PySpark?
- What is lazy evaluation?
- What is immutability in Spark?
- What is partition in Spark?
- What is cluster computing?
- What is driver program?
- What is executor?
- What is DAG in Spark?
- What is PySpark API?
- What languages does Spark support?
⚙️ Core Concepts (21–40)
- What is narrow transformation?
- What is wide transformation?
- Difference between narrow and wide transformations?
- What is shuffle in Spark?
- What is caching in PySpark?
- What is persistence?
- What is broadcast variable?
- What is accumulator?
- What is map() function?
- What is flatMap()?
- What is filter() function?
- What is reduceByKey()?
- What is groupByKey()?
- Difference between reduceByKey and groupByKey?
- What is join operation?
- Types of joins in PySpark?
- What is select() function?
- What is withColumn()?
- What is drop() function?
- What is schema in DataFrame?
📊 DataFrame & SQL (41–60)
- What is Spark SQL?
- What is temp view?
- What is global temp view?
- What is show() function?
- What is collect() function?
- What is count() function?
- What is distinct()?
- What is groupBy()?
- What is aggregation?
- What is orderBy()?
- What is sort()?
- What is alias() in PySpark?
- What is SQL vs DataFrame API?
- How to read CSV in PySpark?
- How to read JSON file?
- How to write DataFrame to file?
- What is inferSchema?
- What is null handling?
- What is dropna()?
- What is fillna()?
⚡ Advanced Level (61–80)
- What is Spark architecture?
- How does Spark execute a job?
- What is task in Spark?
- What is stage in Spark?
- What is job in Spark?
- What is partitioning strategy?
- What is bucketing?
- What is checkpointing?
- What is lineage graph?
- What is fault tolerance?
- How does Spark handle memory?
- What is Tungsten engine?
- What is Catalyst optimizer?
- What is data serialization?
- What is Parquet file format?
- Why is Parquet faster?
- What is ORC format?
- What is Avro format?
- Difference between Parquet and CSV?
- What is Spark streaming?
🚀 Scenario-Based (81–100)
- How do you handle big data in PySpark?
- How to optimize PySpark performance?
- How to reduce shuffle in Spark?
- How to handle skewed data?
- How to improve join performance?
- How to debug Spark jobs?
- How to handle missing data?
- How to process real-time data?
- How to scale Spark application?
- How to tune Spark memory?
- How to handle large joins?
- How to use caching effectively?
- How to use broadcast joins?
- How to write efficient Spark code?
- How to handle failures in Spark?
- How to monitor Spark jobs?
- How to optimize DataFrame operations?
- How to design ETL pipeline in Spark?
- Why do companies use PySpark?
- Difference between batch and streaming processing?
PySpark Interview Answers (1–100)
🟢 Basic (1–20)
- Python API for Apache Spark
- Big data processing engine
- For large-scale data processing
- Hadoop MapReduce writes intermediate data to disk; Spark keeps it in memory
- Resilient Distributed Dataset
- Distributed table-like data structure
- RDD = low-level, DataFrame = optimized
- Entry point for Spark applications
- Low-level entry point for the RDD API (now wrapped by SparkSession)
- Operations that create new RDD/DataFrame
- Operations that trigger execution
- Delayed execution until action is called
- Data cannot be changed once created
- Data split across cluster nodes
- Computing across multiple machines
- Program that runs Spark job
- Process on a worker node that runs tasks
- Execution plan of Spark job
- Python interface for Spark
- Python, Java, Scala, R
⚙️ Core (21–40)
- Transformation with no shuffle
- Requires data movement across cluster
- Narrow = fast, Wide = slow due to shuffle
- Data movement between nodes
- Storing data in memory/disk
- Storing intermediate data
- Read-only variable cached on every executor
- Shared variable that executors can only add to (e.g. counters)
- Applies function to each element
- Flattens nested structure
- Filters data based on condition
- Aggregation by key (faster)
- Groups all values by key
- reduceByKey combines values locally before the shuffle, so less data moves
- Combines two datasets
- Inner, left/right/full outer, semi, anti, cross
- Select columns
- Add or modify column
- Remove column
- Structure of DataFrame
📊 DataFrame & SQL (41–60)
- SQL queries on Spark data
- Temporary DataFrame view
- Global accessible view
- Displays DataFrame
- Returns all rows to the driver (not recommended for big data)
- Counts rows
- Removes duplicates
- Groups data
- Summarizing data with functions like sum, avg, count
- Sort in ascending/descending
- Alias of orderBy()
- Rename column
- SQL is query-based, DataFrame is API-based
- spark.read.csv() with options like header and inferSchema
- spark.read.json()
- df.write, e.g. df.write.parquet(path)
- Automatically detects schema
- Handling missing values
- Removes null rows
- Fills null values
⚡ Advanced (61–80)
- Driver + Executors architecture
- Job → Stage → Task execution
- Small unit of execution
- Group of tasks
- All stages triggered by a single action
- Splitting data across nodes
- Hashing rows into a fixed number of buckets by column value at write time
- Saving state to reliable storage and truncating lineage
- Record of transformations, used to recompute lost partitions
- System recovery ability
- Uses memory + disk storage
- Optimizes memory layout and CPU via off-heap storage and code generation
- Query optimization engine
- Converting objects to bytes for shuffle and storage (e.g. Kryo)
- Columnar storage format
- Faster reading and compression
- Columnar format like Parquet
- Row-based format
- Parquet is faster and compressed
- Real-time data processing
🚀 Scenario (81–100)
- Distributed processing with Spark cluster
- Caching, partitioning, avoiding shuffle
- Use map-side aggregation
- Use salting or repartition
- Broadcast the smaller table or pre-partition on the join key
- Using logs and Spark UI
- dropna() or fillna()
- Spark Streaming or Structured Streaming
- Add more executors/nodes
- Tune executor memory settings
- Use broadcast joins
- Cache frequently used DataFrames
- Broadcast smaller dataset
- Avoid unnecessary transformations
- Retry mechanisms + checkpointing
- Spark UI monitoring
- Use select and filter early
- Extract → Transform → Load pipeline
- Fast, scalable, distributed processing
- Batch = static data, Streaming = real-time data