PySpark Interview Questions

🟢 Basic Level (1–20)

  1. What is PySpark?
  2. What is Apache Spark?
  3. Why do we use PySpark?
  4. What is the difference between Hadoop and Spark?
  5. What is an RDD in PySpark?
  6. What is a DataFrame in PySpark?
  7. What is the difference between an RDD and a DataFrame?
  8. What is a SparkSession?
  9. What is a SparkContext?
  10. What are transformations in PySpark?
  11. What are actions in PySpark?
  12. What is lazy evaluation?
  13. What does immutability mean in Spark?
  14. What is a partition in Spark?
  15. What is cluster computing?
  16. What is the driver program?
  17. What is an executor?
  18. What is a DAG in Spark?
  19. What is the PySpark API?
  20. What languages does Spark support?

⚙️ Core Concepts (21–40)

  21. What is a narrow transformation?
  22. What is a wide transformation?
  23. What is the difference between narrow and wide transformations?
  24. What is a shuffle in Spark?
  25. What is caching in PySpark?
  26. What is persistence?
  27. What is a broadcast variable?
  28. What is an accumulator?
  29. What is the map() function?
  30. What is flatMap()?
  31. What is the filter() function?
  32. What is reduceByKey()?
  33. What is groupByKey()?
  34. What is the difference between reduceByKey() and groupByKey()?
  35. What is a join operation?
  36. What types of joins does PySpark support?
  37. What is the select() function?
  38. What is withColumn()?
  39. What is the drop() function?
  40. What is a schema in a DataFrame?

📊 DataFrame & SQL (41–60)

  41. What is Spark SQL?
  42. What is a temp view?
  43. What is a global temp view?
  44. What is the show() function?
  45. What is the collect() function?
  46. What is the count() function?
  47. What is distinct()?
  48. What is groupBy()?
  49. What is aggregation?
  50. What is orderBy()?
  51. What is sort()?
  52. What is alias() in PySpark?
  53. What is the difference between SQL and the DataFrame API?
  54. How do you read a CSV file in PySpark?
  55. How do you read a JSON file?
  56. How do you write a DataFrame to a file?
  57. What is inferSchema?
  58. What is null handling?
  59. What is dropna()?
  60. What is fillna()?

⚡ Advanced Level (61–80)

  61. What is the Spark architecture?
  62. How does Spark execute a job?
  63. What is a task in Spark?
  64. What is a stage in Spark?
  65. What is a job in Spark?
  66. What is a partitioning strategy?
  67. What is bucketing?
  68. What is checkpointing?
  69. What is a lineage graph?
  70. What is fault tolerance?
  71. How does Spark handle memory?
  72. What is the Tungsten engine?
  73. What is the Catalyst optimizer?
  74. What is data serialization?
  75. What is the Parquet file format?
  76. Why is Parquet faster?
  77. What is the ORC format?
  78. What is the Avro format?
  79. What is the difference between Parquet and CSV?
  80. What is Spark Streaming?

🚀 Scenario-Based (81–100)

  81. How do you handle big data in PySpark?
  82. How do you optimize PySpark performance?
  83. How do you reduce shuffle in Spark?
  84. How do you handle skewed data?
  85. How do you improve join performance?
  86. How do you debug Spark jobs?
  87. How do you handle missing data?
  88. How do you process real-time data?
  89. How do you scale a Spark application?
  90. How do you tune Spark memory?
  91. How do you handle large joins?
  92. How do you use caching effectively?
  93. How do you use broadcast joins?
  94. How do you write efficient Spark code?
  95. How do you handle failures in Spark?
  96. How do you monitor Spark jobs?
  97. How do you optimize DataFrame operations?
  98. How do you design an ETL pipeline in Spark?
  99. Why do companies use PySpark?
  100. What is the difference between batch and streaming processing?

PySpark Interview Answers (1–100)

🟢 Basic (1–20)

  1. Python API for Apache Spark
  2. Big data processing engine
  3. For large-scale data processing
  4. Hadoop is disk-based, Spark is in-memory
  5. Resilient Distributed Dataset
  6. Distributed table-like data structure
  7. RDD = low-level, DataFrame = optimized
  8. Entry point for Spark applications
  9. Older low-level entry point (RDD API)
  10. Operations that create new RDD/DataFrame
  11. Operations that trigger execution
  12. Delayed execution until action is called
  13. Data cannot be changed once created
  14. Data split across cluster nodes
  15. Computing across multiple machines
  16. Main process that coordinates the Spark job
  17. Worker node executing tasks
  18. Execution plan of Spark job
  19. Python interface for Spark
  20. Python, Java, Scala, R

⚙️ Core (21–40)

  21. Transformation with no shuffle (e.g., map, filter)
  22. Transformation requiring data movement across the cluster (e.g., join)
  23. Narrow = fast; wide = slow due to shuffle
  24. Data movement between nodes
  25. Storing data in memory/disk for reuse
  26. Storing data at a chosen storage level (memory, disk, or both)
  27. Read-only variable cached on every executor
  28. Write-only shared variable for counters and sums
  29. Applies a function to each element
  30. Applies a function, then flattens the nested result
  31. Filters data based on a condition
  32. Aggregates values by key with map-side combining (faster)
  33. Groups all values by key (full shuffle)
  34. reduceByKey combines before the shuffle, so it moves less data
  35. Combines two datasets on a key
  36. Inner, left, right, full outer, cross, semi, anti
  37. Selects columns
  38. Adds or modifies a column
  39. Removes a column
  40. Column names and types of a DataFrame

📊 DataFrame & SQL (41–60)

  41. SQL queries on Spark data
  42. Temporary view scoped to one SparkSession
  43. View shared across sessions (in the global_temp database)
  44. Displays DataFrame rows
  45. Returns all rows to the driver (avoid on big data)
  46. Counts rows
  47. Removes duplicate rows
  48. Groups data by column(s)
  49. Summary computations (sum, avg, count) over groups
  50. Sorts in ascending/descending order
  51. Sorting function (alias of orderBy)
  52. Renames a column
  53. SQL is query-based; the DataFrame API is programmatic
  54. spark.read.csv(path)
  55. spark.read.json(path)
  56. df.write.csv()/parquet()/json()
  57. Automatically detects column types
  58. Dealing with missing (null) values
  59. Removes rows containing nulls
  60. Fills null values with defaults

⚡ Advanced (61–80)

  61. Driver + executors, coordinated by a cluster manager
  62. Job → stages → tasks execution
  63. Smallest unit of execution (one per partition)
  64. Group of tasks between shuffle boundaries
  65. All work triggered by one action
  66. How data is split across nodes
  67. Hash-partitioning data into a fixed number of buckets to speed joins
  68. Saves intermediate results and truncates lineage
  69. Graph of transformations used to recompute lost partitions
  70. Ability to recover from failures (via lineage)
  71. Unified memory for execution and storage, spilling to disk
  72. Memory/CPU optimization engine (off-heap data, code generation)
  73. Query optimization engine for DataFrames and SQL
  74. Converting objects to bytes for transfer and storage
  75. Columnar storage format
  76. Column pruning, predicate pushdown, and compression
  77. Columnar format like Parquet
  78. Row-based format with schema evolution
  79. Parquet is columnar and compressed; CSV is plain text
  80. Near-real-time (micro-batch) data processing

🚀 Scenario (81–100)

  81. Distributed processing on a Spark cluster
  82. Caching, partitioning, avoiding shuffles
  83. Map-side aggregation (reduceByKey) and broadcast joins
  84. Salting keys or repartitioning
  85. Broadcast join or bucketing
  86. Logs and the Spark UI
  87. dropna() or fillna()
  88. Spark Streaming or Structured Streaming
  89. Add more executors/nodes
  90. Tune executor memory settings
  91. Broadcast joins or bucketed, pre-partitioned joins
  92. Cache frequently reused DataFrames (and unpersist when done)
  93. Broadcast the smaller dataset
  94. Filter early; avoid unnecessary shuffles
  95. Retry mechanisms + checkpointing
  96. Spark UI and event logs
  97. Select needed columns and filter early
  98. Extract → Transform → Load pipeline
  99. Fast, scalable, distributed processing from Python
  100. Batch = bounded/static data; streaming = continuous real-time data