Common DataFrame operations

Categories: spark

Author: Youfeng Zhou

Published: November 21, 2022


Tip

If we don’t specify a schema, Spark can infer it from the data.
However, for large datasets and files, it is more efficient to define the schema explicitly than to have Spark infer it.

Using the DataFrameReader interface to read a CSV file

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName('CSV-Reader').getOrCreate()
# In Python, read a CSV file with a header row and a predefined schema.
df = spark.read.csv('csv_file_path', header=True, schema=csv_file_schema)
Important

csv_file_schema is defined before reading the file.
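As a minimal sketch, a schema can be defined with StructType and StructField before the read; the column names and types below are hypothetical placeholders, not from the original post.

# Hypothetical schema; adjust the field names and types to match the actual file.
csv_file_schema = StructType([
    StructField('id', IntegerType(), True),      # the third argument marks the column as nullable
    StructField('name', StringType(), True),
    StructField('value', DoubleType(), True),
])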

Saving a DataFrame as a Parquet file or SQL table

parquet_path = '...'                                    # path to save the file to
df.write.format("parquet").save(parquet_path)
parquet_table = '...'                                   # name of the table
df.write.format("parquet").saveAsTable(parquet_table)

SQL tables will be covered in a later post.
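To sanity-check the save, the Parquet file can be read back into a DataFrame; this is a sketch assuming parquet_path points at the file written above.

# Read the Parquet file back; the schema is stored in the file itself,
# so no schema argument is needed.
df_parquet = spark.read.format("parquet").load(parquet_path)
# Equivalent shorthand: spark.read.parquet(parquet_path)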

Projections and filters

from pyspark.sql.functions import col

# In Spark, projections are done with the select() method,
# while filters can be applied with the filter() or where() method.
# 'col_a' and 'col_b' are hypothetical column names; substitute your own.
sub_df = df.select('col_a', 'col_b').where(col('col_b') == 'some_value')
sub_df.show(5, truncate=False)
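Multiple filter conditions can be combined with & (and), | (or), and ~ (not). A brief sketch using the same hypothetical columns:

# Each condition must be parenthesized, because & and | bind
# more tightly than the comparison operators.
sub_df2 = df.where((col('col_a') > 10) & (col('col_b') != 'excluded'))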

Renaming, adding, dropping columns

df.withColumnRenamed('current_column_name', 'new_column_name')
# withColumn() takes a Column expression as its second argument, not a string;
# here a hypothetical derived column is added.
df.withColumn('new_column', col('existing_column') * 2)
df.drop('column_to_drop')
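DataFrames are immutable, so each of these methods returns a new DataFrame rather than modifying df in place; the calls are typically chained, as in this sketch (column names are placeholders):

# Chain the transformations and bind the result to a new variable.
df_clean = (df.withColumnRenamed('current_column_name', 'new_column_name')
              .withColumn('new_column', col('existing_column') * 2)
              .drop('column_to_drop'))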

For more details, see the book Learning Spark.

spark.stop()