from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.appName('CSV-Reader').getOrCreate()
spark
Common DataFrame operations
Tip
If we don’t specify a schema, Spark can infer it from the data.
However, for large datasets and files, it’s more efficient to define a schema up front than to have Spark infer it.
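Schema inference for CSV is opt-in via the inferSchema option. A minimal sketch ('csv_file_path' is a placeholder, not a real path):
# Let Spark scan the data to infer column types.
inferred_df = spark.read.csv('csv_file_path', header=True, inferSchema=True)
inferred_df.printSchema()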
Using the DataFrameReader interface to read a CSV file
# In Python, the syntax is as follows.
df = spark.read.csv('csv_file_path', header=True, schema=csv_file_schema)
Important
csv_file_schema is defined before reading the file.
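As a sketch, csv_file_schema can be built from StructType and StructField; the column names and types below are hypothetical:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical columns for illustration only.
csv_file_schema = StructType([
    StructField('name', StringType(), True),  # nullable string column
    StructField('age', IntegerType(), True),  # nullable integer column
])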
Saving a DataFrame as a parquet file or SQL table
parquet_path = '...'
df.write.format("parquet").save(parquet_path)parquet_table = '...' # name of the table
df.write.format("parquet").saveAsTable(parquet_table)SQL table will cover later
Projections and filters
from pyspark.sql.functions import col

# In Spark, projections are done with the select() method,
# while filters can be applied with the filter() or where() method.
sub_df = df.select('column_1', 'column_2').where(col('column_3') == 'some condition')
sub_df.show(5, truncate=False)
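Note that filter() and where() are aliases. A sketch combining two conditions, assuming hypothetical age and name columns; each comparison needs its own parentheses because & binds tighter than >=:
# Hypothetical columns; & combines conditions, | is the logical OR.
adults = df.filter((col('age') >= 18) & (col('name').isNotNull()))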
Renaming, adding, dropping columns
# Each of these returns a new DataFrame; assign the result to keep it.
df.withColumnRenamed('current column name', 'new column name')
df.withColumn('new column name', column_expression)  # second argument must be a Column expression, not a string
df.drop('column to drop')
For more details, see the book Learning Spark.
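A minimal runnable sketch chaining all three calls, assuming hypothetical age and name columns:
from pyspark.sql.functions import col

new_df = (df.withColumnRenamed('age', 'age_years')            # rename a column
            .withColumn('age_months', col('age_years') * 12)  # add a derived column
            .drop('name'))                                    # drop an unneeded column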
spark.stop()