Define a Spark schema

spark
Author

Youfeng Zhou

Published

November 17, 2022

Setup

from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Simple-table').getOrCreate()

Define it programmatically

schema = StructType([StructField("first name", StringType(), False), 
                     StructField("last name", StringType(), False), 
                     StructField("weight", IntegerType(), False)])
schema
StructType([StructField('first name', StringType(), False), StructField('last name', StringType(), False), StructField('weight', IntegerType(), False)])

The third argument to StructField indicates whether the field is nullable; False means the field cannot be null (None).

data = [['Jake', 'Z', 60], ['Tom', 'X', 50]]
df = spark.createDataFrame(data, schema)
df.show()
+----------+---------+------+
|first name|last name|weight|
+----------+---------+------+
|      Jake|        Z|    60|
|       Tom|        X|    50|
+----------+---------+------+

Define it using DDL

This method is much simpler.

schema = "first_name STRING, last_name STRING, weight INT"
schema
'first_name STRING, last_name STRING, weight INT'
df = spark.createDataFrame(data, schema)
df.show()
+----------+---------+------+
|first_name|last_name|weight|
+----------+---------+------+
|      Jake|        Z|    60|
|       Tom|        X|    50|
+----------+---------+------+
Warning

A disadvantage of this method is that column names cannot contain spaces or hyphens.

Stop the Spark session

spark.stop()