from pyspark.sql.types import *
from pyspark.sql import SparkSession

Define a Spark schema

Setup

spark = SparkSession.builder.appName('Simple-table').getOrCreate()

Define it programmatically
schema = StructType([StructField("first name", StringType(), False),
                     StructField("last name", StringType(), False),
                     StructField("weight", IntegerType(), False)])
The third argument indicates whether the field can be null (None); here False means the column is not nullable.
data = [['Jake', 'Z', 60], ['Tom', 'X', 50]]
df = spark.createDataFrame(data, schema)
df.show()

+----------+---------+------+
|first name|last name|weight|
+----------+---------+------+
| Jake| Z| 60|
| Tom| X| 50|
+----------+---------+------+
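The nullable flag (the third StructField argument) is what makes Spark reject rows with missing values in those columns. As a plain-Python sketch of that rule, without Spark, a hypothetical validate helper might look like this:

```python
# Hypothetical sketch of the nullability rule: with nullable=False,
# Spark rejects rows whose field is None. FIELDS mirrors the schema above.
FIELDS = [("first name", str, False), ("last name", str, False), ("weight", int, False)]

def validate(row):
    # Raise if a non-nullable field holds None, otherwise return the row.
    for (name, _typ, nullable), value in zip(FIELDS, row):
        if value is None and not nullable:
            raise ValueError(f"field '{name}' is not nullable")
    return row

validate(['Jake', 'Z', 60])       # passes
# validate(['Jake', None, 60])    # would raise ValueError
```

This is only an illustration of the contract; Spark performs its own check when the DataFrame is materialized.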
Define it using DDL
This method is much simpler.
schema = "first_name STRING, last_name STRING, weight INT"
df = spark.createDataFrame(data, schema)
df.show()

+----------+---------+------+
|first_name|last_name|weight|
+----------+---------+------+
| Jake| Z| 60|
| Tom| X| 50|
+----------+---------+------+
Warning
A disadvantage of this method is that column names cannot contain spaces or hyphens.
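One way around this limitation is to normalize the names before building the DDL string. The to_ddl helper below is hypothetical, not part of PySpark; it replaces runs of spaces and hyphens with underscores:

```python
import re

def to_ddl(fields):
    """Build a Spark DDL schema string from (name, type) pairs.

    Hypothetical helper: any run of spaces or hyphens in a column name
    is replaced with a single underscore, since DDL names cannot
    contain them.
    """
    safe = [(re.sub(r"[\s-]+", "_", name.strip()), dtype)
            for name, dtype in fields]
    return ", ".join(f"{name} {dtype}" for name, dtype in safe)

print(to_ddl([("first name", "STRING"), ("last name", "STRING"), ("weight", "INT")]))
# → first_name STRING, last_name STRING, weight INT
```

The result can then be passed straight to spark.createDataFrame as the schema argument.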
Stop the Spark session
spark.stop()