I am using PySpark. With Spark 2.1 and later, you can load a JSON file after defining its schema.
Example 1
test01.json
{
    "_id": "2d3erf5",
    "testNo": "0001",
    "Date": "2017-09-01 00:00:00.00000"
}
test01.py
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.appName('Spark SQL and DataFrame').getOrCreate()
testColumn = StructType([
    StructField('_id', StringType(), False),
    StructField('testNo', IntegerType(), False),
    StructField('Date', TimestampType(), False)
])
readFile = '/tmp/test01.json'
test_df = spark.read.json(readFile, schema=testColumn)
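One note: spark.read.json expects JSON Lines by default (one JSON object per line), so a pretty-printed file like test01.json may not load as-is. On Spark 2.2 and later the multiLine option handles this; a minimal sketch:

# test01.json is pretty-printed across several lines, so enable multiLine
# (Spark 2.2+); by default spark.read.json expects one object per line
test_df = spark.read.json(readFile, schema=testColumn, multiLine=True)
test_df.printSchema()  # confirm the schema was applied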
This time I need to load a nested JSON file. How should I write it?
Example 2
test02.json
{
    "_id": "2d3erf5",
    "testNo": "0001",
    "test_date": {
        "date_1": "2017-09-01 00:00:00.00000",
        "date_2": "2017-09-05 03:00:00.00000"
    }
}
I'm wondering whether to write it like this. Could someone please tell me?
test02.py
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.appName('Spark SQL and DataFrame').getOrCreate()
testColumn = StructType([
    StructField('_id', StringType(), False),
    StructField('testNo', IntegerType(), False),
    StructField('test_date.date_1', TimestampType(), False),
    StructField('test_date.date_2', TimestampType(), False)
])
readFile = '/tmp/test02.json'
test_df = spark.read.json(readFile, schema=testColumn)
Define the nested schema as follows:
testColumn = StructType([
    StructField('_id', StringType(), False),
    StructField('testNo', IntegerType(), False),
    StructField('test_date', StructType([
        StructField('date_1', TimestampType(), False),
        StructField('date_2', TimestampType(), False)
    ]), False)
])
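With this schema, test_df.printSchema() should show test_date as a nested struct, roughly like this (the exact nullable flags may differ when reading from a file):

root
 |-- _id: string (nullable = true)
 |-- testNo: integer (nullable = true)
 |-- test_date: struct (nullable = true)
 |    |-- date_1: timestamp (nullable = true)
 |    |-- date_2: timestamp (nullable = true)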
Then use select to pull out only the fields you need:
test_df = spark.read.json(readFile, schema=testColumn)
x = test_df.select('_id', 'test_date.date_1')
x.show()
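Assuming the sample test02.json above and that its timestamp strings parse under your Spark version's default timestampFormat, the output should look something like:

+-------+-------------------+
|    _id|             date_1|
+-------+-------------------+
|2d3erf5|2017-09-01 00:00:00|
+-------+-------------------+

If the timestamps come back null instead, you can pass an explicit format when reading; the pattern below is an assumption matching the sample data, not something Spark requires:

# assumed pattern for values like '2017-09-01 00:00:00.00000' in the sample file
test_df = spark.read.json(readFile, schema=testColumn, multiLine=True,
                          timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSS')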