I want to load a nested JSON file into a Spark DataFrame

Asked 1 year ago, Updated 1 year ago, 58 views

I am using PySpark. With Spark 2.1 and later, you can load a JSON file after defining its schema.

Example 1
test01.json

{
 "_id": "2d3erf5",
 "testNo": "0001",
 "Date": "2017-09-01 00:00:00.00000"
}

test01.py

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName('Spark SQL and DataFrame').getOrCreate()

# Explicit schema: one StructField per top-level JSON key
testColumn = StructType([
  StructField('_id', StringType(), False),
  StructField('testNo', IntegerType(), False),
  StructField('Date', TimestampType(), False)
])
readFile = '/tmp/test01.json'
test_df = spark.read.json(readFile, schema=testColumn)

This time I need to load a nested JSON file. How should I write the schema?

Example 2
test02.json

{
  "_id": "2d3erf5",
  "testNo": "0001",
  "test_date": {
    "date_1": "2017-09-01 00:00:00.00000",
    "date_2": "2017-09-05 03:00:00.00000"
  }
}

I wonder whether I should write it like this. Could someone please tell me?

test02.py

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName('Spark SQL and DataFrame').getOrCreate()

# Attempt: address the nested keys with dotted field names
testColumn = StructType([
  StructField('_id', StringType(), False),
  StructField('testNo', IntegerType(), False),
  StructField('test_date.date_1', TimestampType(), False),
  StructField('test_date.date_2', TimestampType(), False),
])
readFile = '/tmp/test02.json'
test_df = spark.read.json(readFile, schema=testColumn)

json python3 spark

2022-09-30 15:43

1 Answer

Define the nested schema as follows:

# The nested object is itself a StructType, used as the data type of 'test_date'
testColumn = StructType([
  StructField('_id', StringType(), False),
  StructField('testNo', IntegerType(), False),
  StructField('test_date',
              StructType([
                  StructField('date_1', TimestampType(), False),
                  StructField('date_2', TimestampType(), False)
              ]), False)
])
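For reference, here is a minimal end-to-end sketch of the same idea, not a definitive implementation. The helper names (write_sample, nested_schema, load) are hypothetical. It makes three assumptions that are not in the question: the file is pretty-printed across several lines as shown in test02.json, so multiLine=True is passed (available from Spark 2.2); testNo is declared as StringType because the sample value "0001" is a quoted string and may not parse as an integer; and every field is marked nullable. If the timestamp columns come back null, the reader's timestampFormat option may need adjusting for your Spark version.

```python
import json
import os

# Sample record mirroring test02.json (with the colon after "test_date").
SAMPLE = {
    "_id": "2d3erf5",
    "testNo": "0001",
    "test_date": {
        "date_1": "2017-09-01 00:00:00.00000",
        "date_2": "2017-09-05 03:00:00.00000",
    },
}


def write_sample(directory):
    """Write SAMPLE as pretty-printed (multi-line) JSON and return the path."""
    path = os.path.join(directory, "test02.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(SAMPLE, fh, indent=2)
    return path


def nested_schema():
    """Build the schema; the nested object becomes a StructType field."""
    # Imported lazily so write_sample() can be used without Spark installed.
    from pyspark.sql.types import (
        StringType, StructField, StructType, TimestampType,
    )

    dates = StructType([
        StructField("date_1", TimestampType(), True),
        StructField("date_2", TimestampType(), True),
    ])
    return StructType([
        StructField("_id", StringType(), True),
        # Assumption: kept as a string because the sample value is quoted.
        StructField("testNo", StringType(), True),
        StructField("test_date", dates, True),
    ])


def load(spark, path):
    """Read one pretty-printed JSON document with the explicit schema."""
    # Assumption: multiLine=True because one record spans several lines.
    return spark.read.json(path, schema=nested_schema(), multiLine=True)
```

With an existing SparkSession named spark, load(spark, write_sample('/tmp')).printSchema() should show test_date as a struct containing date_1 and date_2.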

Use select to pick out specific fields. A nested field is addressed with a dotted path such as 'test_date.date_1'.

test_df = spark.read.json(readFile, schema=testColumn)
x = test_df.select('_id', 'test_date.date_1')
x.show()
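If you want every nested field as its own top-level column, the select above can be generalized. The following is only a sketch: dotted_paths and flatten are hypothetical helper names, and renaming columns by replacing the dot with an underscore is an assumption about the naming you want. It relies on col() resolving dotted paths into struct fields and on alias() to rename the result.

```python
def dotted_paths(record, prefix=""):
    """Return dotted paths to every leaf field of a nested dict."""
    paths = []
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            paths.extend(dotted_paths(value, prefix=f"{name}."))
        else:
            paths.append(name)
    return paths


def flatten(df, paths):
    """Select each dotted path as a top-level column named with underscores."""
    # Imported lazily so dotted_paths() can be used without Spark installed.
    from pyspark.sql.functions import col

    return df.select([col(p).alias(p.replace(".", "_")) for p in paths])
```

For example, flatten(test_df, ['_id', 'test_date.date_1', 'test_date.date_2']) would give columns named _id, test_date_date_1 and test_date_date_2. The path list can be written by hand or derived from one parsed sample record with dotted_paths.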


2022-09-30 15:43

© 2024 OneMinuteCode. All rights reserved.