Is there a way to expand a specific column of a sqlContext DataFrame into separate columns in PySpark?


%pyspark
# Stop the SparkContext the notebook already created before building our own.
sc.stop()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext  # SQLContext lives in pyspark.sql, not pyspark

# "spark://local" is not a valid master URL; use "local[*]" to run locally.
conf = SparkConf().setMaster("local[*]").setAppName("text")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the JSON file and show only the `data` column.
df = sqlContext.read.json("file:///home/test/data/test.json")
df.select(df.data).show()

As in the code above, I select only the data column from the whole DataFrame and show() it:

+-----------------------+
|                   data|
+-----------------------+
|{key: val1, key2: val1}|
|{key: val2, key2: val2}|
|{key: val3, key2: val3}|
|{key: val4, key2: val4}|
+-----------------------+

That is roughly the structure I get. The final structure I want, though, is one where key and key2 become column names, with their values laid out under each column. How can I do this? I would appreciate any pointers on the Python or Spark syntax.

python

2022-09-20 21:49

1 Answer

// Build a sample DataFrame (Scala) whose `data` column holds a JSON string.
val jsonStr = Seq(
  """{"id" : "1", "name": "aaaaa", "addr": "seoul", "data": "{\"column_name1\":\"value1\",\"column_name2\":\"value2\"}"}""",
  """{"id" : "2", "name": "bbbbb", "addr": "pusan", "data": "{\"column_name1\":\"value3\",\"column_name2\":\"value4\"}"}""")
val rddData = spark.sparkContext.parallelize(jsonStr)
val resultDF = spark.read.json(rddData)

// get_json_object extracts one key from a JSON string by its path,
// so each key inside `data` can be aliased to its own column.
resultDF.selectExpr("id", "name", "addr",
  "get_json_object(data, '$.column_name1') as column_name1",
  "get_json_object(data, '$.column_name2') as column_name2").show()

+---+-----+-----+------------+------------+
| id| name| addr|column_name1|column_name2|
+---+-----+-----+------------+------------+
|  1|aaaaa|seoul|      value1|      value2|
|  2|bbbbb|pusan|      value3|      value4|
+---+-----+-----+------------+------------+
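
Since the question asks about PySpark, here is a minimal PySpark sketch of the same approach; the sample rows and column names simply mirror the Scala example above, so adjust them to your actual data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

spark = SparkSession.builder.master("local[*]").appName("text").getOrCreate()

# Same sample rows as the Scala example: `data` is a JSON string inside each record.
json_str = [
    '{"id": "1", "name": "aaaaa", "addr": "seoul", "data": "{\\"column_name1\\":\\"value1\\",\\"column_name2\\":\\"value2\\"}"}',
    '{"id": "2", "name": "bbbbb", "addr": "pusan", "data": "{\\"column_name1\\":\\"value3\\",\\"column_name2\\":\\"value4\\"}"}',
]
df = spark.read.json(spark.sparkContext.parallelize(json_str))

# Pull each key out of the `data` JSON string and alias it as its own column.
df.select(
    "id", "name", "addr",
    get_json_object("data", "$.column_name1").alias("column_name1"),
    get_json_object("data", "$.column_name2").alias("column_name2"),
).show()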


2022-09-20 21:49
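
A supplementary note: if you need many keys or typed columns, from_json with an explicit schema may read better than repeated get_json_object calls, since it parses the string into a struct that d.* expands in one go. A sketch, reusing the df built above (the schema is an assumption based on the example data):

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType

# Schema matching the keys inside the `data` JSON string.
schema = StructType([
    StructField("column_name1", StringType()),
    StructField("column_name2", StringType()),
])

# Parse `data` into a struct column, then expand all of its fields with `.*`.
df.select("id", "name", "addr", from_json("data", schema).alias("d")) \
  .select("id", "name", "addr", "d.*") \
  .show()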


