%pyspark
# Stop the SparkContext that Zeppelin created so a new one can be built below
sc.stop()

from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql.functions import explode
import json

# "spark://local" is not a valid master URL; use "local[*]" to run locally
conf = SparkConf().setMaster("local[*]").setAppName("text")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the JSON file and show only the "data" column
df = sqlContext.read.json("file:///home/test/data/test.json")
df.select(df.data).show()
As in the code above, I select only the data column from the whole DataFrame and call show() on it, which prints something like this:
+-----------------------+
|                   data|
+-----------------------+
|{key: val1, key2: val1}|
|{key: val2, key2: val2}|
|{key: val3, key2: val3}|
|{key: val4, key2: val4}|
+-----------------------+
That is roughly the structure I end up with. What I actually want is to turn key and key2 into column names, with each row's values placed under the matching column. How can I do this? I would appreciate any pointers on the Python or Spark syntax.
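Before reshaping, it helps to check how Spark parsed the data column, since the fix differs for a struct versus a plain JSON string. This is a minimal sketch against the df from the code above (key and key2 are just the placeholder names from the sample output):

# Inspect how Spark parsed the "data" column: struct or plain string
df.printSchema()

# If "data" was inferred as a struct, its fields can be promoted to top-level columns
df.select("data.*").show()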
Here is a Scala example that treats data as a JSON string and extracts each key with get_json_object:
// The "data" field itself contains a JSON string
val jsonStr = Seq(
  """{"id" : "1", "name": "aaaaa", "addr": "seoul", "data": "{\"column_name1\":\"value1\",\"column_name2\":\"value2\"}"}""",
  """{"id" : "2", "name": "bbbbb", "addr": "pusan", "data": "{\"column_name1\":\"value3\",\"column_name2\":\"value4\"}"}""")
val rddData = spark.sparkContext.parallelize(jsonStr)
val resultDF = spark.read.json(rddData)
// get_json_object extracts each key from the JSON string in "data"
resultDF.selectExpr("id", "name", "addr",
  "get_json_object(data, '$.column_name1') as column_name1",
  "get_json_object(data, '$.column_name2') as column_name2").show()
+---+-----+-----+------------+------------+
| id| name| addr|column_name1|column_name2|
+---+-----+-----+------------+------------+
|  1|aaaaa|seoul|      value1|      value2|
|  2|bbbbb|pusan|      value3|      value4|
+---+-----+-----+------------+------------+
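Since the question asked for Python, here is a rough PySpark equivalent of the Scala snippet above; it is only a sketch, and it assumes Spark 2.x or later with a SparkSession available as spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object, col

spark = SparkSession.builder.master("local[*]").appName("text").getOrCreate()

# Same two sample records as the Scala example; "data" holds a JSON string
json_str = [
    '{"id" : "1", "name": "aaaaa", "addr": "seoul", "data": "{\\"column_name1\\":\\"value1\\",\\"column_name2\\":\\"value2\\"}"}',
    '{"id" : "2", "name": "bbbbb", "addr": "pusan", "data": "{\\"column_name1\\":\\"value3\\",\\"column_name2\\":\\"value4\\"}"}',
]
df = spark.read.json(spark.sparkContext.parallelize(json_str))

# get_json_object extracts each key from the JSON string in "data"
df.select(
    "id", "name", "addr",
    get_json_object(col("data"), "$.column_name1").alias("column_name1"),
    get_json_object(col("data"), "$.column_name2").alias("column_name2"),
).show()

If the data column was already inferred as a struct rather than a string, get_json_object is unnecessary and df.select("id", "name", "addr", "data.*") flattens it directly.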