Wednesday, August 22, 2018

Select a column that has a special character in its name in Spark

  • Suppose there is a tab character in the name of a column.
  • Get the schema of the data frame.
    • print(df.schema)
      • StructType(StructField(  _a,LongType,true) ...
    • Notice that there is a tab character before "_a".
  • Save the printed schema to a string variable.
    • val schema = "StructType(StructField(  _a,LongType,true) ..."
  • Add quotes around the column names, so that the printed schema can be pasted back as Scala code.
    • println(schema.replaceAll("StructField\\(([\\s_a-zA-Z0-9]+),", "StructField\\(\"$1\","))
  • Put all StructFields inside a Seq, and rename the column that contains the tab character.
    • StructType(Seq(StructField("bad_a",LongType,true) ...
  • Create a new data frame using the new schema.
    • import org.apache.spark.sql.types._
    • val df2 = spark.createDataFrame(df.rdd, StructType(Seq(StructField("bad_a",LongType,true) ... )))
  • Use the new column name to select it.
    • df2.select("*").where("bad_a is not null").show(false)
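
The steps above can also be done without editing the printed schema by hand. Below is a minimal sketch, assuming a local SparkSession and a made-up one-column data frame. The object name RenameTabColumn and the helper cleanName are hypothetical and not part of the original steps. It rebuilds the schema programmatically, then reuses the same createDataFrame call.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object RenameTabColumn {
  // Hypothetical helper: replace each run of whitespace in a column name
  // with "bad", so that "\t_a" becomes "bad_a" as in the steps above.
  def cleanName(name: String): String = name.replaceAll("\\s+", "bad")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-tab-column")
      .getOrCreate()

    // Made-up sample data: one Long column whose name starts with a tab.
    val badSchema = StructType(Seq(StructField("\t_a", LongType, nullable = true)))
    val rows = spark.sparkContext.parallelize(Seq(Row(1L), Row(null), Row(3L)))
    val df = spark.createDataFrame(rows, badSchema)

    // Copy every field, changing only its name.
    val cleanSchema = StructType(df.schema.fields.map(f => f.copy(name = cleanName(f.name))))

    // Same data, new schema.
    val df2 = spark.createDataFrame(df.rdd, cleanSchema)

    // The renamed column can now be used in a SQL expression;
    // the row with the null value is filtered out.
    df2.select("*").where("bad_a is not null").show(false)

    spark.stop()
  }
}
```

Mapping over df.schema.fields keeps each column's type and nullability unchanged, so only the names differ between df and df2.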
