Renaming the column names of a DataFrame in Spark Scala

Question 1

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. So far I have come up with the following code, which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}

Question 2

If the structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

The simplest thing you can do is to use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
//  |-- id: long (nullable = false)
//  |-- x1: string (nullable = true)
//  |-- x2: string (nullable = true)
//  |-- x3: double (nullable = false)

If you want to rename individual columns, you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

Use it with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
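
Applied to the goal from the question (lowercasing every column name), the foldLeft and toDF approaches can be sketched as follows. This is a minimal illustration, not a definitive implementation: the local SparkSession, the sample data, and the names `viaFold` and `viaToDF` are assumptions made for the example, and the snippet is meant to be pasted into spark-shell or run as a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data with mixed-case column names.
val df = Seq((1L, "a", 3.0)).toDF("Id", "Name", "Score")

// withColumnRenamed returns a new DataFrame instead of modifying df,
// so each result is threaded through foldLeft as the accumulator.
val viaFold: DataFrame = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase)
}

// Alternative: pass the complete list of new names to toDF in one call.
val viaToDF: DataFrame = df.toDF(df.columns.map(_.toLowerCase): _*)

viaFold.columns.foreach(println)
```

The loop in the question has no visible effect because the DataFrame returned by each withColumnRenamed call is discarded; both variants here keep the returned value.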

With nested structures (structs), one possible option is to rename by selecting the whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

import org.apache.spark.sql.functions.struct

@transient val foobarRenamed = struct(
  struct(
    struct(
      // map first -> x and second -> y
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that this may affect the nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

Question 3

For those interested in the PySpark version (it is actually the same in Scala - see the comment below):

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
 |-- merchant_id: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- subcategory: string (nullable = true)
 |-- merchant: string (nullable = true)

Question 4

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
  t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}

In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns of the same name and you want to join them while still being able to disambiguate the columns in the resulting table. It would certainly be nice if there were a similar way to do this in "normal" SQL.
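
As a rough illustration of the join scenario described here, the following sketch prefixes the columns of two small tables before joining them. The local SparkSession, the `left` and `right` tables, and the `l_` / `r_` prefixes are all hypothetical; the helper is repeated only so the snippet is self-contained when pasted into spark-shell or run as a Scala script.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()
import spark.implicits._

// Same helper as in the answer: adds prefix p and suffix s to every column name.
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map(c => t.col(c).as(p + c + s)): _*)

// Hypothetical tables that share the column names "id" and "value".
val left  = Seq((1, "a"), (2, "b")).toDF("id", "value")
val right = Seq((1, "x"), (3, "y")).toDF("id", "value")

val l = aliasAllColumns(left, p = "l_")
val r = aliasAllColumns(right, p = "r_")

// After prefixing, every column in the join result has a unique name.
val joined = l.join(r, l("l_id") === r("r_id"))
joined.printSchema()
```

Because the names no longer collide, later expressions such as `joined.select("l_value", "r_value")` can refer to either side without ambiguity.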

Question 5

Suppose the DataFrame df has 3 columns id1, name1, price1 and you want to rename them to id2, name2, price2.

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I have found this approach useful in many situations.

Question 6

When joining two tables, rename every column except the join key:

// assumes day1 is declared as a var and key holds the name of the join column
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)

// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
    day1 = day1.withColumnRenamed(x, y)
}

It works!

Question 7

Sometimes the column names in a SQL Server or MySQL table are in the format below:

Ex: Account Number, customer number

But Hive tables do not support column names containing spaces, so use the solution below to rename your old column names.

Solution:

val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
// select returns a new DataFrame, so bind the result to a new name
val renamedDf = df.select(renamedColumns: _*)
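
A small end-to-end sketch of this idea, using the two example column names from above. It is a variant rather than the exact code of the answer: the `sanitize` helper, the sample values, and the local SparkSession are assumptions for illustration, and the helper collapses any run of whitespace into a single underscore instead of replacing each space individually.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("sanitize-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: trims the name, replaces each run of whitespace
// with one underscore, and lowercases the result.
def sanitize(name: String): String =
  name.trim.replaceAll("\\s+", "_").toLowerCase

// Hypothetical data using the column names from the example above.
val raw = Seq((1001L, 42L)).toDF("Account Number", "customer number")

// toDF takes the full list of new names, in column order.
val cleaned = raw.toDF(raw.columns.map(sanitize): _*)
cleaned.columns.foreach(println)
```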