TIL: Quick Dataframe Column Rename in Spark
August 31, 2020
[scala]
[spark]
[pyspark]
[python]
[big data]
This is a quick and useful tip I learned recently. It's quite common to
need to apply some renames to the columns of a data frame. Well, it
turns out it's quite straightforward in Spark. The following example
shows how you can replace `.` characters in column names:
```scala
df.toDF(df.columns.map(x => x.replace(".", "_")): _*)
```

The same thing can be achieved in PySpark as follows:

```python
df.toDF(*[c.replace(".", "_") for c in df.columns])
```

I'm using the dot replacement as an example, of course; any other text
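The rename itself is plain string manipulation, so you can see exactly what the comprehension produces without a Spark session. A minimal sketch (the dotted column names are a made-up example):

```python
# Hypothetical column names as they might arrive from a nested source.
columns = ["user.name", "user.address.city", "age"]

# The same transformation the snippets above apply to df.columns:
# replace every dot with an underscore.
renamed = [c.replace(".", "_") for c in columns]

print(renamed)  # ['user_name', 'user_address_city', 'age']
```

Passing such a list to `toDF` simply rebuilds the data frame with the new names, leaving the data untouched.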
transformation can take place here. However, this dot replacement is
particularly useful when you consume data from sources where `.` has no
special meaning. Without the rename, whenever we want to use these
columns we have to quote them with backticks, as in `` `column.name` ``,
so that Spark doesn't interpret the dot as access to a nested column
structure. Of course, another solution here would be casting the data as
a nested column, but that requires another post 😉.
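To see where such dotted names typically come from, here is a small plain-Python sketch (no Spark required; the record and the `flatten` helper are illustrative assumptions, not part of any Spark API) that flattens a nested record into dot-joined keys and then applies the same underscore rename:

```python
def flatten(record, prefix=""):
    # Recursively flatten a nested dict, joining keys with dots --
    # roughly how nested sources end up producing dotted column names.
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

record = {"user": {"name": "Ada", "address": {"city": "London"}}, "age": 36}
flat = flatten(record)
print(sorted(flat))     # ['age', 'user.address.city', 'user.name']

# The same rename as in the Spark snippets above:
renamed = {k.replace(".", "_"): v for k, v in flat.items()}
print(sorted(renamed))  # ['age', 'user_address_city', 'user_name']
```

After the rename, every column can be referenced directly, with no backtick quoting needed.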