
Readable Scala Code in Apache Spark (4 attempts)
Jupyter and Apache Zeppelin are good places to experiment with data. Unfortunately, the nature of notebooks does not encourage organizing code, including its decomposition and readability. We can copy cells into IntelliJ IDEA and build a JAR, but the effect will not be stunning. In this article you will learn how to write more readable Scala Apache Spark code in IntelliJ IDEA.
0. The base code
It is a simple application which:
- downloads groceries data from a file;
- filters fruits;
- normalizes names;
- calculates the quantity of each fruit.
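The original notebook code is not reproduced in this extract. A minimal sketch of such an application might look like the following — the `Grocery` schema, the `groceries.csv` path, and the `normalize` helper are all my assumptions, not the article's actual code:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record shape -- the notebook's real schema is not shown
final case class Grocery(name: String, kind: String)

object GroceriesApp {

  // Pure helper: trim whitespace and lower-case a product name
  def normalize(name: String): String = name.trim.toLowerCase

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groceries")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1. download groceries data from a file (hypothetical path)
    val groceries: Dataset[Grocery] =
      spark.read.option("header", "true").csv("groceries.csv").as[Grocery]

    // 2-4. filter fruits, normalize names, count each fruit -- one long chain
    val fruitCounts = groceries
      .filter(_.kind == "fruit")
      .map(g => normalize(g.name))
      .groupByKey(identity)
      .count()

    fruitCounts.show()
    spark.stop()
  }
}
```

Everything lives in `main` as one chain — exactly the shape notebook code tends to take when pasted into an IDE.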
1. Extract Methods
Let’s use the power of the IDE, more precisely the Extract Method refactoring. It allows you to easily create a method from a selected piece of code. With it, let’s try to create a method corresponding to each step of the application.
It doesn’t work!?
The code in the main method is already more readable… but it does not compile. We want to use SparkSession and spark.implicits._ inside the methods. Unfortunately, these values are not within the scope of the methods.

2. SparkSession overdose
We can fix this by passing SparkSession to every method. Unfortunately, this is a pain in the ass. We also have to import spark.implicits._ every time. I’m too lazy for this solution 😁.
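A sketch of what this variant looks like, assuming the same hypothetical `Grocery` schema — every method drags the session along as a parameter and repeats the implicits import:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

final case class Grocery(name: String, kind: String) // hypothetical schema

object GroceriesApp {

  // Every method takes SparkSession...
  def loadGroceries(path: String)(spark: SparkSession): Dataset[Grocery] = {
    import spark.implicits._ // ...and repeats this import
    spark.read.option("header", "true").csv(path).as[Grocery]
  }

  def filterFruits(groceries: Dataset[Grocery])(spark: SparkSession): Dataset[Grocery] =
    groceries.filter(_.kind == "fruit")

  def countByName(groceries: Dataset[Grocery])(spark: SparkSession): Dataset[(String, Long)] = {
    import spark.implicits._
    groceries.groupByKey(_.name).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("groceries").master("local[*]").getOrCreate()
    // The call site is as noisy as the definitions:
    val counts = countByName(filterFruits(loadGroceries("groceries.csv")(spark))(spark))(spark)
    counts.show()
    spark.stop()
  }
}
```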
3. SparkSession at your service
We need to provide access to SparkSession in a slightly different way. The SparkJob object will help.
Now we can import SparkJob and spark.implicits._ in the application. The code looks better, and we can reuse the methods.
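The article’s SparkJob object itself is not shown in this extract; one common shape for it is a single holder of the session that the application imports once (a sketch under that assumption — the method and file names remain hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Single owner of the session: import it wherever it is needed
object SparkJob {
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("groceries")
    .master("local[*]")
    .getOrCreate()
}

final case class Grocery(name: String, kind: String) // hypothetical schema

object GroceriesApp {
  import SparkJob.spark
  import spark.implicits._ // imported once for the whole object

  def loadGroceries(path: String): Dataset[Grocery] =
    spark.read.option("header", "true").csv(path).as[Grocery]

  def filterFruits(groceries: Dataset[Grocery]): Dataset[Grocery] =
    groceries.filter(_.kind == "fruit")

  def countByName(groceries: Dataset[Grocery]): Dataset[(String, Long)] =
    groceries.groupByKey(_.name).count()

  def main(args: Array[String]): Unit = {
    countByName(filterFruits(loadGroceries("groceries.csv"))).show()
    spark.stop()
  }
}
```

The session parameter and the repeated imports are gone, and each method is a reusable unit.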
4. Implicit class / Extension method
I have written a lot of C# code in my life. An interesting and useful concept there is the Extension Method. It allows you to “add” methods to an existing type/class without modifying it. Below is an example. Instead of writing
int numberA = 1;
int numberB = 2;
int sum = Sum(numberA, numberB);
...
public int Sum(int numberA, int numberB)
{
    return numberA + numberB;
}
We can write
int numberA = 1;
int numberB = 2;
int sum = numberA.Add(numberB);
...
public static int Add(this int numberA, int numberB)
{
    return numberA + numberB;
}
The difference in readability can be seen in the following example:
Sum(A, Sum(B, Sum(C, Sum(D, ...))))
// VS
A.Add(B).Add(C).Add(D)...
In Scala we can get a similar mechanism using an implicit class. Below is the reorganized logic of the reviewed Apache Spark application.
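The reorganized code itself is missing from this extract. A sketch of the idea — an implicit class that “adds” the pipeline steps to Dataset[Grocery] — with the schema, the SparkJob holder, and all method names being my assumptions:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Session holder in the spirit of step 3 (hypothetical)
object SparkJob {
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("groceries")
    .master("local[*]")
    .getOrCreate()
}
import SparkJob.spark
import spark.implicits._

final case class Grocery(name: String, kind: String) // hypothetical schema

object GrocerySyntax {
  // "Adds" domain methods to Dataset[Grocery] without modifying it
  implicit class GroceryOps(private val ds: Dataset[Grocery]) extends AnyVal {
    def filterFruits: Dataset[Grocery] =
      ds.filter(_.kind == "fruit")

    def normalizeNames: Dataset[Grocery] =
      ds.map(g => g.copy(name = g.name.trim.toLowerCase))

    def countByName: Dataset[(String, Long)] =
      ds.groupByKey(_.name).count()
  }
}

object GroceriesApp {
  import GrocerySyntax._

  def main(args: Array[String]): Unit = {
    val groceries = spark.read.option("header", "true")
      .csv("groceries.csv").as[Grocery]

    // Reads like the original requirements list:
    groceries.filterFruits.normalizeNames.countByName.show()
    spark.stop()
  }
}
```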
Application logic has moved to another object and the code can be read like prose.
Let’s go back to what the application was supposed to do:
- download groceries data from a file
- filter fruits
- normalize names
- calculate the quantity of each fruit
Maybe not word for word, but you know what it is about 😁.
EDIT: Dataset transform
While the previous approach is cool, it can sometimes be misleading. To separate the business code from the base class, we can use Dataset.transform. You will find details in this article from MungingData.
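Dataset.transform lets the steps stay as plain functions while still chaining fluently. A sketch, again with a hypothetical schema and file name:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

final case class Grocery(name: String, kind: String) // hypothetical schema

object GroceriesApp {

  // Plain functions from Dataset to Dataset -- no implicit class needed
  def filterFruits(ds: Dataset[Grocery]): Dataset[Grocery] =
    ds.filter(_.kind == "fruit")

  def normalizeNames(spark: SparkSession)(ds: Dataset[Grocery]): Dataset[Grocery] = {
    import spark.implicits._
    ds.map(g => g.copy(name = g.name.trim.toLowerCase))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("groceries").master("local[*]").getOrCreate()
    import spark.implicits._

    val groceries = spark.read.option("header", "true").csv("groceries.csv").as[Grocery]

    // transform chains the functions without "pimping" Dataset itself
    val fruitCounts = groceries
      .transform(filterFruits)
      .transform(normalizeNames(spark))
      .groupByKey(_.name)
      .count()

    fruitCounts.show()
    spark.stop()
  }
}
```

Here it is obvious at the call site that these are ordinary functions, not members of Dataset — which is exactly what avoids the “misleading” feel of the implicit-class version.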
Repository
Please share what you think about this in the comments section. What is your way of making code readable?