
Readable Scala Code in Apache Spark (4 attempts)

Jupyter and Apache Zeppelin are good places to experiment with data. Unfortunately, the nature of notebooks does not encourage organizing code, including its decomposition and readability. We can copy the cells into IntelliJ IDEA and build a JAR, but the effect will not be stunning. In this article you will learn how to write more readable Scala Apache Spark code in IntelliJ IDEA.

0. The base code

It is a simple application which:

  • downloads groceries data from a file;
  • filters fruits;
  • normalizes names;
  • calculates the quantity of each fruit.
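As a rough sketch (the file groceries.csv, the name column and the fruit list are illustrative assumptions, not the original code), the whole thing crammed into one main method could look like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, trim}

object GroceriesApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groceries")
      .master("local[*]")
      .getOrCreate()

    val fruits = Seq("apple", "banana", "orange")

    // everything in one place: read, filter, normalize, aggregate
    val fruitCounts = spark.read
      .option("header", "true")
      .csv("groceries.csv")
      .filter(lower(col("name")).isin(fruits: _*))   // keep only fruits
      .withColumn("name", lower(trim(col("name"))))  // normalize names
      .groupBy("name")                               // quantity of each fruit
      .count()

    fruitCounts.show()
    spark.stop()
  }
}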

1. Extract Methods

Let’s use the power of the IDE, more precisely the Extract Method refactoring. It allows you to easily create a method from a selected piece of code. Let’s try to create a method corresponding to each step of the application.

It doesn’t work!?

The code in the main method is already more readable… but it does not work. We want to use SparkSession and spark.implicits._ inside the methods, but unfortunately these values are not in the methods’ scope.
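To illustrate the problem (the method names here are hypothetical), the extracted methods end up looking roughly like this, and the marked lines do not compile:

// inside the application object, next to main:
// spark lives only in main, so the extracted methods cannot see it or its implicits
def readGroceries(path: String): DataFrame =
  spark.read.option("header", "true").csv(path)                // error: not found: value spark

def filterFruits(groceries: DataFrame): DataFrame =
  groceries.filter($"name".isin("apple", "banana", "orange"))  // error: value $ is not a member of StringContext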


2. SparkSession overdose

We can fix this by passing SparkSession to every method. Unfortunately, this is a pain in the ass. We also have to import spark.implicits._ every time. I’m too lazy for this solution 😁.
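A sketch of this approach (method names are illustrative): every method takes the session as a parameter and re-imports the implicits it needs.

import org.apache.spark.sql.{DataFrame, SparkSession}

// inside the application object:
def readGroceries(spark: SparkSession, path: String): DataFrame =
  spark.read.option("header", "true").csv(path)

def filterFruits(spark: SparkSession, groceries: DataFrame): DataFrame = {
  import spark.implicits._  // repeated in every method that needs $ or encoders
  groceries.filter($"name".isin("apple", "banana", "orange"))
}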

3. SparkSession at your service

We need to provide access to SparkSession in a slightly different way. The SparkJob object will help.
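A minimal sketch of such an object (the builder settings are illustrative):

import org.apache.spark.sql.SparkSession

object SparkJob {
  // one lazily created session, reachable from anywhere in the job
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("groceries")
    .master("local[*]")
    .getOrCreate()
}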

Now we can import SparkJob and spark.implicits._ in the application. The code looks better. We can reuse the methods.
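With such an object in place the methods can simply import what they need instead of receiving the session as a parameter (again, names are illustrative):

import org.apache.spark.sql.DataFrame

object GroceriesLogic {
  import SparkJob.spark
  import spark.implicits._

  def readGroceries(path: String): DataFrame =
    spark.read.option("header", "true").csv(path)

  def filterFruits(groceries: DataFrame): DataFrame =
    groceries.filter($"name".isin("apple", "banana", "orange"))
}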

4. Implicit class / Extension method

I have written a lot of C# code in my life. An interesting and useful concept there is the extension method. It allows you to “add” methods to an existing type/class without modifying it. Below is an example. Instead of writing

int numberA = 1;
int numberB = 2;
int sum = Sum(numberA, numberB);
...
public int Sum(int numberA, int numberB)
{
    return numberA + numberB;
}

We can write

int numberA = 1;
int numberB = 2;
int sum = numberA.Add(numberB);
...
public static int Add(this int numberA, int numberB)
{
    return numberA + numberB;
}

The difference in readability can be seen in the following example

Sum(A, Sum(B, Sum(C, Sum(D, ...))))
// VS
A.Add(B).Add(C).Add(D)...

In Scala we can get a similar mechanism using an implicit class. Below is the reorganized logic of the reviewed Apache Spark application.
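A sketch of this reorganization (the object, method and column names are my assumptions, not the original code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower, trim}

object GroceriesTransformations {

  // "adds" domain methods to any DataFrame wherever this implicit class is imported
  implicit class GroceriesDataFrame(df: DataFrame) {

    def filterFruits(fruits: Seq[String]): DataFrame =
      df.filter(lower(col("name")).isin(fruits: _*))

    def normalizeNames(): DataFrame =
      df.withColumn("name", lower(trim(col("name"))))

    def countFruits(): DataFrame =
      df.groupBy("name").count()
  }
}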

Application logic has moved to another object and the code can be read like prose.
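The main method then reduces to a chain of those methods (a sketch under the same assumptions):

// inside main:
import GroceriesTransformations._
import SparkJob.spark

val fruitCounts = spark.read
  .option("header", "true")
  .csv("groceries.csv")
  .filterFruits(Seq("apple", "banana", "orange"))
  .normalizeNames()
  .countFruits()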

Let’s go back to what the application was supposed to do:

  • download groceries data from a file
  • filter fruits
  • normalize names
  • calculate the quantity of each fruit

Maybe not word for word, but you know what it is about 😁.

EDIT: Dataset transform

While the previous way is cool, it can sometimes be misleading. To separate the business code from the base class, we can use Dataset.transform. You will find details in this article from MungingData.
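A sketch of the transform-based variant, with standalone functions instead of an implicit class (names are again illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower, trim}

// inside the application object:
def filterFruits(fruits: Seq[String])(df: DataFrame): DataFrame =
  df.filter(lower(col("name")).isin(fruits: _*))

def normalizeNames(df: DataFrame): DataFrame =
  df.withColumn("name", lower(trim(col("name"))))

def countFruits(df: DataFrame): DataFrame =
  df.groupBy("name").count()

val groceries = SparkJob.spark.read.option("header", "true").csv("groceries.csv")

val fruitCounts = groceries
  .transform(filterFruits(Seq("apple", "banana", "orange")))
  .transform(normalizeNames)
  .transform(countFruits)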

Repository

Please share what you think about this in the comment section. What is your way of making code readable?

