Image for post
Image for post

It is said that Apache Airflow is CRON on steroids. It is gaining popularity among tools for ETL orchestration (Scheduling, managing and monitoring tasks). The tasks are defined as Directed Acyclic Graph (DAG), in which they exchange information. In the entry you will learn how to use Variables and XCom in Apache Airflow.

The Environment

In case of Apache Airflow, the puckel/docker-airflow version works well. Most often I use docker-compose-LocalExecutor.yml variant.

sudo docker-compose -f docker-compose-LocalExecutor.yml up -d

Why Variables and XCom?

Variables and XCom are like variables used within the Apache Airflow environment.

Variables are a kind of global variable. If a value is used by many DAGs (and you don’t want to edit N files if you change it), consider adding it to Variables. …

Image for post
Image for post

Jupyter and Apache Zeppelin is a good place to experiment with data. Unfortunately, the specifics of notebooks do not encourage to organize the code, including its decomposition and readability. We can copy cells to Intellij IDEA and build JAR, but the effect will not be stunning. We can copy cells to Intellij IDEA and build JAR, but the effect will not be stunning. In this article you will learn how to make more readable Scala Apache Spark code in Intellij IDEA.

0. The base code

It is a simple application which:

  • downloads groceries data from a file;
  • filters fruits;
  • normalizes names;
  • calculates the quantity of each fruit. …

Image for post
Image for post

Twitter data can be obtained in many ways, but who wants to write the code 😉. Especially one that will work 24/7. In Elastic Stack, you can easily collect and analyze data from Twitter. Logstash has input to collect tweets. Kafka Connect discussed in the previous story also has this option, but Logstash can send data to many sources (including Apache Kafka) and is easier to use.

In the article:

  • Saving a tweet stream to Elasticsearch in Logstash
  • Visualizations in Kibana (Xbox vs PlayStation)
  • Removing HTML tags for the keyword with a standardization mechanism

Elastic Stack Environment

All the necessary components are contained in one docker-compose. If you already have an Elasticsearch cluster, you just need Logstash. …

Image for post
Image for post

Kafka Connect is part of the Apache Kafka platform. It is used to connect Kafka with external services such as file systems and databases. In this story you will learn what problem it solves and how to run it.

Why Kafka Connect?

Apache Kafka is used in microservices architecture, log aggregation, Change data capture (CDC), integration, streaming platform and data acquisition layer to Data Lake. Whatever you use Kafka for, data flows from the source and goes to the sink.

It takes time and knowledge to properly implement a Kafka’s consumer or producer. The point is that the inputs and outputs often repeat themselves. Many companies pull data from Kafka to HDFS/S3 and Elasticsearch. …

Image for post
Image for post

I recorded a video in which I talk about the advantages of NoSQL databases. The response was interesting, but I had the impression that not everyone sees the two sides of the coin. The facts are that they can cause us a lot of problems 😉.

Schema Management

Each NoSQL database approaches the schema in its own way. In some there is no schema (MongoDB), in some, it is dynamic (Elasticsearch), and in some it resembles the one from relational databases (Cassandra). In the conceptual model, data ALWAYS have a pattern. Entities, fields, names, types, relations. …

Image for post
Image for post

In Apache Spark/PySpark we use abstractions and the actual processing is done only when we want to materialize the result of the operation. To connect to different databases and file systems we use mostly ready-made libraries. In this story you will learn how to combine data with MySQL and MongoDB and then save it in Apache Cassandra.


The ideal moment to use Docker, or more precisely Docker Compose. We will run all databases and Jupyter with Apache Spark.

# Use root/example as user/password credentials
version: '3.1'

image: jupyter/all-spark-notebook
- 8888:8888
- 4040:4040
- ./work:/home/jovyan/work

image: 'bitnami/cassandra:latest'

image: mongo

image: mysql:5.7 …

Image for post
Image for post

This is a continuation of the previous story. This time we will look at the Detections tab in Elastic SIEM. Our goal is to automate IOC detection using proven rules. Let’s remind: We installed Elasticsearch + Kibana on one of the VMs. We monitor an Ubuntu (Auditbeat, Filebeat, Packetbeat) and Windows 10 VM (Winlogbeat), although in this story we will focus on the Windows.

Image for post
Image for post

IT environments are becoming increasingly large, distributed and difficult to manage. All system components must be protected and monitored against cyber threats. You need a scalable platform that can store and analyze logs, metrics and events. SIEM solutions can cost a lot of money. In this story we will take a look at the free solution available in Elastic Stack, which is Elastic SIEM.

What will we use?

Elastic Stack is a set of components: Elasticsearch, Kibana, Logstash and Beats. Brief information about what is used in this story:

  • Elasticsearch — document database/search engine
  • Kibana —Data visualization dashboard for Elasticsearch
  • Filebeat — lightweight log collector (available…

Image for post
Image for post

Apache Spark is one of the most popular platforms for distributed data processing and analysis. Although it is associated with a server farm, Hadoop and cloud technologies, you can successfully launch it on your machine. In this entry you will learn several ways to configure the Apache Spark development environment.


The base system in this case is Ubuntu Desktop 20.04 LTS.


The first way is to run Spark in the terminal. Let’s start by downloading Apache Spark. You can download it here. After downloading, we have to unpack the package with tar.

tar zxvf spark-3.0.0-bin-hadoop3.2.tgz

Apache Spark is written in Scala, which means that we need a Java Virtual Machine (JVM). For Spark 3.0 it will be Java 11. …

Image for post
Image for post

Apache Cassandra is a specific database that scales linearly. This has its price: specific table modelling, configurable consistency and limited analytics. Apple performs millions of operations per second on over 160,000 Cassandra instances while collecting over 100 PBs of data. You can bypass these limited analytics with the Apache Spark and the DataStax connector, and that’s what the story is about.


I’ve used one Apache Cassandra node on Docker

version: '3'

image: cassandra:latest
- "9042:9042"

Apache Spark 3.0 is launched as shell with connector and Cassandra’s client library, which will be useful for timeuuid type conversion.

./spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta,com.datastax.cassandra:cassandra-driver-core:3.9.0 …


Maciej Szymczyk

Software Developer, Big Data Engineer, Blogger (, Amateur Cyclists & Triathlete, @maciej_szymczyk

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store