

Have you ever thought about creating a stream from database operations? In this story, you will learn what Change Data Capture is and how to use it when planning your system architecture. In the practical part, we will see Debezium in action.

What is Change Data Capture?

Change Data Capture is the process of detecting changes made to a database. The changes can then be streamed to and integrated with other databases and systems. In other words: we receive a stream of events from our database.

This allows us to make faster and more accurate decisions based on data (Stream Processing and Streaming ETL). It…
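As a taste of the practical part: a Debezium connector is just a JSON configuration registered with Kafka Connect's REST API. Below is a minimal sketch in Python; the hostnames, credentials, connector name and table setup are assumptions for a typical Docker playground, not values from this article.

# A minimal sketch of registering a Debezium MySQL connector with
# Kafka Connect's REST API. All names and credentials below are
# assumptions -- adjust them to your environment.
import requests

connector = {
    "name": "inventory-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",         # assumed Docker service name
        "database.port": "3306",
        "database.user": "debezium",          # assumed CDC user
        "database.password": "dbz",
        "database.server.id": "184054",       # must be unique in the MySQL cluster
        "database.server.name": "dbserver1",  # logical name, prefixes topic names
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}

# Kafka Connect listens on port 8083 by default.
response = requests.post("http://localhost:8083/connectors", json=connector)
print(response.status_code, response.json())

From that moment, every insert, update and delete in the watched tables shows up as an event on a Kafka topic.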


It is said that Apache Airflow is CRON on steroids. It is gaining popularity among tools for ETL orchestration (scheduling, managing, and monitoring tasks). The tasks are defined as a Directed Acyclic Graph (DAG), in which they exchange information. In this entry you will learn how to use Variables and XCom in Apache Airflow.

The Environment

For Apache Airflow, the puckel/docker-airflow image works well. Most often I use the docker-compose-LocalExecutor.yml variant.

sudo docker-compose -f docker-compose-LocalExecutor.yml up -d

Why Variables and XCom?

Variables and XCom act as variables within the Apache Airflow environment.

Variables are a kind of global variable. If a value is used by…
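To make the difference concrete, here is a minimal DAG sketch (Airflow 1.10.x import paths, as used by puckel/docker-airflow); the Variable key, task names and values are made up for illustration.

# A minimal sketch: read a global Variable, pass a value between
# tasks via XCom. Key and task logic are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator  # 1.10.x path


def produce(**context):
    # Variables are global key-value pairs stored in the metadata database.
    environment = Variable.get("environment", default_var="dev")
    # Pushing to XCom makes the value available to downstream tasks.
    context["ti"].xcom_push(key="greeting", value="hello from " + environment)


def consume(**context):
    # Pull the value pushed by the upstream task.
    greeting = context["ti"].xcom_pull(task_ids="produce", key="greeting")
    print(greeting)


with DAG(
    dag_id="variables_and_xcom_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    produce_task = PythonOperator(
        task_id="produce", python_callable=produce, provide_context=True
    )
    consume_task = PythonOperator(
        task_id="consume", python_callable=consume, provide_context=True
    )
    produce_task >> consume_task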


Jupyter and Apache Zeppelin are good places to experiment with data. Unfortunately, the nature of notebooks does not encourage organizing code, including its decomposition and readability. We can copy cells into IntelliJ IDEA and build a JAR, but the effect will not be stunning. In this article you will learn how to make Scala Apache Spark code more readable in IntelliJ IDEA.

0. The base code

It is a simple application which:

  • loads grocery data from a file;
  • filters fruits;
  • normalizes names;
  • calculates the quantity…


Twitter data can be obtained in many ways, but who wants to write the code 😉, especially code that has to run 24/7? With the Elastic Stack, you can easily collect and analyze data from Twitter. Logstash has an input plugin for collecting tweets. Kafka Connect, discussed in the previous story, also has this option, but Logstash can send data to many destinations (including Apache Kafka) and is easier to use.

In the article:

  • Saving a tweet stream to Elasticsearch in Logstash
  • Visualizations in Kibana (Xbox vs PlayStation)
  • Removing HTML tags from the keyword field with a normalization mechanism

Elastic Stack Environment

All the necessary components are contained…
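For a taste of how simple the first point is, here is a minimal sketch of such a pipeline; the credentials are placeholders for your own Twitter developer keys, and the Elasticsearch host and index name are assumptions.

# A minimal Logstash pipeline sketch: tweet stream -> Elasticsearch.
# Credentials, host and index are placeholders/assumptions.
input {
  twitter {
    consumer_key       => "YOUR_CONSUMER_KEY"
    consumer_secret    => "YOUR_CONSUMER_SECRET"
    oauth_token        => "YOUR_ACCESS_TOKEN"
    oauth_token_secret => "YOUR_ACCESS_TOKEN_SECRET"
    keywords           => ["xbox", "playstation"]
    full_tweet         => true
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "tweets"
  }
}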


Kafka Connect is part of the Apache Kafka platform. It is used to connect Kafka with external services such as file systems and databases. In this story you will learn what problem it solves and how to run it.

Why Kafka Connect?

Apache Kafka is used in microservice architectures, log aggregation, Change Data Capture (CDC), integration, streaming platforms, and as a data acquisition layer for a Data Lake. Whatever you use Kafka for, data flows from a source and goes to a sink.

It takes time and knowledge to properly implement a Kafka consumer or producer. The point is that the inputs and outputs often repeat…
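To illustrate how little code is involved, here is the classic file-source example from the Kafka quickstart: a short properties file instead of a hand-written producer. Paths and names are illustrative.

# connect-file-source.properties -- stream lines of a file to a topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test

In standalone mode it can be launched with connect-standalone.sh config/connect-standalone.properties connect-file-source.properties.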


I recorded a video in which I talk about the advantages of NoSQL databases. The response was interesting, but I had the impression that not everyone sees both sides of the coin. The fact is that they can also cause us a lot of problems 😉.

Schema Management

Each NoSQL database approaches the schema in its own way. In some there is no schema (MongoDB), in some it is dynamic (Elasticsearch), and in some it resembles the one from relational databases (Cassandra). In the conceptual model, data ALWAYS has a schema: entities, fields, names, types, relations. …


In Apache Spark/PySpark we work on abstractions, and the actual processing happens only when we want to materialize the result of an operation. To connect to different databases and file systems, we mostly use ready-made libraries. In this story you will learn how to combine data from MySQL and MongoDB and then save it in Apache Cassandra.

Environment

This is the ideal moment to use Docker, or more precisely, Docker Compose. We will run all the databases plus Jupyter with Apache Spark.

# Use root/example as user/password credentials
version: '3.1'
services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    volumes:
      - ./work:/home/jovyan/work
  cassandra:
    image…
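Once the containers are up, the join itself is a short script. A minimal sketch, assuming the MySQL JDBC driver, the MongoDB Spark connector and the Spark Cassandra connector are on the classpath (e.g. via spark.jars.packages); the table, collection, keyspace and column names are made up for illustration.

# A minimal PySpark sketch: MySQL + MongoDB -> join -> Cassandra.
# Hostnames match assumed Docker Compose service names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-mongo-to-cassandra").getOrCreate()

# Read a table from MySQL over JDBC.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/shop")  # assumed service/db names
    .option("dbtable", "customers")
    .option("user", "root")
    .option("password", "example")  # credentials from the compose file
    .load()
)

# Read a collection from MongoDB (mongo-spark-connector).
orders = (
    spark.read.format("mongo")
    .option("uri", "mongodb://mongo/shop.orders")  # assumed URI
    .load()
)

# Join the two sources; processing stays lazy until the write below.
joined = customers.join(orders, on="customer_id", how="inner")

# Write the result to Cassandra (spark-cassandra-connector).
(
    joined.write.format("org.apache.spark.sql.cassandra")
    .options(table="customer_orders", keyspace="shop")
    .mode("append")
    .save()
)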


This is a continuation of the previous story. This time we will look at the Detections tab in Elastic SIEM. Our goal is to automate IOC detection using proven rules. As a reminder: we installed Elasticsearch + Kibana on one of the VMs. We monitor an Ubuntu VM (Auditbeat, Filebeat, Packetbeat) and a Windows 10 VM (Winlogbeat), although in this story we will focus on the Windows machine.


IT environments are becoming increasingly large, distributed, and difficult to manage. All system components must be protected against cyber threats and monitored. You need a scalable platform that can store and analyze logs, metrics, and events. SIEM solutions can cost a lot of money. In this story we will take a look at the free solution available in the Elastic Stack: Elastic SIEM.

What will we use?

Elastic Stack is a set of components: Elasticsearch, Kibana, Logstash and Beats. Brief information about what is used in this story:

  • Elasticsearch — document database/search engine
  • Kibana — data visualization dashboard for Elasticsearch
  • Filebeat — lightweight log…

Maciej Szymczyk

Software Developer, Big Data Engineer, Blogger (https://wiadrodanych.pl), Amateur Cyclist & Triathlete, @maciej_szymczyk
