It is said that Apache Airflow is CRON on steroids. It is gaining popularity among tools for ETL orchestration (Scheduling, managing and monitoring tasks). The tasks are defined as Directed Acyclic Graph (DAG), in which they exchange information. In the entry you will learn how to use Variables and XCom in Apache Airflow.
In case of Apache Airflow, the puckel/docker-airflow version works well. Most often I use
sudo docker-compose -f docker-compose-LocalExecutor.yml up -d
Variables and XCom are like variables used within the Apache Airflow environment.
Variables are a kind of global variable. If a value is used by many DAGs (and you don’t want to edit N files if you change it), consider adding it to Variables. …
Jupyter and Apache Zeppelin is a good place to experiment with data. Unfortunately, the specifics of notebooks do not encourage to organize the code, including its decomposition and readability. We can copy cells to Intellij IDEA and build JAR, but the effect will not be stunning. We can copy cells to Intellij IDEA and build JAR, but the effect will not be stunning. In this article you will learn how to make more readable Scala Apache Spark code in Intellij IDEA.
It is a simple application which:
Twitter data can be obtained in many ways, but who wants to write the code 😉. Especially one that will work 24/7. In Elastic Stack, you can easily collect and analyze data from Twitter. Logstash has input to collect tweets. Kafka Connect discussed in the previous story also has this option, but Logstash can send data to many sources (including Apache Kafka) and is easier to use.
In the article:
All the necessary components are contained in one docker-compose. If you already have an Elasticsearch cluster, you just need Logstash. …
Kafka Connect is part of the Apache Kafka platform. It is used to connect Kafka with external services such as file systems and databases. In this story you will learn what problem it solves and how to run it.
Apache Kafka is used in microservices architecture, log aggregation, Change data capture (CDC), integration, streaming platform and data acquisition layer to Data Lake. Whatever you use Kafka for, data flows from the source and goes to the sink.
It takes time and knowledge to properly implement a Kafka’s consumer or producer. The point is that the inputs and outputs often repeat themselves. Many companies pull data from Kafka to HDFS/S3 and Elasticsearch. …
I recorded a video in which I talk about the advantages of NoSQL databases. The response was interesting, but I had the impression that not everyone sees the two sides of the coin. The facts are that they can cause us a lot of problems 😉.
Each NoSQL database approaches the schema in its own way. In some there is no schema (MongoDB), in some, it is dynamic (Elasticsearch), and in some it resembles the one from relational databases (Cassandra). In the conceptual model, data ALWAYS have a pattern. Entities, fields, names, types, relations. …
In Apache Spark/PySpark we use abstractions and the actual processing is done only when we want to materialize the result of the operation. To connect to different databases and file systems we use mostly ready-made libraries. In this story you will learn how to combine data with MySQL and MongoDB and then save it in Apache Cassandra.
The ideal moment to use Docker, or more precisely Docker Compose. We will run all databases and Jupyter with Apache Spark.
# Use root/example as user/password credentials
image: mysql:5.7 …
This is a continuation of the previous story. This time we will look at the Detections tab in Elastic SIEM. Our goal is to automate IOC detection using proven rules. Let’s remind: We installed Elasticsearch + Kibana on one of the VMs. We monitor an Ubuntu (Auditbeat, Filebeat, Packetbeat) and Windows 10 VM (Winlogbeat), although in this story we will focus on the Windows.
IT environments are becoming increasingly large, distributed and difficult to manage. All system components must be protected and monitored against cyber threats. You need a scalable platform that can store and analyze logs, metrics and events. SIEM solutions can cost a lot of money. In this story we will take a look at the free solution available in Elastic Stack, which is Elastic SIEM.
Elastic Stack is a set of components: Elasticsearch, Kibana, Logstash and Beats. Brief information about what is used in this story:
Apache Spark is one of the most popular platforms for distributed data processing and analysis. Although it is associated with a server farm, Hadoop and cloud technologies, you can successfully launch it on your machine. In this entry you will learn several ways to configure the Apache Spark development environment.
The base system in this case is Ubuntu Desktop 20.04 LTS.
The first way is to run Spark in the terminal. Let’s start by downloading Apache Spark. You can download it here. After downloading, we have to unpack the package with tar.
tar zxvf spark-3.0.0-bin-hadoop3.2.tgz
Apache Spark is written in Scala, which means that we need a Java Virtual Machine (JVM). For Spark 3.0 it will be Java 11. …
Apache Cassandra is a specific database that scales linearly. This has its price: specific table modelling, configurable consistency and limited analytics. Apple performs millions of operations per second on over 160,000 Cassandra instances while collecting over 100 PBs of data. You can bypass these limited analytics with the Apache Spark and the DataStax connector, and that’s what the story is about.
I’ve used one Apache Cassandra node on Docker
Apache Spark 3.0 is launched as shell with connector and Cassandra’s client library, which will be useful for timeuuid type conversion.
./spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta,com.datastax.cassandra:cassandra-driver-core:3.9.0 …