
Big Data without Hadoop/HDFS? MinIO tested on Jupyter + PySpark

Maciej Szymczyk
Python in Plain English


The takeover of Hortonworks by Cloudera ended the free distribution of Hadoop. As a result, many people are looking for alternative solutions, and in some sectors cloud services are not an option.

MinIO is distributed object storage that implements the AWS S3 API. It can be deployed in on-premises environments and is ready for Kubernetes, which makes it an interesting alternative to HDFS and the rest of the Hadoop ecosystem. Kubernetes itself is also becoming an increasingly interesting alternative to YARN as a resource manager for Apache Spark. In this story, we will spin up a local MinIO cluster with docker-compose and perform several operations on it from Spark.
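Because MinIO speaks the S3 wire protocol, a stock AWS SDK client works against it unchanged; only the endpoint needs to be overridden. A minimal sketch, assuming boto3 is installed; the endpoint, bucket name, file name and secret key below are placeholders, not values from this article:

```python
# Any S3 client can talk to MinIO by pointing it at the local endpoint.
MINIO_ENDPOINT = "http://localhost:9001"  # host port mapped in docker-compose


def make_s3_client(access_key, secret_key, endpoint=MINIO_ENDPOINT):
    """Build a regular boto3 S3 client aimed at MinIO (no MinIO-specific code)."""
    import boto3  # standard AWS SDK for Python

    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


# Usage against a running cluster (bucket/file names are hypothetical):
# s3 = make_s3_client("minio", "<secret-key>")
# s3.create_bucket(Bucket="test-bucket")
# s3.upload_file("data.csv", "test-bucket", "data.csv")
```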

Environment

The docker-compose file consists of a Jupyter notebook container and 4 MinIO nodes.

version: '3'
services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    environment:
      - PYSPARK_SUBMIT_ARGS=--packages com.amazonaws:aws-java-sdk-bundle:1.11.819,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell
    volumes:
      - ./work:/home/jovyan/work
  minio1:
    image: minio/minio:RELEASE.2020-07-02T00-15-09Z
    volumes:
      - ./minio/data1-1:/data1
      - ./minio/data1-2:/data2
    ports:
      - "9001:9000"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY…
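With the containers up, Spark in the notebook can be pointed at MinIO through Hadoop's s3a connector. A minimal sketch, assuming the compose file above; the bucket and file names are hypothetical and the secret key is left as a placeholder:

```python
# s3a settings for MinIO. path.style.access must be enabled because MinIO
# serves buckets under the endpoint path, not under per-bucket DNS names.
s3a_conf = {
    "spark.hadoop.fs.s3a.endpoint": "http://minio1:9000",  # service name on the compose network
    "spark.hadoop.fs.s3a.access.key": "minio",
    "spark.hadoop.fs.s3a.secret.key": "<secret-key>",      # placeholder
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}


def build_spark(conf=s3a_conf):
    """Build a SparkSession wired to MinIO (pyspark ships with the notebook image)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("minio-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage in the notebook, against the running cluster:
# spark = build_spark()
# df = spark.read.csv("s3a://test-bucket/data.csv", header=True)
# df.write.parquet("s3a://test-bucket/data.parquet")
```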
