
Big Data without Hadoop/HDFS? MinIO tested on Jupyter + PySpark
Cloudera's takeover of Hortonworks ended the free distribution of Hadoop, so many teams are looking for alternative solutions, and cloud offerings are not an option in some environments.
MinIO is distributed object storage implementing the AWS S3 API. It can be deployed on-premises and is ready for Kubernetes, which makes it an interesting alternative to HDFS and the rest of the Hadoop ecosystem. Meanwhile, Kubernetes itself is becoming an increasingly interesting alternative to YARN for running Apache Spark. In this story, we will stand up a local MinIO with docker-compose and perform several operations on it from Spark.
Environment
The Docker Compose setup consists of a Jupyter notebook container and four MinIO nodes.
version: '3'
services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    environment:
      - PYSPARK_SUBMIT_ARGS=--packages com.amazonaws:aws-java-sdk-bundle:1.11.819,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell
    volumes:
      - ./work:/home/jovyan/work
  minio1:
    image: minio/minio:RELEASE.2020-07-02T00-15-09Z
    volumes:
      - ./minio/data1-1:/data1
      - ./minio/data1-2:/data2
    ports:
      - "9001:9000"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY…
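
Once the containers are up, Spark reaches MinIO through the Hadoop S3A connector pulled in by PYSPARK_SUBMIT_ARGS above. A minimal sketch of the session configuration, run from the notebook container: the endpoint and access key match the compose file, while the secret key (elided above) and the bucket name "test" are placeholders you would substitute with your own values.

```python
from pyspark.sql import SparkSession

# Point the S3A connector at the local MinIO service.
# "minio1:9000" is the service name and internal port from docker-compose;
# the secret key and the "test" bucket are placeholders for this sketch.
spark = (
    SparkSession.builder
    .appName("minio-test")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio1:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "<your-secret-key>")
    # MinIO is addressed by path, not by virtual-hosted bucket names
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Round-trip a small DataFrame through the object store.
spark.range(10).write.mode("overwrite").parquet("s3a://test/numbers")
print(spark.read.parquet("s3a://test/numbers").count())
```

With path-style access enabled, objects are addressed as http://minio1:9000/test/numbers rather than as a bucket subdomain, which is what a local MinIO deployment expects.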