Big Data without Hadoop/HDFS? MinIO tested on Jupyter + PySpark

Maciej Szymczyk
Python in Plain English

Cloudera's takeover of Hortonworks ended the free distribution of Hadoop, so many people are looking for alternatives. In some areas, cloud solutions are not an option.

MinIO is a distributed object store that implements the AWS S3 API. It can be deployed on-premises and is designed to run on Kubernetes, which makes it an interesting alternative to HDFS and the rest of the Hadoop ecosystem. Kubernetes, in turn, is becoming an increasingly interesting alternative to YARN for running Apache Spark.

In this story, we will run MinIO locally with docker-compose and perform several operations in Spark.

Environment

The docker-compose file consists of a Jupyter notebook and 4 MinIO nodes.

version: '3'
services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    environment:
      - PYSPARK_SUBMIT_ARGS=--packages com.amazonaws:aws-java-sdk-bundle:1.11.819,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell
    volumes:
      - ./work:/home/jovyan/work
  minio1:
    image: minio/minio:RELEASE.2020-07-02T00-15-09Z
    volumes:
      - ./minio/data1-1:/data1
      - ./minio/data1-2:/data2
    ports:
      - "9001:9000"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY…
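
The compose file only adds the hadoop-aws and AWS SDK packages to Spark; the S3A connector still has to be pointed at MinIO instead of AWS. The following is a minimal sketch of how that might look, not necessarily the configuration used in the rest of the story. It assumes the notebook reaches the first node at http://minio1:9000 (the service name and container port from the compose file). The helper names `minio_s3a_conf` and `build_session`, the application name, and reading the secret from a `MINIO_SECRET_KEY` environment variable are hypothetical choices made for illustration; the `fs.s3a.*` keys are standard Hadoop S3A options.

```python
import os


def minio_s3a_conf(endpoint, access_key, secret_key):
    """Return Spark settings that point the S3A connector at a MinIO endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is normally addressed as host/bucket rather than bucket.host
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # The local compose setup is assumed to serve plain HTTP
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def build_session(conf, app_name="minio-demo"):
    """Create a SparkSession with the given settings (hypothetical helper)."""
    # Imported here so the settings can be inspected without Spark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = minio_s3a_conf(
    endpoint="http://minio1:9000",  # assumption: compose service name and port
    access_key="minio",             # MINIO_ACCESS_KEY from the compose file
    # Assumption: the secret is supplied via an environment variable
    secret_key=os.environ.get("MINIO_SECRET_KEY", "<your-secret-key>"),
)

# Inside the notebook container:
# spark = build_session(conf)
```

Keeping the settings in a plain dictionary makes it easy to print them and check them before starting Spark, which helps when an s3a:// read fails on the first attempt.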


Responses (1)

Hello, I get an error in part #1:
ratings = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://bucket1/movielens/ratings.csv"))
How can I solve it, please?
Error:
Py4JJavaError: An error occurred while calling o38.csv.