Big Data without Hadoop/HDFS? MinIO tested on Jupyter + PySpark

Maciej Szymczyk

Published in

Python in Plain English

6 min read · Jul 15, 2020

Cloudera's takeover of Hortonworks ended the free distribution of Hadoop, so many people are looking for alternatives. In some sectors, cloud solutions are not an option.

MinIO is a distributed storage system that implements the AWS S3 API. It can be deployed on-premises and is designed to run on Kubernetes. It is an interesting alternative to HDFS-based environments and the rest of the Hadoop ecosystem. Finally, Kubernetes is becoming an increasingly attractive alternative to YARN for running Apache Spark.

In this story, we will run MinIO locally with docker-compose and perform several operations in Spark.

Environment

The docker-compose file consists of a Jupyter notebook and four MinIO nodes.

version: '3'
services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    environment:
      - PYSPARK_SUBMIT_ARGS=--packages com.amazonaws:aws-java-sdk-bundle:1.11.819,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell
    volumes:
      - ./work:/home/jovyan/work
  minio1:
    image: minio/minio:RELEASE.2020-07-02T00-15-09Z
    volumes:
      - ./minio/data1-1:/data1
      - ./minio/data1-2:/data2
    ports:
      - "9001:9000"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY…
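
The compose file passes the `hadoop-aws` and AWS SDK packages to PySpark, which suggests that Spark reaches MinIO through the Hadoop S3A connector. As a minimal sketch of what the notebook side might look like, the helper below collects the S3A settings that point Spark at an S3-compatible endpoint. The function names, the endpoint `http://minio1:9000` (derived from the service name and container port above), and the secret key placeholder are assumptions for illustration, not the author's code.

```python
def minio_s3a_options(endpoint, access_key, secret_key, use_ssl=False):
    """Return Hadoop S3A settings (as Spark config keys) for an S3-compatible store.

    Hypothetical helper; the keys are standard Hadoop S3A properties
    prefixed with "spark.hadoop." so Spark forwards them to Hadoop.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Address buckets as http://host:port/bucket instead of bucket.host
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def build_session(options, app_name="minio-demo"):
    """Create a SparkSession with the given options (requires pyspark)."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumed values: the endpoint follows the "minio1" service on port 9000;
# replace the placeholder with the secret key from your own compose file.
options = minio_s3a_options(
    endpoint="http://minio1:9000",
    access_key="minio",
    secret_key="<your-secret-key>",
)
```

Inside the notebook container, `spark = build_session(options)` would then allow reading paths of the form `s3a://<bucket>/<key>`, provided the bucket exists and the credentials match the MinIO configuration.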

Written by Maciej Szymczyk

Software Developer, Big Data Engineer, Blogger (https://wiadrodanych.pl), Amateur Cyclist & Triathlete, @maciej_szymczyk

Responses (1)

Stefentaime

Dec 1, 2022

Hello, I get an error with part #1:
ratings = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://bucket1/movielens/ratings.csv"))
How can I solve it, please?
Error:
Py4JJavaError: An error occurred while calling o38.csv.
…
