
How To Clean Data with Python Pandas — Vehicles registered in Poland

Maciej Szymczyk
5 min read · Mar 19, 2023

Thanks to the Open Data project, datasets published by Polish public entities are available to everyone. In this article, we will prepare and clean one of them, "Vehicles registered in Poland by province", using Python and Pandas.

Data

The data is available in CSV and XLSX (Excel) formats. A first look reveals the following problems:

  1. Inconsistent names of voivodeships — upper and lower case letters
  2. Inconsistent number of columns — TERYT column not present everywhere
  3. Inconsistent column names regarding months — mixed data forms + typos

[Figure: the number and names of columns differ]
[Figure: typo]
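
The three problems listed above could be handled along the following lines. This is only a minimal sketch on made-up sample frames: the column names (`WOJEWODZTWO`, `STYCZEN`, `styczeń`), the alias mapping, and the target names (`voivodeship`, `teryt`, `january`) are assumptions for illustration, not the names used in the actual files.

```python
import pandas as pd

# Hypothetical sample frames that mimic the three problems; the real files'
# column names and values may differ.
frame_a = pd.DataFrame({
    "WOJEWODZTWO": ["MAZOWIECKIE", "śląskie"],
    "TERYT": ["14", "24"],
    "STYCZEN": [100, 200],
})
frame_b = pd.DataFrame({
    "Wojewodztwo": ["Mazowieckie", "ŚLĄSKIE"],
    "styczeń ": [110, 210],  # no TERYT column, different month spelling
})

# Assumed mapping from observed (stripped, lower-cased) column names
# to one canonical name.
COLUMN_ALIASES = {
    "wojewodztwo": "voivodeship",
    "teryt": "teryt",
    "styczen": "january",
    "styczeń": "january",
}

def clean(df):
    """Return a copy with unified column names and lower-cased voivodeships."""
    out = df.copy()
    normalized = [c.strip().lower() for c in out.columns]
    out.columns = [COLUMN_ALIASES.get(c, c) for c in normalized]
    out["voivodeship"] = out["voivodeship"].str.strip().str.lower()
    if "teryt" not in out.columns:
        out["teryt"] = pd.NA  # keep the column set identical across files
    return out[["voivodeship", "teryt", "january"]]

cleaned = pd.concat([clean(frame_a), clean(frame_b)], ignore_index=True)
print(cleaned)
```

Normalizing the column names before anything else means every later step can rely on one schema, whichever file a row came from.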

Data Loading

Of the two available formats, I chose xlsx, for no specific reason. The following code snippet builds a list of paths to all files of this type.

import os

# Walk the data directory recursively and collect the xlsx paths
xlsx_paths = []
for path, subdirs, files in os.walk(r'..\data'):
    for name in files…
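
Since the snippet above is cut off, here is a self-contained sketch of the same idea. It is not the author's original code: the helper name `find_xlsx_paths` and the case-insensitive extension check are assumptions made for illustration.

```python
import os

def find_xlsx_paths(root):
    """Walk `root` recursively and return the sorted paths of all .xlsx files."""
    paths = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            # Compare in lower case so that "FILE.XLSX" is matched as well.
            if name.lower().endswith(".xlsx"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

Each returned path could then be passed to `pandas.read_excel`, which needs an Excel engine such as `openpyxl` installed to read xlsx files.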

Written by Maciej Szymczyk

Software Developer, Big Data Engineer, Blogger (https://wiadrodanych.pl), Amateur Cyclist & Triathlete, @maciej_szymczyk
