
pyjanitor: A Cleaner API for Cleaning Data

Eric J. Ma
Novartis Institutes for Biomedical Research

Zachary Barry
Novartis Institutes for Biomedical Research

Sam Zuckerman

Zachary Sailer
Jupyter Project

Abstract

The pandas library has become the de facto library for data wrangling in the Python programming language. However, inconsistencies in the pandas application programming interface (API), while idiomatic due to historical use, prevent the use of expressive, fluent programming idioms that enable self-documenting pandas code. Here, we introduce pyjanitor, an open source Python package that extends the pandas API with such idioms. We describe the design and implementation of the package, provide usage examples from a variety of domains, and discuss the ways in which the pyjanitor project has enabled the inclusion of first-time contributors to open source projects.

Keywords

data engineering, data science, data cleaning

DOI

10.25080/Majora-7ddc1dd1-007
