pandera: Statistical Data Validation of Pandas Dataframes

Niels Bantilan

doi:10.25080/Majora-342d178e-010

Niels Bantilan
Talkspace
pyOpenSci

Abstract

pandas is an essential tool in the data scientist’s toolkit for modern data engineering, analysis, and modeling in the Python ecosystem. However, dataframes can often be difficult to reason about in terms of their data types and statistical properties as data is reshaped from its raw form to one that’s ready for analysis. Here, I introduce pandera, an open source package that provides a flexible and expressive data validation API designed to make it easy for data wranglers to define dataframe schemas. These schemas execute logical and statistical assertions at runtime so that analysts can spend less time worrying about the correctness of their dataframes and more time obtaining insights and training models.

Keywords

data validation, data engineering
