
Pandera: Going Beyond Pandas Data Validation

Niels Bantilan
Union.ai
pyOpenSci

Abstract

Data quality remains a core concern for practitioners in machine learning, data science, and data engineering, and many specialized packages have emerged to meet the need to validate and monitor data and models. However, as the open source community creates new data processing frameworks, notably highly performant entrants such as Polars, existing data quality frameworks must catch up to support them, and in some cases the broader Python community creates new data validation libraries for these frameworks. This paper outlines the motivation and challenges that took pandera from a pandas-only data validation framework (Bantilan, 2020) to one that is extensible to other dataframe-like libraries that are not pandas-compliant. It also provides a case study of the technical and organizational challenges associated with expanding the scope of a library beyond its original boundaries.

Keywords

data validation, data testing, data science, machine learning, data engineering

DOI

10.25080/gerudo-f2bc6f59-010

