Natural Language Processing with Pandas DataFrames

Frederick Reiss; Bryan Cutler; Zachary Eichenberger

doi:10.25080/majora-1b6fd038-006

Frederick Reiss
IBM Research

Bryan Cutler
IBM

Zachary Eichenberger
University of Michigan
IBM Research

Abstract

Most areas of Python data science have standardized on using Pandas DataFrames for representing and manipulating structured data in memory. Natural Language Processing (NLP), not so much.

We believe that Pandas has the potential to serve as a universal data structure for NLP data. DataFrames could make every phase of NLP easier, from creating new models, to evaluating their effectiveness, to building applications that integrate those models. However, Pandas currently lacks important data types and operations for representing and manipulating crucial types of data in many of these NLP tasks.

This paper describes Text Extensions for Pandas, a library of extensions to Pandas that make it possible to build end-to-end NLP applications while representing all of the applications' internal data with DataFrames. We leverage the extension points built into the Pandas library to add new data types, and we provide important NLP-specific operations over these data types and integrations with popular NLP libraries and data formats.

Keywords

natural language processing, Pandas, DataFrames
