Natural Language Processing with Pandas DataFrames
Frederick Reiss
Bryan Cutler
Zachary Eichenberger
Most areas of Python data science have standardized on using Pandas
DataFrames for representing and manipulating structured data in memory.
Natural Language Processing (NLP), not so much.
We believe that Pandas has the potential to serve as a universal data
structure for NLP data. DataFrames could make every phase of NLP easier,
from creating new models, to evaluating their effectiveness, to building
applications that integrate those models. However, Pandas currently lacks
important data types and operations for representing and manipulating
crucial types of data in many of these NLP tasks.
This paper describes Text Extensions for Pandas, a library of extensions
to Pandas that make it possible to build end-to-end NLP applications while
representing all of the applications' internal data with DataFrames.
We leverage the extension points built into the Pandas library to add new data
types, and we provide important NLP-specific operations over these data
types, as well as integrations with popular NLP libraries and data formats.
Keywords: natural language processing, Pandas, DataFrames
DOI: 10.25080/majora-1b6fd038-006