OpenWest 2014/Python Pandas
A Brief Tour of the Python Pandas Package
- by Matt Harrison (@__mharrison__)
"Python is gaining popularity among Data Scientists. One reason is the Pandas package, which provides facilities for data manipulation. It has facilities similar to Excel, SQL, ETL packages, and more."
pandas: Python Data Analysis Library - http://pandas.pydata.org/
NumPy - http://www.numpy.org/ - NumPy is the fundamental package for scientific computing with Python.
SciPy - http://www.scipy.org/ - SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering
Pandas - http://pandas.pydata.org/ - Python Data Analysis Library - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Matt Harrison - http://hairysun.com
- co-chair Utah Python.
Impetus - if this were a Perl class it would be about regexes. Pandas is the weapon of choice for dealing with tabular data in Python.
Pandas is "A nosql in-memory db using Python, that has SQL-like constructs" - Matt's view
- note: adopts many NumPy-isms that may not look like pure Python
Based on the data frame (tabular data) idea borrowed from R. A data frame is similar to a table in SQL.
Pandas is best for small to medium data, not "Big Data".
Not really good from an ETL perspective - star schema
- ETL = Extract, Transform, Load - taking data from one system to another
- Data warehousing
Data Structures:
- Series (1D)
- TimeSeries (1D) - special Series
- DataFrame (2D)
- Panel (3D) - like stacked DataFrames
Series:
# python version
ser = {
    'index': [0, 1, 2],
    'data': [.5, .6, .7],
    'name': 'growth',
}
# pandas version
import pandas as pd
ser = pd.Series([.5, .6, .7], name='growth')
Behaves like NumPy array:
ser[1]
ser.mean()
Boolean Array
ser > ser.median()
0    False
1    False
2     True
Filtering:
ser[ser > ser.median()]
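The Series steps above, put together as one runnable sketch. It assumes the same three values as the notes and the default 0, 1, 2 index; the variable names are only for illustration.

```python
import pandas as pd

# Same values as in the notes; the default index is 0, 1, 2.
ser = pd.Series([0.5, 0.6, 0.7], name='growth')

second = ser[1]            # lookup by index label
average = ser.mean()       # NumPy-style reduction

mask = ser > ser.median()  # boolean Series, one True/False per row
above = ser[mask]          # keeps only the rows where mask is True

print(mask.tolist())
print(above.tolist())
```

The comparison builds a boolean Series first, and indexing with it does the filtering, which is why the two steps can be collapsed into `ser[ser > ser.median()]`.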
DataFrames - Tables with columns as Series
# python version, but not a true Pandas DataFrame
df = {
    'index': [0, 1, 2],
    'cols': [
        {'name': 'growth', 'data': [.5, .6, 1.2]},
        {'name': 'Name', 'data': ["paul", "george", "ringo"]},
    ],
}
# pandas version
df = pd.DataFrame({
    'growth': [.5, .7, 1.2],
    'Name': ['paul', 'george', 'ringo'],
})
Create a DataFrame from: rows (list of dicts), columns (dict of lists), a CSV file ***, or a NumPy ndarray slurped up directly
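A minimal sketch of the four ways in. The CSV text is inlined through `io.StringIO` so the example is self-contained; with a real file you would pass a path instead. The column names `test1` and `test2` for the ndarray case are assumptions for illustration.

```python
import io

import numpy as np
import pandas as pd

# From rows: a list of dicts, one dict per row.
rows = [{'Name': 'paul', 'growth': 0.5},
        {'Name': 'george', 'growth': 0.7},
        {'Name': 'ringo', 'growth': 1.2}]
df_rows = pd.DataFrame(rows)

# From columns: a dict of lists, one list per column.
df_cols = pd.DataFrame({'Name': ['paul', 'george', 'ringo'],
                        'growth': [0.5, 0.7, 1.2]})

# From CSV: read_csv accepts a path or any file-like object.
csv_text = "Name,growth\npaul,0.5\ngeorge,0.7\nringo,1.2\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# From a NumPy ndarray: column names have to be supplied separately.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_arr = pd.DataFrame(arr, columns=['test1', 'test2'])
```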
Two Axes:
- axis 0 - index
- axis 1 - columns
df.axes[0] or df.index
df.axes[1] or df.columns
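The axis numbers matter mostly because many methods take an `axis` argument. A small sketch, assuming a made-up frame with two numeric columns:

```python
import pandas as pd

# Hypothetical scores, used only to show the two axes.
df = pd.DataFrame({'test1': [70, 80, 90],
                   'test2': [75, 85, 95]})

row_labels = df.axes[0]      # same labels as df.index
col_labels = df.axes[1]      # same labels as df.columns

per_column = df.sum(axis=0)  # collapse along the index: one total per column
per_row = df.sum(axis=1)     # collapse along the columns: one total per row
```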
Examine:
df.columns
df.describe()
df.to_string()
df.test1  # or df['test1']; pandas makes a magic attribute for each column
df.test1.median()
df.test1.corr(df.test2)  # correlation: 1 if the data move in the same direction, 0 if unrelated, -1 if opposite
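A runnable sketch of those inspection calls. The frame and its `test1`/`test2` values are invented for illustration; `test2` is just `test1 + 5`, so the correlation comes out at 1.

```python
import pandas as pd

# Invented data: test2 is test1 shifted by 5, i.e. perfectly correlated.
df = pd.DataFrame({'fname': ['paul', 'george', 'ringo'],
                   'test1': [70, 80, 90],
                   'test2': [75, 85, 95]})

summary = df.describe()            # count/mean/std/min/quartiles/max, numeric columns only
text = df.to_string()              # the whole table as one string
med = df.test1.median()            # attribute access, same as df['test1'].median()
r_same = df.test1.corr(df.test2)   # moves together
r_opp = df.test1.corr(-df.test2)   # moves in opposite directions

print(summary)
```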
Tweaking Data
- note: most pandas operations return a new object rather than changing the original in place
- add row
df = pd.concat([df, other_df])  # other_df: a DataFrame holding the new row(s)
- add column
df['test3'] = 0  # note: df.test3 = 0 does not add a column!
- add column with function
def name_grade(val):
    ...
df['test4'] = df.fname.apply(name_grade)
- remove column
t3 = df.pop('test3')  # or del df['test3']
# note: del df.test3 does not work!
- rename column
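The tweaks in this list, strung together as one sketch. The body of `name_grade` is a stand-in (the notes do not record the real rule), and the new column name `name_len` is likewise made up; `rename` is the usual way to rename a column.

```python
import pandas as pd

df = pd.DataFrame({'fname': ['paul', 'george'],
                   'test1': [70, 80]})

# Add a row: concat returns a new DataFrame, so rebind the name.
new_row = pd.DataFrame({'fname': ['ringo'], 'test1': [90]})
df = pd.concat([df, new_row], ignore_index=True)

# Add a column with a constant value (needs [] assignment).
df['test3'] = 0

# Add a column computed by a function.
def name_grade(val):
    # Stand-in body for illustration: just the length of the name.
    return len(val)

df['test4'] = df.fname.apply(name_grade)

# Remove a column; pop hands back the removed Series.
t3 = df.pop('test3')

# Rename a column; rename also returns a new DataFrame.
df = df.rename(columns={'test4': 'name_len'})
```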
Fill (fillna) - statistics ignore NaN, so if you want missing values to count as zero, fill them in first.
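A small sketch of the difference, using invented scores with one missing value:

```python
import numpy as np
import pandas as pd

scores = pd.Series([90.0, np.nan, 70.0], name='test1')

mean_skipping_nan = scores.mean()  # NaN is ignored: (90 + 70) / 2
filled = scores.fillna(0)          # new Series with NaN replaced by 0
mean_with_zero = filled.mean()     # the zero now counts: (90 + 0 + 70) / 3
```

Whether a missing score should count as zero or be left out is a judgment about the data, not something pandas decides for you.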
Install Pandas: (what worked for me)
# yum install gcc-c++
pip install pandas
Pivoting - Pivot Tables
print(pd.pivot_table(..rules..))
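The notes elide the actual arguments, so here is one possible shape for them, on invented data: one row per name, one column per term, cells holding the mean score. The column names `term` and `score` are assumptions.

```python
import pandas as pd

# Invented long-format data: one row per (name, term) observation.
df = pd.DataFrame({'fname': ['paul', 'paul', 'ringo', 'ringo'],
                   'term': ['fall', 'spring', 'fall', 'spring'],
                   'score': [70, 80, 90, 100]})

# Rows come from 'fname', columns from 'term', cell values from 'score'.
table = pd.pivot_table(df, index='fname', columns='term',
                       values='score', aggfunc='mean')

print(table)
```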
Serialization
- dump to CSV, etc
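A minimal CSV round trip as a sketch. Calling `to_csv` without a path returns the text instead of writing a file, which keeps the example self-contained; pass a filename to write to disk.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['paul', 'george', 'ringo'],
                   'growth': [0.5, 0.7, 1.2]})

# index=False leaves the 0, 1, 2 row labels out of the output.
csv_text = df.to_csv(index=False)

# Read it back to check that nothing was lost.
restored = pd.read_csv(io.StringIO(csv_text))
```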
Plotting
- box plot, etc...
Clipping
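The notes give no detail here; assuming this refers to `clip`, which limits values to a range, a sketch with invented readings looks like this:

```python
import pandas as pd

# Invented readings with outliers at both ends.
readings = pd.Series([-5, 3, 12])

# Values below 0 become 0, values above 10 become 10.
clipped = readings.clip(lower=0, upper=10)
```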
GPS example.