OpenWest 2014/Python Pandas

From Omnia
Revision as of 01:35, 12 May 2014 by Kenneth (talk | contribs)

A Brief Tour of the Python Pandas Package

by Matt Harrison (@__mharrison__)

"Python is gaining popularity among Data Scientists. One reason is the Pandas package, which provides facilities for data manipulation. It has facilities similar to Excel, SQL, ETL packages, and more."


pandas: Python Data Analysis Library - http://pandas.pydata.org/


NumPy - http://www.numpy.org/ - NumPy is the fundamental package for scientific computing with Python.

SciPy - http://www.scipy.org/ - SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

Pandas - http://pandas.pydata.org/ - Python Data Analysis Library - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


Matt Harrison - http://hairysun.com

  • co-chair Utah Python.


Impetus - if this were a Perl class it would be about regexes. Pandas is the weapon of choice for dealing with tabular data in Python.


Pandas is "A nosql in-memory db using Python, that has SQL-like constructs" - Matt's view

  • note: pandas adopts many NumPy-isms that may not look like pure Python

Based on the data frame (tabular data) idea borrowed from R. A data frame is similar to a table in SQL.

Pandas is best for small to medium data, not "Big Data".


Not really good from an ETL perspective - star schema

  • Extract Transform Load - take data from one system to another
  • Data warehousing


Data Structures:

  • Series (1D)
  • TimeSeries (1D) - special Series
  • DataFrame (2D)
  • Panel (3D) - like stacked DataFrames


Series:

# python version
ser = {
  'index':[0,1,2],
  'data':[.5,.6,.7],
  'name':'growth',
}
# pandas version
import pandas as pd
ser = pd.Series([.5,.6,.7], name='growth')

Behaves like NumPy array:

ser[1]
ser.mean()

Boolean Array

ser > ser.median()
  0 False
  1 False
  2 True

Filtering:

ser[ser > ser.median()]
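
The Series operations above can be tried end to end. This is a minimal sketch reusing the 'growth' values from these notes; the variable names (second, average, mask, above) are only for illustration.

```python
import pandas as pd

# Same values as the 'growth' Series in the notes.
ser = pd.Series([.5, .6, .7], name='growth')

second = ser[1]             # lookup by label on the default 0, 1, 2 index
average = ser.mean()        # NumPy-style aggregate
mask = ser > ser.median()   # boolean Series, one True/False per row
above = ser[mask]           # keeps only the rows where the mask is True

print(mask.tolist())
print(above.tolist())
```

The filter is just the boolean Series passed back into the square brackets, so any comparison that yields a boolean Series can be used the same way.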


DataFrames - Tables with columns as Series

# plain Python version: a dict sketch of the same data, not a true pandas DataFrame
df = {
  'index':[0,1,2],
  'cols':[
    { 'name':'growth',
      'data':[.5,.6,1.2] },
    { 'name':'Name',
      'data':["paul","george", "ringo"] },
   ]
}
# pandas version
import pandas as pd
df = pd.DataFrame({
  'growth':[.5,.7,1.2],
  'Name':['paul','george','ringo'] })

A DataFrame can be created from:

  • rows (list of dicts)
  • columns (dict of lists)
  • csv file ***
  • a NumPy ndarray, slurped up directly
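
A rough sketch of the four ways of building a DataFrame, using made-up sample data. The CSV text is read from an in-memory io.StringIO buffer as a stand-in for a real file, and the column names 'a' and 'b' are assumptions for the example.

```python
import io

import numpy as np
import pandas as pd

# From rows: a list of dicts, one dict per row.
rows = [{'Name': 'paul', 'growth': .5}, {'Name': 'george', 'growth': .7}]
from_rows = pd.DataFrame(rows)

# From columns: a dict of lists, one list per column.
from_cols = pd.DataFrame({'Name': ['paul', 'george'], 'growth': [.5, .7]})

# From CSV: read_csv accepts a path or a file-like object.
csv_text = "Name,growth\npaul,0.5\ngeorge,0.7\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# From a NumPy ndarray: column names have to be supplied separately.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

print(from_csv)
```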

Two Axes:

  • axis 0 - index (rows)
  • axis 1 - columns
df.axes[0] or df.index
df.axes[1] or df.columns
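
The axis numbers also show up as the axis argument of many methods. A small sketch, assuming a numeric frame with the test1/test2 column names used elsewhere in these notes (the scores are invented):

```python
import pandas as pd

df = pd.DataFrame({'test1': [1, 2, 3], 'test2': [10, 20, 30]})

row_labels = df.axes[0]      # same object type and labels as df.index
col_labels = df.axes[1]      # same labels as df.columns

down = df.sum(axis=0)        # collapse the rows: one total per column
across = df.sum(axis=1)      # collapse the columns: one total per row

print(down.tolist(), across.tolist())
```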

Examine:

df.columns
df.describe()
df.to_string()
df.test1 # or df['test1']; pandas exposes each column as an attribute
df.test1.median()
df.test1.corr(df.test2)  # correlation: 1 if the data move in the same direction, 0 if unrelated, -1 if opposite
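
A runnable version of the calls above, as a sketch. The scores are invented so that test2 rises with test1 and the hypothetical column rev falls as test1 rises, which makes the correlation values easy to predict.

```python
import pandas as pd

df = pd.DataFrame({
    'test1': [1, 2, 3, 4],
    'test2': [2, 4, 6, 8],   # moves in the same direction as test1
    'rev':   [4, 3, 2, 1],   # hypothetical column moving the opposite way
})

summary = df.describe()      # count, mean, std, min, quartiles, max per column
text = df.to_string()        # the whole frame as one printable string
median = df.test1.median()
same = df.test1.corr(df.test2)
opposite = df.test1.corr(df.rev)

print(summary)
print(median, same, opposite)
```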

Tweaking Data

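  • note: most pandas operations return a new object rather than changing the original in place
  • add row
df = pd.concat([df, new_rows])  # new_rows: another DataFrame with the same columns
  • add column
df['test3'] = 0
# note: df.test3 = 3 does not work for a new column!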
  • add column with function
def name_grade(val):
  ..
df['test4'] = df.fname.apply(name_grade)
  • remove column
t3 = df.pop('test3')  # or del df['test3']   # note: del df.test3 does not work!
  • rename column
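
The tweaks in the list above, strung together as one sketch. The data is invented, and since the body of name_grade was not captured in these notes, the version here (length of the name) is purely a hypothetical stand-in. The rename step uses DataFrame.rename with a columns mapping, and name_len is an assumed new column name.

```python
import pandas as pd

df = pd.DataFrame({'fname': ['paul', 'george'], 'test1': [90, 65]})

# Add a row: concat returns a new DataFrame, so reassign the result.
new_rows = pd.DataFrame({'fname': ['ringo'], 'test1': [75]})
df = pd.concat([df, new_rows], ignore_index=True)

# Add a column with a constant value.
df['test3'] = 0

# Add a column computed by a function applied to each value.
def name_grade(val):
    return len(val)  # hypothetical rule, not the one from the talk

df['test4'] = df.fname.apply(name_grade)

# Remove a column; pop hands back the removed Series.
t3 = df.pop('test3')

# Rename a column: map old name to new name.
df = df.rename(columns={'test4': 'name_len'})

print(df)
```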

Fill - statistics ignore NaN, so use fillna if you want missing values counted as zero.
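
A small sketch of the difference, with invented scores: the mean skips the NaN by default, while filling with zero first pulls the mean down.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1.0, np.nan, 3.0])

mean_skipping = scores.mean()           # NaN ignored: (1 + 3) / 2
mean_filled = scores.fillna(0).mean()   # NaN counted as 0: (1 + 0 + 3) / 3

print(mean_skipping, mean_filled)
```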


Install Pandas: (what worked for me)

# yum install gcc-c++
pip install pandas


Pivoting - Pivot Tables

print(pd.pivot_table(..rules..))
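
The actual rules used in the talk were not captured here, so the following is only a sketch with an invented long-format table (one row per name and test) pivoted so that each test becomes its own column.

```python
import pandas as pd

df = pd.DataFrame({
    'name':  ['paul', 'paul', 'george', 'george'],
    'test':  ['test1', 'test2', 'test1', 'test2'],
    'score': [90, 80, 70, 60],
})

# One row per name, one column per test, cells hold the mean score.
table = pd.pivot_table(df, index='name', columns='test',
                       values='score', aggfunc='mean')

print(table)
```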


Serialization

  • dump to CSV, etc
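
A minimal CSV round trip, as a sketch with invented data. Calling to_csv without a path returns the CSV text as a string; passing a filename instead would write a file.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['paul', 'george', 'ringo'], 'test1': [90, 65, 75]})

csv_text = df.to_csv(index=False)              # index=False leaves out the row labels
restored = pd.read_csv(io.StringIO(csv_text))  # read the same text back in

print(csv_text)
```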

Plotting

  • box plot, etc...

Clipping
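
Assuming this refers to the clip method, a quick sketch with invented values: anything below the lower bound or above the upper bound is replaced by that bound.

```python
import pandas as pd

ser = pd.Series([-5, 20, 150])

clipped = ser.clip(lower=0, upper=100)  # values forced into the 0..100 range

print(clipped.tolist())
```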


GPS example.