OpenWest 2014/Python Pandas: Difference between revisions

Latest revision as of 01:35, 12 May 2014

A Brief Tour of the Python Pandas Package

by Matt Harrison (@__mharrison__)

"Python is gaining popularity among Data Scientists. One reason is the Pandas package, which provides facilities for data manipulation. It has facilities similar to Excel, SQL, ETL packages, and more."

pandas: Python Data Analysis Library - http://pandas.pydata.org/

NumPy - http://www.numpy.org/ - NumPy is the fundamental package for scientific computing with Python.

SciPy - http://www.scipy.org/ - SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering

Pandas - http://pandas.pydata.org/ - Python Data Analysis Library - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Matt Harrison - http://hairysun.com

Co-chair of Utah Python.

Impetus - if this were a Perl class it would be about regexes. Pandas is the weapon of choice for dealing with tabular data in Python.

Pandas is "a NoSQL in-memory db using Python, that has SQL-like constructs" - Matt's view

Note: Pandas adopts many NumPy-isms that may not look like pure Python.

Based on the data frame (tabular data) concept borrowed from R. A data frame is similar to a table in SQL.

Pandas is best for small to medium data, not "Big Data".

Not really a good fit from an ETL perspective (e.g. star schemas).

ETL: Extract, Transform, Load - taking data from one system to another
Data warehousing

Data Structures:

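Series (1D)
TimeSeries (1D) - special Series with a datetime index (later pandas releases fold this into Series)
DataFrame (2D)
Panel (3D) - like stacked DataFrames (removed in later pandas releases)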

Series:

# python version
ser = {
  'index':[0,1,2],
  'data':[.5,.6,.7],
  'name':'growth',
}

# pandas version
import pandas as pd
ser = pd.Series([.5,.6,.7], name='growth')

Behaves like NumPy array:

ser[1]
ser.mean()

Boolean Array

ser > ser.median()
  0 False
  1 False
  2 True

Filtering:

ser[ser > ser.median()]
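
The boolean array and filtering steps above can be sketched end to end as follows. This reuses the growth Series from these notes; the names mask and above are only illustrative.

```python
import pandas as pd

ser = pd.Series([.5, .6, .7], name='growth')

# Comparing a Series to a scalar gives a boolean Series with the same index.
mask = ser > ser.median()

# Indexing with the mask keeps only the rows where the mask is True.
above = ser[mask]
print(above)
```

Only the last value is strictly greater than the median, so a single row survives the filter.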

DataFrames - Tables with columns as Series

# python version, but not a true Pandas DataFrame
df = {
  'index': [0, 1, 2],
  'cols': [
    { 'name': 'growth',
      'data': [.5, .6, 1.2] },
    { 'name': 'Name',
      'data': ["paul", "george", "ringo"] },
  ]
}

# pandas version
df = pd.DataFrame({
  'growth': [.5, .6, 1.2],
  'Name': ['paul', 'george', 'ringo'] })

Create a DataFrame from: rows (a list of dicts), columns (a dict of lists), a CSV file ***, or directly from a NumPy ndarray.
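
A minimal sketch of the four sources listed above. The data reuses the example from these notes; the variable names, the in-memory CSV text, and the small ndarray are assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd

# From rows: a list of dicts, one dict per row.
df_rows = pd.DataFrame([
    {'Name': 'paul', 'growth': .5},
    {'Name': 'george', 'growth': .6},
    {'Name': 'ringo', 'growth': 1.2},
])

# From columns: a dict of lists, one list per column.
df_cols = pd.DataFrame({
    'Name': ['paul', 'george', 'ringo'],
    'growth': [.5, .6, 1.2],
})

# From CSV: read_csv accepts a path or any file-like object.
csv_text = "Name,growth\npaul,0.5\ngeorge,0.6\nringo,1.2\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# From a NumPy ndarray: column names are supplied separately.
df_arr = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])
```

In practice read_csv would usually be given a file path; the StringIO wrapper just keeps the sketch self-contained.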

Two Axes:

axis 0 - index
axis 1 - columns

df.axes[0] or df.index
df.axes[1] or df.columns

Examine:

df.columns
df.describe()
df.to_string()
df.test1 # or df['test1']; each column is exposed as an attribute when its name is a valid identifier
df.test1.median()
df.test1.corr(df.test2)  # correlation: 1 if the columns move together, 0 if unrelated, -1 if they move in opposite directions
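
The inspection calls above, run against a small frame. The notes do not record the actual data, so the scores here are made up; test2 is deliberately test1 plus 5 so the two columns are perfectly correlated.

```python
import pandas as pd

# Hypothetical scores; only the column names come from the notes.
df = pd.DataFrame({
    'fname': ['paul', 'george', 'ringo'],
    'test1': [70, 80, 90],
    'test2': [75, 85, 95],
})

print(df.axes[0])      # equivalent to df.index
print(df.columns)      # equivalent to df.axes[1]
print(df.describe())   # summary statistics for the numeric columns

median = df.test1.median()

# Pearson correlation between two columns.
corr = df.test1.corr(df.test2)
```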

Tweaking Data

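note: most pandas operations return a new object rather than modifying the original, so assign the result back
add row

df = pd.concat([df, new_rows])  # new_rows: a DataFrame with the same columns (hypothetical name)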

add column

df['test3'] = 0
# note: df.test3 = 3 does not work! Attribute assignment does not create a new column.

add column with function

def name_grade(val):
  ...  # body elided in the notes
df['test4'] = df.fname.apply(name_grade)

remove column

t3 = df.pop('test3')  # or del df['test3']   # note: del df.test3 does not work!
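
The add and remove steps above, strung together as one runnable sketch. The body of name_grade was not recorded in these notes, so the version below (length of the name) is only a stand-in, and the scores are invented.

```python
import pandas as pd

df = pd.DataFrame({
    'fname': ['paul', 'george'],
    'test1': [70, 80],
})

# Add a row: concat returns a new DataFrame; ignore_index renumbers the rows.
new_row = pd.DataFrame({'fname': ['ringo'], 'test1': [90]})
df = pd.concat([df, new_row], ignore_index=True)

# Add a constant column with item assignment.
df['test3'] = 0


# Hypothetical stand-in for the elided name_grade function.
def name_grade(val):
    return len(val)


# apply calls the function once per value in the column.
df['test4'] = df.fname.apply(name_grade)

# Remove a column; pop returns the removed column as a Series.
t3 = df.pop('test3')
```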

rename column
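
The notes do not record how the rename was done in the talk. One common way, shown here as an assumption, is DataFrame.rename with a columns mapping; the new name exam1 is made up.

```python
import pandas as pd

df = pd.DataFrame({'fname': ['paul'], 'test1': [70]})

# rename returns a new DataFrame; columns not in the mapping are left alone.
renamed = df.rename(columns={'test1': 'exam1'})
```

The original df keeps its column names unless the result is assigned back to it.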

Fill - statistics ignore NaN, so if you want missing values counted as zero, fill them first (fillna).
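
A small sketch of the difference filling makes, using a made-up Series with one missing value.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

# NaN is skipped by default: (1 + 3) / 2
mean_skipping_nan = ser.mean()

# After fillna(0) the missing value counts as zero: (1 + 0 + 3) / 3
mean_with_zero = ser.fillna(0).mean()

print(mean_skipping_nan, mean_with_zero)
```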

Install Pandas: (what worked for me)

# yum install gcc-c++
pip install pandas

Pivoting - Pivot Tables

print pd.pivot_table(..rules..)
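
The pivot rules used in the talk were not captured above. As an illustration only, here is one possible call on invented long-format data; the frame scores and its column names are assumptions.

```python
import pandas as pd

# Hypothetical long-format data: one row per (name, test) pair.
scores = pd.DataFrame({
    'fname': ['paul', 'paul', 'george', 'george'],
    'test': ['test1', 'test2', 'test1', 'test2'],
    'score': [70, 75, 80, 85],
})

# One row per name, one column per test, cells hold the mean score.
table = pd.pivot_table(scores, index='fname', columns='test',
                       values='score', aggfunc='mean')
print(table)
```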

Serialization

dump to CSV, etc
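
A minimal CSV round trip, assuming a small made-up frame. The text is kept in memory here; passing a file path to to_csv writes to disk instead.

```python
import io

import pandas as pd

df = pd.DataFrame({'fname': ['paul', 'george'], 'test1': [70, 80]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# Reading the text back gives an equivalent table.
round_trip = pd.read_csv(io.StringIO(csv_text))
```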

Plotting

box plot, etc...

Clipping

GPS example.
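
The GPS example itself is not recorded in these notes. Assuming "clipping" here means bounding values to a plausible range, a sketch with invented altitude readings might look like this.

```python
import pandas as pd

# Hypothetical altitude readings with two implausible outliers.
altitude = pd.Series([1400.0, 1405.0, -9999.0, 1410.0, 50000.0])

# Values below lower are raised to it; values above upper are lowered to it.
clipped = altitude.clip(lower=0, upper=3000)
print(clipped)
```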
