<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://aznot.com/index.php?action=history&amp;feed=atom&amp;title=OpenWest_2014%2FPython_Pandas</id>
	<title>OpenWest 2014/Python Pandas - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://aznot.com/index.php?action=history&amp;feed=atom&amp;title=OpenWest_2014%2FPython_Pandas"/>
	<link rel="alternate" type="text/html" href="https://aznot.com/index.php?title=OpenWest_2014/Python_Pandas&amp;action=history"/>
	<updated>2026-04-30T05:51:16Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://aznot.com/index.php?title=OpenWest_2014/Python_Pandas&amp;diff=59&amp;oldid=prev</id>
		<title>Kenneth at 01:35, 12 May 2014</title>
		<link rel="alternate" type="text/html" href="https://aznot.com/index.php?title=OpenWest_2014/Python_Pandas&amp;diff=59&amp;oldid=prev"/>
		<updated>2014-05-12T01:35:07Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;A Brief Tour of the Python Pandas Package&lt;br /&gt;
:by Matt Harrison (@__mharrison__)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Python is gaining popularity among Data Scientists. One reason is the Pandas package, which provides facilities for data manipulation. It has facilities similar to Excel, SQL, ETL packages, and more.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
pandas: Python Data Analysis Library - http://pandas.pydata.org/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NumPy - http://www.numpy.org/ - NumPy is the fundamental package for scientific computing with Python.&lt;br /&gt;
&lt;br /&gt;
SciPy - http://www.scipy.org/ - SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.&lt;br /&gt;
&lt;br /&gt;
Pandas - http://pandas.pydata.org/ - Python Data Analysis Library - pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Matt Harrison - http://hairysun.com&lt;br /&gt;
* co-chair Utah Python.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Impetus - if this were a Perl class it would be about regexes. Pandas is the weapon of choice for dealing with tabular data in Python.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Pandas is &amp;quot;A nosql in-memory db using Python, that has SQL-like constructs&amp;quot; - Matt&amp;#039;s view&lt;br /&gt;
* note: it adopts many NumPy-isms that may not look like pure Python&lt;br /&gt;
&lt;br /&gt;
Based on the data frame (tabular data) concept borrowed from R. A data frame is similar to a table in SQL.&lt;br /&gt;
&lt;br /&gt;
Pandas is best for small to medium data, not &amp;quot;Big Data&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Not really good from an ETL perspective - star schema&lt;br /&gt;
* Extract Transform Load - take data from one system to another&lt;br /&gt;
* Data warehousing&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data Structures:&lt;br /&gt;
* Series (1D)&lt;br /&gt;
* TimeSeries (1D) - special Series&lt;br /&gt;
* DataFrame (2D)&lt;br /&gt;
* Panel (3D) - like stacked DataFrames&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Series:&lt;br /&gt;
 # python version&lt;br /&gt;
 ser = {&lt;br /&gt;
   &amp;#039;index&amp;#039;:[0,1,2],&lt;br /&gt;
   &amp;#039;data&amp;#039;:[.5,.6,.7],&lt;br /&gt;
   &amp;#039;name&amp;#039;:&amp;#039;growth&amp;#039;,&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
 # pandas version&lt;br /&gt;
 import pandas as pd&lt;br /&gt;
 ser = pd.Series([.5,.6,.7], name=&amp;#039;growth&amp;#039;)&lt;br /&gt;
&lt;br /&gt;
Behaves like a NumPy array:
 ser[1]&lt;br /&gt;
 ser.mean()&lt;br /&gt;
&lt;br /&gt;
Boolean Array&lt;br /&gt;
 ser &amp;gt; ser.median()&lt;br /&gt;
   0 False
   1 False
   2 True
&lt;br /&gt;
Filtering:&lt;br /&gt;
 ser[ser &amp;gt; ser.median()]&lt;br /&gt;
&lt;br /&gt;
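A Series can also carry explicit labels instead of the default 0, 1, 2. A minimal sketch of the boolean array and filtering steps, assuming the labels a, b and c (the labels are an assumption, not something recorded in these notes):&lt;br /&gt;

```python
import pandas as pd

# Assumed labels 'a', 'b', 'c'; the values are the ones used in the notes
ser = pd.Series([.5, .6, .7], index=['a', 'b', 'c'], name='growth')

# Comparing against a scalar gives a boolean Series on the same index
mask = ser > ser.median()

# Indexing with the mask keeps only the rows where it is True
above = ser[mask]

print(mask)
print(above)
```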
&lt;br /&gt;
DataFrames - Tables with columns as Series&lt;br /&gt;
 # python version, but not a true Pandas DataFrame&lt;br /&gt;
 df = {&lt;br /&gt;
   &amp;#039;index&amp;#039;:[0,1,2],&lt;br /&gt;
   &amp;#039;cols&amp;#039;: [&lt;br /&gt;
     { &amp;#039;name&amp;#039;:&amp;#039;growth&amp;#039;,&lt;br /&gt;
       &amp;#039;data&amp;#039;:[.5,.7,1.2] },&lt;br /&gt;
     { &amp;#039;name&amp;#039;:&amp;#039;Name&amp;#039;,&lt;br /&gt;
       &amp;#039;data&amp;#039;:[&amp;quot;paul&amp;quot;,&amp;quot;george&amp;quot;, &amp;quot;ringo&amp;quot;] },&lt;br /&gt;
    ]&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
 # pandas version&lt;br /&gt;
 df = pd.DataFrame({&lt;br /&gt;
   &amp;#039;growth&amp;#039;:[.5,.7,1.2],&lt;br /&gt;
   &amp;#039;Name&amp;#039;:[&amp;#039;paul&amp;#039;,&amp;#039;george&amp;#039;,&amp;#039;ringo&amp;#039;] })&lt;br /&gt;
&lt;br /&gt;
Create a DataFrame from: rows (list of dicts), columns (dict of lists), a CSV file, or directly from a NumPy ndarray&lt;br /&gt;
&lt;br /&gt;
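The four construction routes could look roughly like this. It is a sketch only: the CSV text is held in memory so no file is needed, and the x and y column names for the ndarray are made up for illustration.&lt;br /&gt;

```python
import io

import numpy as np
import pandas as pd

# From rows: a list of dicts, one dict per row
rows = [{'growth': .5, 'Name': 'paul'},
        {'growth': .7, 'Name': 'george'},
        {'growth': 1.2, 'Name': 'ringo'}]
df_rows = pd.DataFrame(rows)

# From columns: a dict of lists, one list per column
df_cols = pd.DataFrame({'growth': [.5, .7, 1.2],
                        'Name': ['paul', 'george', 'ringo']})

# From CSV: read_csv accepts a path or any file-like object
csv_text = "growth,Name\n0.5,paul\n0.7,george\n1.2,ringo\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# From a NumPy ndarray: column names (hypothetical here) are passed separately
df_arr = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['x', 'y'])
```

&lt;br /&gt;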
Two Axes:&lt;br /&gt;
* axis 0 - index
* axis 1 - columns
 df.axes[0] or df.index&lt;br /&gt;
 df.axes[1] or df.columns&lt;br /&gt;
&lt;br /&gt;
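The axis numbers matter mostly as the axis argument of other methods. A small sketch with made-up numbers, assuming the test1 and test2 column names used elsewhere in these notes:&lt;br /&gt;

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({'test1': [1, 2, 3], 'test2': [10, 20, 30]})

print(list(df.axes[0]))   # row labels, same as df.index
print(list(df.axes[1]))   # column labels, same as df.columns

# The axis argument picks the direction an operation runs along
per_column = df.sum(axis=0)   # collapses the index: one total per column
per_row = df.sum(axis=1)      # collapses the columns: one total per row
```

&lt;br /&gt;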
Examine:&lt;br /&gt;
 df.columns&lt;br /&gt;
 df.describe()&lt;br /&gt;
 df.to_string()&lt;br /&gt;
 df.test1 # or df[&amp;#039;test1&amp;#039;]  # makes magic attribute for you&lt;br /&gt;
 df.test1.median()&lt;br /&gt;
 df.test1.corr(df.test2)  # correlation: 1 if the data move in the same direction, 0 if unrelated, -1 if they move in opposite directions
&lt;br /&gt;
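To see the two extremes of corr, here is a sketch with invented scores. The inverse column is hypothetical and exists only to produce a correlation of -1.&lt;br /&gt;

```python
import pandas as pd

# Invented scores; 'inverse' falls exactly as test1 rises
df = pd.DataFrame({'test1': [70, 80, 90, 100],
                   'test2': [65, 75, 85, 95],
                   'inverse': [100, 90, 80, 70]})

print(df.describe())      # count, mean, std, min, quartiles, max per column

middle = df.test1.median()
same = df.test1.corr(df.test2)         # columns rise together
opposite = df.test1.corr(df.inverse)   # one rises while the other falls
```

&lt;br /&gt;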
Tweaking Data&lt;br /&gt;
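* note: most pandas operations return a new object rather than modifying the original in place&lt;br /&gt;
* add row&lt;br /&gt;
 df = pd.concat([df, new_rows])  # new_rows: another DataFrame holding the rows to add&lt;br /&gt;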
* add column&lt;br /&gt;
 df[&amp;#039;test3&amp;#039;] = 0&lt;br /&gt;
 # note: df.test3 = 0 does not create a column!
* add column with function&lt;br /&gt;
 def name_grade(val):&lt;br /&gt;
   ..&lt;br /&gt;
 df[&amp;#039;test4&amp;#039;] = df.fname.apply(name_grade)&lt;br /&gt;
* remove column&lt;br /&gt;
 t3 = df.pop(&amp;#039;test3&amp;#039;)  # or del df[&amp;#039;test3&amp;#039;]   # note: del df.test3 does not work!&lt;br /&gt;
* rename column&lt;br /&gt;
&lt;br /&gt;
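Put together, the tweaks above might look like the following. This is a sketch under assumptions: the data, the letter_grade helper and its cutoff of 60 are invented for illustration, and rename is one way to cover the rename step that the notes leave without an example.&lt;br /&gt;

```python
import pandas as pd

# Invented starting data
df = pd.DataFrame({'fname': ['paul', 'george'], 'test1': [90, 72]})

# Add rows: concat returns a new frame; ignore_index renumbers the rows
extra = pd.DataFrame({'fname': ['ringo'], 'test1': [55]})
df = pd.concat([df, extra], ignore_index=True)

# Add a constant column
df['test3'] = 0


def letter_grade(score):
    # Hypothetical cutoff, for illustration only
    return 'pass' if score >= 60 else 'fail'


# Add a column computed from another column
df['result'] = df.test1.apply(letter_grade)

# Remove a column (pop returns the removed Series)
t3 = df.pop('test3')

# Rename a column; rename returns a new frame
df = df.rename(columns={'fname': 'first_name'})
```

&lt;br /&gt;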
Fill - statistics ignore NaN, so use fillna if you want missing values counted as zero.&lt;br /&gt;
&lt;br /&gt;
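The difference shows up in the mean. A minimal sketch with made-up values:&lt;br /&gt;

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
ser = pd.Series([1.0, np.nan, 3.0])

skipped = ser.mean()            # NaN is ignored: (1 + 3) / 2
zeroed = ser.fillna(0).mean()   # NaN counted as 0: (1 + 0 + 3) / 3
```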
&lt;br /&gt;
Install Pandas: (what worked for me)&lt;br /&gt;
 # yum install gcc-c++&lt;br /&gt;
 pip install pandas&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Pivoting - Pivot Tables&lt;br /&gt;
 print(pd.pivot_table(..rules..))
&lt;br /&gt;
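The rules are the index, columns, values and aggregation to use. A sketch with invented long-format data (one row per name and test), not the data from the talk:&lt;br /&gt;

```python
import pandas as pd

# Invented long-format data: one row per (name, test) score
scores = pd.DataFrame({'name': ['paul', 'paul', 'ringo', 'ringo'],
                       'test': ['test1', 'test2', 'test1', 'test2'],
                       'score': [90, 80, 60, 70]})

# One row per name, one column per test, mean score in each cell
table = pd.pivot_table(scores, index='name', columns='test',
                       values='score', aggfunc='mean')
print(table)
```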
&lt;br /&gt;
Serialization&lt;br /&gt;
* dump to CSV, etc&lt;br /&gt;
&lt;br /&gt;
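A CSV round trip could be sketched as follows. The text is kept in memory here; passing a file path to to_csv and read_csv works the same way.&lt;br /&gt;

```python
import io

import pandas as pd

df = pd.DataFrame({'growth': [.5, .7, 1.2],
                   'Name': ['paul', 'george', 'ringo']})

# With no path given, to_csv returns the CSV text; index=False omits row labels
text = df.to_csv(index=False)

# Reading the text back gives an equivalent frame
restored = pd.read_csv(io.StringIO(text))
```

&lt;br /&gt;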
Plotting&lt;br /&gt;
* box plot, etc...&lt;br /&gt;
&lt;br /&gt;
Clipping&lt;br /&gt;
&lt;br /&gt;
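Assuming this heading refers to the clip method (the notes do not say), a minimal sketch with made-up values and bounds:&lt;br /&gt;

```python
import pandas as pd

# Made-up values; the bounds 0 and 10 are arbitrary
ser = pd.Series([-5, 3, 12])

# Values below lower are raised to it, values above upper are lowered to it
clipped = ser.clip(lower=0, upper=10)
print(clipped.tolist())
```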
&lt;br /&gt;
GPS example.&lt;/div&gt;</summary>
		<author><name>Kenneth</name></author>
	</entry>
</feed>