# Data Analysis and Visualization with Python

## Outlook

• Introduction to Numpy
• Plotting with Matplotlib
• Introduction to Pandas
• Pandas use case: Personal expenses

## Literature

Python for Data Analysis by Wes McKinney:

## IPython Notebook

In [42]:
from IPython.display import HTML
HTML('<iframe src="http://127.0.0.1:8888" width=600 height=400></iframe>')

Out[42]:

## Numpy

• Fast vectorized arithmetic operations
• Tools for integrating code written in C, C++ or Fortran
• Basic building block of most of the scientific libraries in Python
In [43]:
import numpy as np


## Numpy ndarray - Multidimensional Array Object

ndarray is a container for homogeneous data of a given type

In [44]:
data = np.array([[0.1, 0.3], [1.4, -3.1]], dtype='float64')
data

Out[44]:
array([[ 0.1,  0.3],
[ 1.4, -3.1]])


Vectorized arithmetic operations:

In [45]:
data * 10

Out[45]:
array([[  1.,   3.],
[ 14., -31.]])

In [46]:
data + 3*(np.sin(data)-1)

Out[46]:
array([[-2.6 , -1.81],
[ 1.36, -6.22]])


## Numpy - Easy creation and reshaping

In [47]:
np.ones((2,3))

Out[47]:
array([[ 1.,  1.,  1.],
[ 1.,  1.,  1.]])

In [48]:
data = np.arange(6)
data

Out[48]:
array([0, 1, 2, 3, 4, 5])

In [49]:
data.reshape(2,3)

Out[49]:
array([[0, 1, 2],
[3, 4, 5]])

In [50]:
np.random.rand(6)

Out[50]:
array([ 0.4 ,  0.56,  0.58,  0.01,  0.7 ,  0.8 ])


## Numpy - Indexing and Slicing

In [51]:
data = np.arange(10)
data

Out[51]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [52]:
data[3]

Out[52]:
3

In [53]:
data[6:9]

Out[53]:
array([6, 7, 8])

In [54]:
data[::2]

Out[54]:
array([0, 2, 4, 6, 8])


## Numpy - Indexing and Slicing

In [55]:
data = np.arange(12).reshape(3,4)
data

Out[55]:
array([[ 0,  1,  2,  3],
[ 4,  5,  6,  7],
[ 8,  9, 10, 11]])

In [56]:
data[1:3,1:4]

Out[56]:
array([[ 5,  6,  7],
[ 9, 10, 11]])

In [57]:
data[[0,2,2],[1,2,3]]

Out[57]:
array([ 1, 10, 11])


## Numpy - Boolean Indexing

In [58]:
data

Out[58]:
array([[ 0,  1,  2,  3],
[ 4,  5,  6,  7],
[ 8,  9, 10, 11]])

In [59]:
data[data > 4]

Out[59]:
array([ 5,  6,  7,  8,  9, 10, 11])

In [60]:
data[data % 3 != 0]

Out[60]:
array([ 1,  2,  4,  5,  7,  8, 10, 11])

In [61]:
data[data % 3 == 0] = 100
data

Out[61]:
array([[100,   1,   2, 100],
[  4,   5, 100,   7],
[  8, 100,  10,  11]])


## Numpy - Linear Algebra

In [62]:
x = np.arange(1,7,dtype='float64').reshape(2,3)
x

Out[62]:
array([[ 1.,  2.,  3.],
[ 4.,  5.,  6.]])

In [63]:
y = np.arange(8,14,dtype='float64').reshape(3,2)
y

Out[63]:
array([[  8.,   9.],
[ 10.,  11.],
[ 12.,  13.]])

In [64]:
y.dot(x)

Out[64]:
array([[  44.,   61.,   78.],
[  54.,   75.,   96.],
[  64.,   89.,  114.]])


## Numpy - Linear Algebra

In [66]:
from numpy.linalg import inv,qr
X = np.random.randn(3,3)
mat = X.T.dot(X)
inv(mat)

Out[66]:
array([[ 4.78, -2.57,  1.47],
[-2.57,  1.82, -0.84],
[ 1.47, -0.84,  0.63]])

In [67]:
mat.dot(inv(mat))

Out[67]:
array([[ 1.,  0., -0.],
[-0.,  1., -0.],
[ 0.,  0.,  1.]])

In [68]:
q, r = qr(mat)
r

Out[68]:
array([[-2.15, -1.32,  4.31],
[ 0.  , -2.36, -4.23],
[-0.  ,  0.  ,  0.55]])


## Matplotlib

• Library for publication-quality plots
• Highly configurable
• Provides an alternative, matlab-like interface
In [69]:
import matplotlib.pyplot as plt


## Matplotlib - Plotting

• IPython magic for plotting inside the IPython notebook document
In [70]:
%matplotlib inline

• Plots reside in the figure object
• Add subplots with add_subplot(nrows, ncol, position)
In [72]:
fig = plt.figure()
x = np.linspace(0, 3*np.pi, 1000)
y = np.sin(x**2)
ax1.plot(x,y);


## Matplotlib - Labels

In [73]:
fig = plt.figure()
x = np.linspace(0, 3*np.pi, 1000)
y = np.sin(x**2)

ax1.plot(x,y, label='$\sin(x^2)$');
ax1.set_xticks([np.pi * i for i in range(0,4)])
ax1.set_xticklabels(['$0$', '$\pi$', '$2\pi$', '$3\pi$'], fontsize='x-large')
ax1.set_title('My Plot', fontsize='x-large')
ax1.legend(loc='best', fontsize='x-large')
ax1.set_xlabel('$x$', fontsize='x-large');


## Matplotlib - Subplots and Styles

In [75]:
fig = plt.figure()

ax1.hist(np.random.randn(100), bins=20, color='g', alpha=0.3);
ax2.scatter(np.arange(30), np.arange(30) + 3*np.random.randn(30));
ax3.plot(np.random.randn(50).cumsum(), 'r--');


## Pandas

• Data structures with integrated axes labeling
• Time series indexing
• Handling of missing data
• Sophisticated data transformation
In [76]:
import pandas as pd


## Pandas - Data Objects

One dimensional data:

In [77]:
pd.Series(np.random.randn(3), index = ['First', 'Second', 'Third'] )

Out[77]:
First     0.014906
Second    0.552592
Third     0.755098
dtype: float64


Multi dimensional data:

In [78]:
df=pd.DataFrame(np.random.randn(6).reshape(2,3), columns = ['Alice', 'Bob', 'Charles'], index=["One", "Two"])
df

Out[78]:
Alice Bob Charles
One -0.555279 0.040178 0.530189
Two -1.362870 0.816769 -1.258043

## Pandas - Get columns

• Column of DataFrame is Series
In [79]:
df['Alice']

Out[79]:
One   -0.555279
Two   -1.362870
Name: Alice, dtype: float64

In [80]:
type(df['Alice'])

Out[80]:
pandas.core.series.Series


## Pandas - Indexing and Slicing

• Label and index based slicing
In [81]:
df.ix['One',:]

Out[81]:
Alice     -0.555279
Bob        0.040178
Charles    0.530189
Name: One, dtype: float64

In [82]:
df.ix[0,[0,2]]

Out[82]:
Alice     -0.555279
Charles    0.530189
Name: One, dtype: float64


## Pandas - Function application

In [83]:
func = lambda x: x + 100
df.apply(func)

Out[83]:
Alice Bob Charles
One 99.444721 100.040178 100.530189
Two 98.637130 100.816769 98.741957
In [84]:
df.sum()

Out[84]:
Alice     -1.918149
Bob        0.856947
Charles   -0.727854
dtype: float64


## Pandas - Merge data

In [85]:
df1 = pd.DataFrame({'key': ['a', 'a', 'b', 'a', 'c'], 'data1': range(5)})
df1

Out[85]:
data1 key
0 0 a
1 1 a
2 2 b
3 3 a
4 4 c
In [86]:
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data2': range(3,6)})
df2

Out[86]:
data2 key
0 3 a
1 4 b
2 5 c

## Pandas - Merge data

In [87]:
pd.merge(df2, df1)

Out[87]:
data2 key data1
0 3 a 0
1 3 a 1
2 3 a 3
3 4 b 2
4 5 c 4

## Pandas - Concatenate data

In [88]:
df1 = pd.DataFrame(np.arange(12).reshape(3,4), index=['One', 'Two', 'Three'], columns=['A', 'B', 'C', 'D'])
df1

Out[88]:
A B C D
One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11
In [89]:
df2 = pd.DataFrame(np.arange(8,17).reshape(3,3), index=['Three', 'Four', 'Five'],columns=['A', 'B', 'C'])
df2

Out[89]:
A B C
Three 8 9 10
Four 11 12 13
Five 14 15 16

## Pandas - Concatenate data

In [90]:
df3 = pd.concat([df1, df2])
df3

Out[90]:
A B C D
One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11
Three 8 9 10 NaN
Four 11 12 13 NaN
Five 14 15 16 NaN

## Pandas - Drop duplicates

In [91]:
df3 = df3.drop_duplicates(subset=['A'],)
df3

Out[91]:
A B C D
One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11
Four 11 12 13 NaN
Five 14 15 16 NaN

## Pandas - Time Series

In [92]:
from datetime import datetime
df3.index=[datetime(2014, 1, 3), datetime(2014, 1, 6), datetime(2014, 1, 14), datetime(2014, 2, 1), datetime(2014, 3, 25)]
df3.index.name = 'Dates'
df3.columns.name = 'Class'
df3

Out[92]:
Class A B C D
Dates
2014-01-03 0 1 2 3
2014-01-06 4 5 6 7
2014-01-14 8 9 10 11
2014-02-01 11 12 13 NaN
2014-03-25 14 15 16 NaN
In [93]:
df3.index[0]

Out[93]:
Timestamp('2014-01-03 00:00:00')


## Pandas - Plotting

• Extends matplotlib plotting
• Automatic labeling
In [94]:
df3.plot();


## Pandas use case: Personal expenses

Typical account statement

In [98]:
df.head()

Out[98]:
Value Descr Category Comment
Date
2013-11-14 -13.48 Handelsblatt Media
2013-11-14 -12.9 Aral Car
2013-11-27 1659.69 Big Company Salary Salary
2013-11-28 -13.28 Aldi Food
2013-11-29 -12.61 Mobil phone Company Cell phone

Categorized by commandline tool: pyAccount

## Pandas grouping mechanism

Sum of categorized expenses

In [99]:
group_sums=df.groupby('Category').sum()

Out[99]:
Value Descr Comment
Category
Car -149.97 Aral Esso Aral Aral Esso Aral Aral Aral Esso E...
Cash -131.28 ATM Foreign Bank ATM Foreign Bank ATM MyBank A...
Cell phone -83.24 Mobil phone Company Mobil phone Company Mobil ...
Clothes 386.14 Clothes Shop Clothes Shop Clothes Shop Clothes...
Ensurance -72.97 Ensurance Company Ensurance Company Ensurance ...

## Pandas pie chart

Select subset of categories

In [101]:
pie_data = -group_sums.ix[['Car', 'Cash', 'Food', 'Kids', 'Media','Restaurant', 'Sports'],'Value']
pie_data.plot(kind='pie', title='Sum of categorized expenses');


## Restrict time span index

Sum of categorized expenses restricted to certain timespan

In [102]:
df['2/1/2014':'4/1/2014'].groupby('Category').sum().head()

Out[102]:
Value Descr Comment
Category
Car -58.28 Aral Aral Aral Esso Esso
Cash -34.74 ATM Foreign Bank ATM MyBank ATM MyBank
Cell phone -28.71 Mobil phone Company Mobil phone Company
Clothes 450.67 Clothes Shop Clothes Shop Clothes Shop Clothes...
Ensurance -40.00 Ensurance Company Ensurance Company Ensurance ...

## Get time series of specific category

In [103]:
food = df.groupby('Category').get_group('Food')

Out[103]:
Value Descr Category Comment
Date
2013-11-28 -13.28 Aldi Food
2013-12-02 -14.27 Rewe Food
2013-12-02 -11.93 Rewe Food
In [105]:
food.plot(style = 'o-');


## Get monthly sum by resampling and converting timestamp to period

In [106]:
food_monthly_timestamp = food['Value'].resample('M', how='sum')

Out[106]:
Date
2013-11-30    -13.28
2013-12-31   -101.09
2014-01-31    -38.00
2014-02-28   -122.26
Freq: M, Name: Value, dtype: float64

In [107]:
food_monthly = food_monthly_timestamp.to_period()

Out[107]:
Date
2013-11    -13.28
2013-12   -101.09
2014-01    -38.00
2014-02   -122.26
Freq: M, Name: Value, dtype: float64


## Get monthly sum by resampling and converting timestamp to period

In [109]:
food_monthly.plot(style = 'o-');


## Relative change in monthly food expenses

In [110]:
food_change = food_monthly / food_monthly.shift(-1) -1
food_change[:-3]

Out[110]:
Date
2013-11   -0.868632
2013-12    1.660263
2014-01   -0.689187
2014-02   -0.170106
Freq: M, Name: Value, dtype: float64

In [112]:
food_change.plot(style = 'o-');


## Monthly expenses categorized

In [113]:
me = pd.DataFrame()
for group in ['Car', 'Cash', 'Food', 'Kids', 'Media','Restaurant']:
group_expenses = df.groupby('Category').get_group(group)
group_expenses_monthly = -group_expenses['Value'].resample('M', how='sum', kind='period')
me[group] = group_expenses_monthly
me

Out[113]:
Car Cash Food Kids Media Restaurant
Date
2013-11 12.90 NaN 13.28 NaN 13.48 NaN
2013-12 14.37 23.52 101.09 11.40 14.86 25.68
2014-01 36.13 22.49 38.00 10.06 13.48 27.83
2014-02 32.34 11.51 122.26 10.91 13.94 23.57
2014-03 25.94 12.47 147.32 10.10 14.77 27.33
2014-04 13.35 49.64 35.75 13.32 14.06 38.62
2014-05 14.94 11.65 26.50 11.65 11.73 12.47

## Monthly expenses categorized

In [115]:
me.plot(kind='barh', stacked=True, alpha=0.8);