Data Analysis and Visualization with Python

Tobias Stollenwerk

German Aerospace Center (DLR)

Simulation and Software Technology

Outlook

  • Introduction to Numpy
  • Plotting with Matplotlib
  • Introduction to Pandas
  • Pandas use case: Personal expenses

Literature

Python for Data Analysis by Wes McKinney:

IPython Notebook

In [42]:
from IPython.display import HTML
HTML('<iframe src="http://127.0.0.1:8888" width=600 height=400></iframe>')
Out[42]:

Numpy

  • Fast vectorized arithmetic operations
  • Tools for integrating code written in C, C++ or Fortran
  • Basic building block of most of the scientific libraries in Python
In [43]:
import numpy as np

Numpy ndarray - Multidimensional Array Object

ndarray is a container for homogeneous data of a given type

In [44]:
data = np.array([[0.1, 0.3], [1.4, -3.1]], dtype='float64')
data
Out[44]:
array([[ 0.1,  0.3],
       [ 1.4, -3.1]])

Vectorized arithmetic operations:

In [45]:
data * 10
Out[45]:
array([[  1.,   3.],
       [ 14., -31.]])
In [46]:
data + 3*(np.sin(data)-1)
Out[46]:
array([[-2.6 , -1.81],
       [ 1.36, -6.22]])

Numpy - Easy creation and reshaping

In [47]:
np.ones((2,3))
Out[47]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [48]:
data = np.arange(6)
data
Out[48]:
array([0, 1, 2, 3, 4, 5])
In [49]:
data.reshape(2,3)
Out[49]:
array([[0, 1, 2],
       [3, 4, 5]])
In [50]:
np.random.rand(6)
Out[50]:
array([ 0.4 ,  0.56,  0.58,  0.01,  0.7 ,  0.8 ])

Numpy - Indexing and Slicing

In [51]:
data = np.arange(10)
data
Out[51]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [52]:
data[3]
Out[52]:
3
In [53]:
data[6:9]
Out[53]:
array([6, 7, 8])
In [54]:
data[::2]
Out[54]:
array([0, 2, 4, 6, 8])

Numpy - Indexing and Slicing

In [55]:
data = np.arange(12).reshape(3,4)
data
Out[55]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [56]:
data[1:3,1:4]
Out[56]:
array([[ 5,  6,  7],
       [ 9, 10, 11]])
In [57]:
data[[0,2,2],[1,2,3]]
Out[57]:
array([ 1, 10, 11])

Numpy - Boolean Indexing

In [58]:
data
Out[58]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [59]:
data[data > 4]
Out[59]:
array([ 5,  6,  7,  8,  9, 10, 11])
In [60]:
data[data % 3 != 0]
Out[60]:
array([ 1,  2,  4,  5,  7,  8, 10, 11])
In [61]:
data[data % 3 == 0] = 100
data
Out[61]:
array([[100,   1,   2, 100],
       [  4,   5, 100,   7],
       [  8, 100,  10,  11]])

Numpy - Linear Algebra

In [62]:
x = np.arange(1,7,dtype='float64').reshape(2,3)
x
Out[62]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
In [63]:
y = np.arange(8,14,dtype='float64').reshape(3,2)
y
Out[63]:
array([[  8.,   9.],
       [ 10.,  11.],
       [ 12.,  13.]])
In [64]:
y.dot(x)
Out[64]:
array([[  44.,   61.,   78.],
       [  54.,   75.,   96.],
       [  64.,   89.,  114.]])

Numpy - Linear Algebra

In [64]:
In [65]:
np.set_printoptions(precision=2, suppress=True)
In [66]:
from numpy.linalg import inv,qr
X = np.random.randn(3,3)
mat = X.T.dot(X)
inv(mat)
Out[66]:
array([[ 4.78, -2.57,  1.47],
       [-2.57,  1.82, -0.84],
       [ 1.47, -0.84,  0.63]])
In [67]:
mat.dot(inv(mat))
Out[67]:
array([[ 1.,  0., -0.],
       [-0.,  1., -0.],
       [ 0.,  0.,  1.]])
In [68]:
q, r = qr(mat)
r
Out[68]:
array([[-2.15, -1.32,  4.31],
       [ 0.  , -2.36, -4.23],
       [-0.  ,  0.  ,  0.55]])

Matplotlib

  • Library for publication-quality plots
  • Highly configurable
  • Provides an alternative, matlab-like interface
In [69]:
import matplotlib.pyplot as plt

Matplotlib - Plotting

  • IPython magic for plotting inside the IPython notebook document
In [70]:
%matplotlib inline
  • Plots reside in the figure object
  • Add subplots with add_subplot(nrows, ncol, position)
In [71]:
plt.rcParams['figure.figsize'] = 10, 2 
In [72]:
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
x = np.linspace(0, 3*np.pi, 1000)
y = np.sin(x**2)
ax1.plot(x,y);

Matplotlib - Labels

In [73]:
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
x = np.linspace(0, 3*np.pi, 1000)
y = np.sin(x**2)

ax1.plot(x,y, label='$\sin(x^2)$');
ax1.set_xticks([np.pi * i for i in range(0,4)])
ax1.set_xticklabels(['$0$', '$\pi$', '$2\pi$', '$3\pi$'], fontsize='x-large') 
ax1.set_title('My Plot', fontsize='x-large')
ax1.legend(loc='best', fontsize='x-large')
ax1.set_xlabel('$x$', fontsize='x-large');

Matplotlib - Subplots and Styles

In [74]:
plt.rcParams['figure.figsize'] = 14, 4 
In [75]:
fig = plt.figure()
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

ax1.hist(np.random.randn(100), bins=20, color='g', alpha=0.3);
ax2.scatter(np.arange(30), np.arange(30) + 3*np.random.randn(30));
ax3.plot(np.random.randn(50).cumsum(), 'r--');

Pandas

  • Data structures with integrated axes labeling
  • Time series indexing
  • Handling of missing data
  • Sophisticated data transformation
In [76]:
import pandas as pd

Pandas - Data Objects

One dimensional data:

In [77]:
pd.Series(np.random.randn(3), index = ['First', 'Second', 'Third'] )
Out[77]:
First     0.014906
Second    0.552592
Third     0.755098
dtype: float64

Multi dimensional data:

In [78]:
df=pd.DataFrame(np.random.randn(6).reshape(2,3), columns = ['Alice', 'Bob', 'Charles'], index=["One", "Two"])
df
Out[78]:
Alice Bob Charles
One -0.555279 0.040178 0.530189
Two -1.362870 0.816769 -1.258043

Pandas - Get columns

  • Column of DataFrame is Series
In [79]:
df['Alice']
Out[79]:
One   -0.555279
Two   -1.362870
Name: Alice, dtype: float64
In [80]:
type(df['Alice'])
Out[80]:
pandas.core.series.Series

Pandas - Indexing and Slicing

  • Label and index based slicing
In [81]:
df.ix['One',:]
Out[81]:
Alice     -0.555279
Bob        0.040178
Charles    0.530189
Name: One, dtype: float64
In [82]:
df.ix[0,[0,2]]
Out[82]:
Alice     -0.555279
Charles    0.530189
Name: One, dtype: float64

Pandas - Function application

In [83]:
func = lambda x: x + 100
df.apply(func)
Out[83]:
Alice Bob Charles
One 99.444721 100.040178 100.530189
Two 98.637130 100.816769 98.741957
In [84]:
df.sum()
Out[84]:
Alice     -1.918149
Bob        0.856947
Charles   -0.727854
dtype: float64

Pandas - Merge data

In [85]:
df1 = pd.DataFrame({'key': ['a', 'a', 'b', 'a', 'c'], 'data1': range(5)})
df1
Out[85]:
data1 key
0 0 a
1 1 a
2 2 b
3 3 a
4 4 c
In [86]:
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data2': range(3,6)})
df2
Out[86]:
data2 key
0 3 a
1 4 b
2 5 c

Pandas - Merge data

In [87]:
pd.merge(df2, df1)
Out[87]:
data2 key data1
0 3 a 0
1 3 a 1
2 3 a 3
3 4 b 2
4 5 c 4

Pandas - Concatenate data

In [88]:
df1 = pd.DataFrame(np.arange(12).reshape(3,4), index=['One', 'Two', 'Three'], columns=['A', 'B', 'C', 'D'])
df1
Out[88]:
A B C D
One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11
In [89]:
df2 = pd.DataFrame(np.arange(8,17).reshape(3,3), index=['Three', 'Four', 'Five'],columns=['A', 'B', 'C'])
df2
Out[89]:
A B C
Three 8 9 10
Four 11 12 13
Five 14 15 16

Pandas - Concatenate data

In [90]:
df3 = pd.concat([df1, df2])
df3
Out[90]:
A B C D
One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11
Three 8 9 10 NaN
Four 11 12 13 NaN
Five 14 15 16 NaN

Pandas - Drop duplicates

In [91]:
df3 = df3.drop_duplicates(subset=['A'],)
df3
Out[91]:
A B C D
One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11
Four 11 12 13 NaN
Five 14 15 16 NaN

Pandas - Time Series

In [92]:
from datetime import datetime
df3.index=[datetime(2014, 1, 3), datetime(2014, 1, 6), datetime(2014, 1, 14), datetime(2014, 2, 1), datetime(2014, 3, 25)]
df3.index.name = 'Dates'
df3.columns.name = 'Class'
df3
Out[92]:
Class A B C D
Dates
2014-01-03 0 1 2 3
2014-01-06 4 5 6 7
2014-01-14 8 9 10 11
2014-02-01 11 12 13 NaN
2014-03-25 14 15 16 NaN
In [93]:
df3.index[0]
Out[93]:
Timestamp('2014-01-03 00:00:00')

Pandas - Plotting

  • Extends matplotlib plotting
  • Automatic labeling
In [94]:
df3.plot();

Pandas use case: Personal expenses

In [95]:
import sys
sys.path.append("../")
In [96]:
import pyAccount.bookkeeping as bookkeeping
import pyAccount.categories as categories
In [97]:
testdatadir = '../pyAccount/personal_testdata'
filterdir = '../pyAccount/personal_filter'

# read in data
book=bookkeeping.book()
book.dataimport(testdatadir + '/umsaetze_test.csv')

# read in filter lists
folder = filterdir 
filt=categories.bookkeeping_filter()
filt.load(folder)

# apply filter to bookkeeping items and user has to provide category 
book.filter(filt.categories, raw_input)

# anonymize data
book.anonymize()

df = book.data
df = df.sort_index()

Typical account statement

In [98]:
df.head()
Out[98]:
Value Descr Category Comment
Date
2013-11-14 -13.48 Handelsblatt Media
2013-11-14 -12.9 Aral Car
2013-11-27 1659.69 Big Company Salary Salary
2013-11-28 -13.28 Aldi Food
2013-11-29 -12.61 Mobil phone Company Cell phone

Categorized by commandline tool: pyAccount

Pandas grouping mechanism

Sum of categorized expenses

In [99]:
group_sums=df.groupby('Category').sum()
group_sums.head()
Out[99]:
Value Descr Comment
Category
Car -149.97 Aral Esso Aral Aral Esso Aral Aral Aral Esso E...
Cash -131.28 ATM Foreign Bank ATM Foreign Bank ATM MyBank A...
Cell phone -83.24 Mobil phone Company Mobil phone Company Mobil ...
Clothes 386.14 Clothes Shop Clothes Shop Clothes Shop Clothes...
Ensurance -72.97 Ensurance Company Ensurance Company Ensurance ...

Pandas pie chart

Select subset of categories

In [100]:
plt.rcParams['figure.figsize'] = 6, 6 
In [101]:
pie_data = -group_sums.ix[['Car', 'Cash', 'Food', 'Kids', 'Media','Restaurant', 'Sports'],'Value']
pie_data.plot(kind='pie', title='Sum of categorized expenses');

Restrict time span index

Sum of categorized expenses restricted to certain timespan

In [102]:
df['2/1/2014':'4/1/2014'].groupby('Category').sum().head()
Out[102]:
Value Descr Comment
Category
Car -58.28 Aral Aral Aral Esso Esso
Cash -34.74 ATM Foreign Bank ATM MyBank ATM MyBank
Cell phone -28.71 Mobil phone Company Mobil phone Company
Clothes 450.67 Clothes Shop Clothes Shop Clothes Shop Clothes...
Ensurance -40.00 Ensurance Company Ensurance Company Ensurance ...

Get time series of specific category

In [103]:
food = df.groupby('Category').get_group('Food')
food.head(3)
Out[103]:
Value Descr Category Comment
Date
2013-11-28 -13.28 Aldi Food
2013-12-02 -14.27 Rewe Food
2013-12-02 -11.93 Rewe Food
In [104]:
plt.rcParams['figure.figsize'] = 14, 2
In [105]:
food.plot(style = 'o-');

Get monthly sum by resampling and converting timestamp to period

In [106]:
food_monthly_timestamp = food['Value'].resample('M', how='sum')
food_monthly_timestamp.head(4)
Out[106]:
Date
2013-11-30    -13.28
2013-12-31   -101.09
2014-01-31    -38.00
2014-02-28   -122.26
Freq: M, Name: Value, dtype: float64
In [107]:
food_monthly = food_monthly_timestamp.to_period()
food_monthly.head(4)
Out[107]:
Date
2013-11    -13.28
2013-12   -101.09
2014-01    -38.00
2014-02   -122.26
Freq: M, Name: Value, dtype: float64

Get monthly sum by resampling and converting timestamp to period

In [108]:
plt.rcParams['figure.figsize'] = 10, 4
In [109]:
food_monthly.plot(style = 'o-');

Relative change in monthly food expenses

In [110]:
food_change = food_monthly / food_monthly.shift(-1) -1
food_change[:-3]
Out[110]:
Date
2013-11   -0.868632
2013-12    1.660263
2014-01   -0.689187
2014-02   -0.170106
Freq: M, Name: Value, dtype: float64
In [111]:
plt.rcParams['figure.figsize'] = 10, 2
In [112]:
food_change.plot(style = 'o-');

Monthly expenses categorized

In [113]:
me = pd.DataFrame()
for group in ['Car', 'Cash', 'Food', 'Kids', 'Media','Restaurant']:
    group_expenses = df.groupby('Category').get_group(group)
    group_expenses_monthly = -group_expenses['Value'].resample('M', how='sum', kind='period')
    me[group] = group_expenses_monthly
me
Out[113]:
Car Cash Food Kids Media Restaurant
Date
2013-11 12.90 NaN 13.28 NaN 13.48 NaN
2013-12 14.37 23.52 101.09 11.40 14.86 25.68
2014-01 36.13 22.49 38.00 10.06 13.48 27.83
2014-02 32.34 11.51 122.26 10.91 13.94 23.57
2014-03 25.94 12.47 147.32 10.10 14.77 27.33
2014-04 13.35 49.64 35.75 13.32 14.06 38.62
2014-05 14.94 11.65 26.50 11.65 11.73 12.47

Monthly expenses categorized

In [114]:
plt.rcParams['figure.figsize'] = 10, 5 
In [115]:
me.plot(kind='barh', stacked=True, alpha=0.8);

Thank You