Data Analysis and Visualization with Python

Tobias Stollenwerk

German Aerospace Center (DLR)

Simulation and Software Technology

Outlook

Introduction to Numpy
Plotting with Matplotlib
Introduction to Pandas
Pandas use case: Personal expenses

Literature

Python for Data Analysis by Wes McKinney:

IPython Notebook

In [42]:

from IPython.display import HTML
HTML('<iframe src="http://127.0.0.1:8888" width=600 height=400></iframe>')

Out[42]:

Numpy

Fast vectorized arithmetic operations
Tools for integrating code written in C, C++ or Fortran
Basic building block of most of the scientific libraries in Python

In [43]:

import numpy as np

Numpy ndarray - Multidimensional Array Object

ndarray is a container for homogeneous data of a given type

In [44]:

data = np.array([[0.1, 0.3], [1.4, -3.1]], dtype='float64')
data

Out[44]:

array([[ 0.1,  0.3],
       [ 1.4, -3.1]])

Vectorized arithmetic operations:

In [45]:

data * 10

Out[45]:

array([[  1.,   3.],
       [ 14., -31.]])

In [46]:

data + 3*(np.sin(data)-1)

Out[46]:

array([[-2.6 , -1.81],
       [ 1.36, -6.22]])

Numpy - Easy creation and reshaping

In [47]:

np.ones((2,3))

Out[47]:

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [48]:

data = np.arange(6)
data

Out[48]:

array([0, 1, 2, 3, 4, 5])

In [49]:

data.reshape(2,3)

Out[49]:

array([[0, 1, 2],
       [3, 4, 5]])

In [50]:

np.random.rand(6)

Out[50]:

array([ 0.4 ,  0.56,  0.58,  0.01,  0.7 ,  0.8 ])

Numpy - Indexing and Slicing

In [51]:

data = np.arange(10)
data

Out[51]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [52]:

data[3]

Out[52]:

In [53]:

data[6:9]

Out[53]:

array([6, 7, 8])

In [54]:

data[::2]

Out[54]:

array([0, 2, 4, 6, 8])

Numpy - Indexing and Slicing

In [55]:

data = np.arange(12).reshape(3,4)
data

Out[55]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [56]:

data[1:3,1:4]

Out[56]:

array([[ 5,  6,  7],
       [ 9, 10, 11]])

In [57]:

data[[0,2,2],[1,2,3]]

Out[57]:

array([ 1, 10, 11])

Numpy - Boolean Indexing

In [58]:

data

Out[58]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [59]:

data[data > 4]

Out[59]:

array([ 5,  6,  7,  8,  9, 10, 11])

In [60]:

data[data % 3 != 0]

Out[60]:

array([ 1,  2,  4,  5,  7,  8, 10, 11])

In [61]:

data[data % 3 == 0] = 100
data

Out[61]:

array([[100,   1,   2, 100],
       [  4,   5, 100,   7],
       [  8, 100,  10,  11]])

Numpy - Linear Algebra

In [62]:

x = np.arange(1,7,dtype='float64').reshape(2,3)
x

Out[62]:

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [63]:

y = np.arange(8,14,dtype='float64').reshape(3,2)
y

Out[63]:

array([[  8.,   9.],
       [ 10.,  11.],
       [ 12.,  13.]])

In [64]:

y.dot(x)

Out[64]:

array([[  44.,   61.,   78.],
       [  54.,   75.,   96.],
       [  64.,   89.,  114.]])

Numpy - Linear Algebra

In [66]:

from numpy.linalg import inv,qr
X = np.random.randn(3,3)
mat = X.T.dot(X)
inv(mat)

Out[66]:

array([[ 4.78, -2.57,  1.47],
       [-2.57,  1.82, -0.84],
       [ 1.47, -0.84,  0.63]])

In [67]:

mat.dot(inv(mat))

Out[67]:

array([[ 1.,  0., -0.],
       [-0.,  1., -0.],
       [ 0.,  0.,  1.]])

In [68]:

q, r = qr(mat)
r

Out[68]:

array([[-2.15, -1.32,  4.31],
       [ 0.  , -2.36, -4.23],
       [-0.  ,  0.  ,  0.55]])

Matplotlib

Library for publication-quality plots
Highly configurable
Provides an alternative, matlab-like interface

In [69]:

import matplotlib.pyplot as plt

Matplotlib - Plotting

IPython magic for plotting inside the IPython notebook document

In [70]:

%matplotlib inline

Plots reside in the figure object
Add subplots with add_subplot(nrows, ncol, position)

In [72]:

fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
x = np.linspace(0, 3*np.pi, 1000)
y = np.sin(x**2)
ax1.plot(x,y);

Matplotlib - Labels

In [73]:

fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
x = np.linspace(0, 3*np.pi, 1000)
y = np.sin(x**2)

ax1.plot(x,y, label='$\sin(x^2)$');
ax1.set_xticks([np.pi * i for i in range(0,4)])
ax1.set_xticklabels(['$0$', '$\pi$', '$2\pi$', '$3\pi$'], fontsize='x-large') 
ax1.set_title('My Plot', fontsize='x-large')
ax1.legend(loc='best', fontsize='x-large')
ax1.set_xlabel('$x$', fontsize='x-large');

Matplotlib - Subplots and Styles

In [75]:

fig = plt.figure()
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

ax1.hist(np.random.randn(100), bins=20, color='g', alpha=0.3);
ax2.scatter(np.arange(30), np.arange(30) + 3*np.random.randn(30));
ax3.plot(np.random.randn(50).cumsum(), 'r--');

Pandas

Data structures with integrated axes labeling
Time series indexing
Handling of missing data
Sophisticated data transformation

In [76]:

import pandas as pd

Pandas - Data Objects

One dimensional data:

In [77]:

pd.Series(np.random.randn(3), index = ['First', 'Second', 'Third'] )

Out[77]:

First     0.014906
Second    0.552592
Third     0.755098
dtype: float64

Multi dimensional data:

In [78]:

df=pd.DataFrame(np.random.randn(6).reshape(2,3), columns = ['Alice', 'Bob', 'Charles'], index=["One", "Two"])
df

Out[78]:

	Alice	Bob	Charles
One	-0.555279	0.040178	0.530189
Two	-1.362870	0.816769	-1.258043

Pandas - Get columns

Column of DataFrame is Series

In [79]:

df['Alice']

Out[79]:

One   -0.555279
Two   -1.362870
Name: Alice, dtype: float64

In [80]:

type(df['Alice'])

Out[80]:

pandas.core.series.Series

Pandas - Indexing and Slicing

Label and index based slicing

In [81]:

df.ix['One',:]

Out[81]:

Alice     -0.555279
Bob        0.040178
Charles    0.530189
Name: One, dtype: float64

In [82]:

df.ix[0,[0,2]]

Out[82]:

Alice     -0.555279
Charles    0.530189
Name: One, dtype: float64

Pandas - Function application

In [83]:

func = lambda x: x + 100
df.apply(func)

Out[83]:

	Alice	Bob	Charles
One	99.444721	100.040178	100.530189
Two	98.637130	100.816769	98.741957

In [84]:

df.sum()

Out[84]:

Alice     -1.918149
Bob        0.856947
Charles   -0.727854
dtype: float64

Pandas - Merge data

In [85]:

df1 = pd.DataFrame({'key': ['a', 'a', 'b', 'a', 'c'], 'data1': range(5)})
df1

Out[85]:

	data1	key
0	0	a
1	1	a
2	2	b
3	3	a
4	4	c

In [86]:

df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data2': range(3,6)})
df2

Out[86]:

	data2	key
0	3	a
1	4	b
2	5	c

Pandas - Merge data

In [87]:

pd.merge(df2, df1)

Out[87]:

	data2	key	data1
0	3	a	0
1	3	a	1
2	3	a	3
3	4	b	2
4	5	c	4

Pandas - Concatenate data

In [88]:

df1 = pd.DataFrame(np.arange(12).reshape(3,4), index=['One', 'Two', 'Three'], columns=['A', 'B', 'C', 'D'])
df1

Out[88]:

	A	B	C	D
One	0	1	2	3
Two	4	5	6	7
Three	8	9	10	11

In [89]:

df2 = pd.DataFrame(np.arange(8,17).reshape(3,3), index=['Three', 'Four', 'Five'],columns=['A', 'B', 'C'])
df2

Out[89]:

	A	B	C
Three	8	9	10
Four	11	12	13
Five	14	15	16

Pandas - Concatenate data

In [90]:

df3 = pd.concat([df1, df2])
df3

Out[90]:

	A	B	C	D
One	0	1	2	3
Two	4	5	6	7
Three	8	9	10	11
Three	8	9	10	NaN
Four	11	12	13	NaN
Five	14	15	16	NaN

Pandas - Drop duplicates

In [91]:

df3 = df3.drop_duplicates(subset=['A'],)
df3

Out[91]:

	A	B	C	D
One	0	1	2	3
Two	4	5	6	7
Three	8	9	10	11
Four	11	12	13	NaN
Five	14	15	16	NaN

Pandas - Time Series

In [92]:

from datetime import datetime
df3.index=[datetime(2014, 1, 3), datetime(2014, 1, 6), datetime(2014, 1, 14), datetime(2014, 2, 1), datetime(2014, 3, 25)]
df3.index.name = 'Dates'
df3.columns.name = 'Class'
df3

Out[92]:

Class	A	B	C	D
Dates
2014-01-03	0	1	2	3
2014-01-06	4	5	6	7
2014-01-14	8	9	10	11
2014-02-01	11	12	13	NaN
2014-03-25	14	15	16	NaN

In [93]:

df3.index[0]

Out[93]:

Timestamp('2014-01-03 00:00:00')

Pandas - Plotting

Extends matplotlib plotting
Automatic labeling

In [94]:

df3.plot();

Pandas use case: Personal expenses

Typical account statement

In [98]:

df.head()

Out[98]:

	Value	Descr	Category
Date
2013-11-14	-13.48	Handelsblatt	Media
2013-11-14	-12.9	Aral	Car
2013-11-27	1659.69	Big Company Salary	Salary
2013-11-28	-13.28	Aldi	Food
2013-11-29	-12.61	Mobil phone Company	Cell phone

Categorized by commandline tool: pyAccount

Pandas grouping mechanism

Sum of categorized expenses

In [99]:

group_sums=df.groupby('Category').sum()
group_sums.head()

Out[99]:

	Value	Descr
Category
Car	-149.97	Aral Esso Aral Aral Esso Aral Aral Aral Esso E...
Cash	-131.28	ATM Foreign Bank ATM Foreign Bank ATM MyBank A...
Cell phone	-83.24	Mobil phone Company Mobil phone Company Mobil ...
Clothes	386.14	Clothes Shop Clothes Shop Clothes Shop Clothes...
Ensurance	-72.97	Ensurance Company Ensurance Company Ensurance ...

Pandas pie chart

Select subset of categories

In [101]:

pie_data = -group_sums.ix[['Car', 'Cash', 'Food', 'Kids', 'Media','Restaurant', 'Sports'],'Value']
pie_data.plot(kind='pie', title='Sum of categorized expenses');

Restrict time span index

Sum of categorized expenses restricted to certain timespan

In [102]:

df['2/1/2014':'4/1/2014'].groupby('Category').sum().head()

Out[102]:

	Value	Descr
Category
Car	-58.28	Aral Aral Aral Esso Esso
Cash	-34.74	ATM Foreign Bank ATM MyBank ATM MyBank
Cell phone	-28.71	Mobil phone Company Mobil phone Company
Clothes	450.67	Clothes Shop Clothes Shop Clothes Shop Clothes...
Ensurance	-40.00	Ensurance Company Ensurance Company Ensurance ...

Get time series of specific category

In [103]:

food = df.groupby('Category').get_group('Food')
food.head(3)

Out[103]:

	Value	Descr	Category
Date
2013-11-28	-13.28	Aldi	Food
2013-12-02	-14.27	Rewe	Food
2013-12-02	-11.93	Rewe	Food

In [105]:

food.plot(style = 'o-');

Get monthly sum by resampling and converting timestamp to period

In [106]:

food_monthly_timestamp = food['Value'].resample('M', how='sum')
food_monthly_timestamp.head(4)

Out[106]:

Date
2013-11-30    -13.28
2013-12-31   -101.09
2014-01-31    -38.00
2014-02-28   -122.26
Freq: M, Name: Value, dtype: float64

In [107]:

food_monthly = food_monthly_timestamp.to_period()
food_monthly.head(4)

Out[107]:

Date
2013-11    -13.28
2013-12   -101.09
2014-01    -38.00
2014-02   -122.26
Freq: M, Name: Value, dtype: float64

Get monthly sum by resampling and converting timestamp to period

In [109]:

food_monthly.plot(style = 'o-');

Relative change in monthly food expenses

In [110]:

food_change = food_monthly / food_monthly.shift(-1) -1
food_change[:-3]

Out[110]:

Date
2013-11   -0.868632
2013-12    1.660263
2014-01   -0.689187
2014-02   -0.170106
Freq: M, Name: Value, dtype: float64

In [112]:

food_change.plot(style = 'o-');

Monthly expenses categorized

In [113]:

me = pd.DataFrame()
for group in ['Car', 'Cash', 'Food', 'Kids', 'Media','Restaurant']:
    group_expenses = df.groupby('Category').get_group(group)
    group_expenses_monthly = -group_expenses['Value'].resample('M', how='sum', kind='period')
    me[group] = group_expenses_monthly
me

Out[113]:

	Car	Cash	Food	Kids	Media	Restaurant
Date
2013-11	12.90	NaN	13.28	NaN	13.48	NaN
2013-12	14.37	23.52	101.09	11.40	14.86	25.68
2014-01	36.13	22.49	38.00	10.06	13.48	27.83
2014-02	32.34	11.51	122.26	10.91	13.94	23.57
2014-03	25.94	12.47	147.32	10.10	14.77	27.33
2014-04	13.35	49.64	35.75	13.32	14.06	38.62
2014-05	14.94	11.65	26.50	11.65	11.73	12.47

Monthly expenses categorized

In [115]:

me.plot(kind='barh', stacked=True, alpha=0.8);

Data Analysis and Visualization with Python

Tobias Stollenwerk

German Aerospace Center (DLR)

Simulation and Software Technology

Outlook

Literature

IPython Notebook

Numpy

Numpy ndarray - Multidimensional Array Object

Numpy - Easy creation and reshaping

Numpy - Indexing and Slicing

Numpy - Indexing and Slicing

Numpy - Boolean Indexing

Numpy - Linear Algebra

Numpy - Linear Algebra

Matplotlib

Matplotlib - Plotting

Matplotlib - Labels

Matplotlib - Subplots and Styles

Pandas

Pandas - Data Objects

Pandas - Get columns

Pandas - Indexing and Slicing

Pandas - Function application

Pandas - Merge data

Pandas - Merge data

Pandas - Concatenate data

Pandas - Concatenate data

Pandas - Drop duplicates

Pandas - Time Series

Pandas - Plotting

Pandas use case: Personal expenses

Pandas grouping mechanism

Pandas pie chart

Restrict time span index

Get time series of specific category

Get monthly sum by resampling and converting timestamp to period

Get monthly sum by resampling and converting timestamp to period

Relative change in monthly food expenses

Monthly expenses categorized

Monthly expenses categorized

Thank You