Pandas Crash Course :: Part 1

Following the crash course on NumPy, we continue with another package used to manipulate data: pandas. It follows the same spirit as NumPy, avoiding for loops through vectorized functions. The big difference is that the DataFrame data structure will typically hold heterogeneous data: numbers, strings, etc. This file follows the presentation of the book Python for Data Analysis by Wes McKinney.

Series Data Structure

In [1]:
# the usual way to import pandas
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import os
from alpha_vantage.timeseries import TimeSeries
In [2]:
plt.rcParams['figure.figsize'] = (20.0, 10.0)
In [3]:
'''
There are two basic data structures in pandas: Series (a one-dimensional labeled
array, often used for time series) and DataFrame (a two-dimensional table, used
for panel data). We start with the Series.
'''

obj = pd.Series([1,2,4,8.9])
'''
Basically, it is a function that transforms a list into an indexed array.
'''
obj
Out[3]:
0    1.0
1    2.0
2    4.0
3    8.9
dtype: float64
In [4]:
# we can access the index and values of the Series separately
print(obj.values,end='\n\n')
print(obj.index)
[1.  2.  4.  8.9]

RangeIndex(start=0, stop=4, step=1)
In [5]:
# you can give explicit index values
obj2 = pd.Series([1,3.4,4],index = ['a','b','c'])
obj2
Out[5]:
a    1.0
b    3.4
c    4.0
dtype: float64
In [6]:
# you refer to elements of the Series in the same fashion as in NumPy
obj2['a']
Out[6]:
1.0
In [7]:
obj[[0,1]]
Out[7]:
0    1.0
1    2.0
dtype: float64
In [8]:
obj2[obj2>1.0]
Out[8]:
b    3.4
c    4.0
dtype: float64
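Boolean masks like the one above can also be combined. A small sketch (on a Series equivalent to obj2) showing that `&` and `|` require parentheses around each comparison, because they bind more tightly than the comparison operators:

```python
import pandas as pd

s = pd.Series([1, 3.4, 4], index=['a', 'b', 'c'])

# combine conditions with & (and) / | (or); the parentheses are mandatory
mask = (s > 1.0) & (s < 4.0)
print(s[mask])   # keeps only 'b'
```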
In [9]:
# make some vectorized operations
obj3 = obj2 *3 +1
obj3
Out[9]:
a     4.0
b    11.2
c    13.0
dtype: float64
In [10]:
# and apply numpy functions to it
print(np.log(obj3))
a    1.386294
b    2.415914
c    2.564949
dtype: float64
In [11]:
# another useful thing is to transform a Python dictionary into a series
sdata = {'Lisbon' : 10 , 'Porto' : 7 , 'Coimbra' : 2}
obj4 = pd.Series(sdata)
obj4
Out[11]:
Lisbon     10
Porto       7
Coimbra     2
dtype: int64
In [12]:
# it is possible to re-specify the index; labels without a matching key in the data get NaN
cities = ['Lisbon','Porto','Coimbra','Faro']
obj4 = pd.Series(sdata,index = cities)
obj4
Out[12]:
Lisbon     10.0
Porto       7.0
Coimbra     2.0
Faro        NaN
dtype: float64
In [13]:
# one particularly important task is to identify the entries that are NaN
pd.isnull(obj4)
Out[13]:
Lisbon     False
Porto      False
Coimbra    False
Faro        True
dtype: bool
In [14]:
# following object-oriented principles, obj4 is itself a Series, so
obj4.isnull()
Out[14]:
Lisbon     False
Porto      False
Coimbra    False
Faro        True
dtype: bool
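Besides identifying missing entries, we often want to fill or drop them. A short sketch using the standard fillna and dropna methods on the same data:

```python
import pandas as pd

sdata = {'Lisbon': 10, 'Porto': 7, 'Coimbra': 2}
obj4 = pd.Series(sdata, index=['Lisbon', 'Porto', 'Coimbra', 'Faro'])

filled = obj4.fillna(0)    # replace NaN by a constant
cleaned = obj4.dropna()    # or drop the missing entries altogether
print(filled)
print(cleaned)
```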
In [15]:
# index alignment is very useful in the treatment of data: operations match entries by label
newcities = ['Lisbon','Coimbra','Beja']
newsdata = {'Coimbra' : -1,'Lisbon' : 19}
obj5 = pd.Series(newsdata, index=newcities)
In [16]:
obj4+obj5
Out[16]:
Beja        NaN
Coimbra     1.0
Faro        NaN
Lisbon     29.0
Porto       NaN
dtype: float64
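If the NaN produced by alignment are unwanted, the arithmetic methods (add, sub, etc.) accept a fill_value that treats a label missing on one side as that value. A minimal sketch with two small series:

```python
import pandas as pd

s1 = pd.Series({'Lisbon': 10, 'Porto': 7})
s2 = pd.Series({'Lisbon': 19, 'Coimbra': -1})

# plain + would give NaN for Porto and Coimbra; fill_value=0 treats
# a label missing on one side as 0 before adding
total = s1.add(s2, fill_value=0)
print(total)
```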

DataFrame

In [17]:
# there are several alternative methods to create a DataFrame. As we will see later, reading from files is the most common.
# here we will do it from a dictionary of equal-length lists

data = {'city':['Lisbon','Porto','Boston'],'uppt':[4,3,10],'downpt':[-1,-2,-1]}
frame1 = pd.DataFrame(data)
frame1
Out[17]:
     city  uppt  downpt
0  Lisbon     4      -1
1   Porto     3      -2
2  Boston    10      -1
In [18]:
# we can choose whatever column order we like
pd.DataFrame(data,columns = ['uppt','downpt','city'])
Out[18]:
   uppt  downpt    city
0     4      -1  Lisbon
1     3      -2   Porto
2    10      -1  Boston
In [19]:
# no problem arises when creating a DataFrame with more columns than the data provides; the extra column is filled with NaN
frame2 = pd.DataFrame(data,columns=['uppt','downpt','city','visit'])
frame2
Out[19]:
   uppt  downpt    city visit
0     4      -1  Lisbon   NaN
1     3      -2   Porto   NaN
2    10      -1  Boston   NaN
In [20]:
# we can retrieve a column using dict-like notation
frame2['uppt']
Out[20]:
0     4
1     3
2    10
Name: uppt, dtype: int64
In [21]:
# or using the attribute
frame2.uppt
Out[21]:
0     4
1     3
2    10
Name: uppt, dtype: int64
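A caveat worth noting: attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation is the safe general form. A small illustration (the 'max temp' column is a made-up example):

```python
import pandas as pd

frame = pd.DataFrame({'uppt': [4, 3, 10], 'max temp': [20, 18, 25]})

print(frame.uppt.tolist())         # attribute access works for 'uppt'
print(frame['max temp'].tolist())  # a name with a space needs brackets
```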
In [22]:
# this is perhaps more useful when we think about the link between pandas and NumPy
mat1 = np.array([frame2.uppt,frame2.downpt])
print(mat1,end='\n\n')
mat1.mean(axis=0)
[[ 4  3 10]
 [-1 -2 -1]]

Out[22]:
array([1.5, 0.5, 4.5])
In [23]:
# information in the rows is accessed through the loc attribute
frame2.loc[1]
Out[23]:
uppt          3
downpt       -2
city      Porto
visit       NaN
Name: 1, dtype: object
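loc selects by label; its companion iloc selects by integer position. The difference only becomes visible once the labels no longer coincide with positions, as in this small sketch with a custom index:

```python
import pandas as pd

frame = pd.DataFrame({'city': ['Lisbon', 'Porto', 'Boston']},
                     index=[10, 20, 30])

print(frame.loc[20, 'city'])   # by label
print(frame.iloc[1]['city'])   # by position -- the same row here
```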
In [24]:
# we can create new columns just by assigning to a new name
frame2['bias']=1
In [25]:
# it is possible to swap and create new columns
frame2 = pd.DataFrame(frame2,columns = ['bias','city','uppt','downpt','visit','sports'])
frame2
Out[25]:
   bias    city  uppt  downpt visit sports
0     1  Lisbon     4      -1   NaN    NaN
1     1   Porto     3      -2   NaN    NaN
2     1  Boston    10      -1   NaN    NaN
In [26]:
# much more useful is to create new columns from series with specific indexes
tmp = pd.Series([-100],index=[0])
frame2['newcol'] = tmp
frame2
Out[26]:
   bias    city  uppt  downpt visit sports  newcol
0     1  Lisbon     4      -1   NaN    NaN  -100.0
1     1   Porto     3      -2   NaN    NaN     NaN
2     1  Boston    10      -1   NaN    NaN     NaN
In [27]:
# the method to delete columns
del frame2['sports']
frame2
Out[27]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
1     1   Porto     3      -2   NaN     NaN
2     1  Boston    10      -1   NaN     NaN
In [28]:
# the same thing could be achieved by using the drop method
frame2.drop(['visit'],axis=1)
Out[28]:
   bias    city  uppt  downpt  newcol
0     1  Lisbon     4      -1  -100.0
1     1   Porto     3      -2     NaN
2     1  Boston    10      -1     NaN
In [29]:
# we could also drop rows in similar fashion
print(frame2.drop([1],axis=0))

# notice the DataFrame did not change after this operation
frame2
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
2     1  Boston    10      -1   NaN     NaN
Out[29]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
1     1   Porto     3      -2   NaN     NaN
2     1  Boston    10      -1   NaN     NaN
In [30]:
# to make those changes effective we have to make an assignment
frame2 = frame2.drop([1],axis=0)
frame2
Out[30]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
2     1  Boston    10      -1   NaN     NaN
In [31]:
# the inverse process is done with the append method
frame2.append({'city':'NY'},ignore_index=True)

# notice that frame2 didn't change (see below)
frame2
Out[31]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
2     1  Boston    10      -1   NaN     NaN
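A note for readers on recent pandas versions: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current installs the same effect is obtained with pd.concat. A minimal sketch:

```python
import pandas as pd

frame = pd.DataFrame({'city': ['Lisbon', 'Boston']})

# append() is gone in pandas >= 2.0; concatenate a one-row frame instead
new_row = pd.DataFrame([{'city': 'NY'}])
frame = pd.concat([frame, new_row], ignore_index=True)
print(frame)
```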
In [32]:
# we can assign a constant value to a column or the values of a np.array
frame2 = frame2.append({'city':'NY'},ignore_index = True)
frame2.visit = 3
frame2
Out[32]:
   bias    city  uppt  downpt  visit  newcol
0   1.0  Lisbon   4.0    -1.0      3  -100.0
1   1.0  Boston  10.0    -1.0      3     NaN
2   NaN      NY   NaN     NaN      3     NaN
In [33]:
# a very useful method is dropna. Basically it returns the rows of a DataFrame without NaN entries.
frame2.dropna()
Out[33]:
   bias    city  uppt  downpt  visit  newcol
0   1.0  Lisbon   4.0    -1.0      3  -100.0
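dropna has some useful options: how='all' drops only rows where every entry is NaN, and subset restricts the check to specific columns. A short sketch on a toy frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 30.0]})

print(len(frame.dropna()))              # drop rows with any NaN
print(len(frame.dropna(how='all')))     # drop rows that are all-NaN
print(len(frame.dropna(subset=['a']))) # only check column 'a'
```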

Descriptive Statistics

In [34]:
# we generate a DataFrame from data in a np.array in the usual way
data = np.random.randn(20).reshape(10,2)
frame3 = pd.DataFrame(data,columns = ['set_a','set_b'])
frame3
Out[34]:
      set_a     set_b
0 -0.880350  0.984865
1 -0.243827  0.574113
2  0.510317  1.044272
3 -0.979011 -0.929865
4  0.024407  0.413863
5  0.057795 -1.586637
6 -0.308899 -0.213102
7 -0.345220 -0.218106
8  1.838912  0.438492
9 -0.399891 -1.873529
In [35]:
# we refer to the first 4 rows of one of the columns of a DataFrame by doing
frame3['set_a'][0:4]
Out[35]:
0   -0.880350
1   -0.243827
2    0.510317
3   -0.979011
Name: set_a, dtype: float64
In [36]:
# by default the sums are made along the columns
frame3.sum()
Out[36]:
set_a   -0.725767
set_b   -1.365635
dtype: float64
In [37]:
# the usual method to sum along the rows
frame3.sum(axis=1)
Out[37]:
0    0.104515
1    0.330285
2    1.554589
3   -1.908876
4    0.438270
5   -1.528842
6   -0.522001
7   -0.563326
8    2.277404
9   -2.273420
dtype: float64
In [38]:
# we run this instruction to create a NaN
frame3 = frame3.append({'set_a':0},ignore_index = True)
frame3
Out[38]:
       set_a     set_b
0  -0.880350  0.984865
1  -0.243827  0.574113
2   0.510317  1.044272
3  -0.979011 -0.929865
4   0.024407  0.413863
5   0.057795 -1.586637
6  -0.308899 -0.213102
7  -0.345220 -0.218106
8   1.838912  0.438492
9  -0.399891 -1.873529
10  0.000000       NaN
In [39]:
# by default the NaN are skipped.
# This makes it possible to return a value for the sum even in the case when there are NaN entries.
print(frame3.sum(),end='\n\n')

# otherwise the sum will not return a value.
print(frame3.sum(skipna=False))
set_a   -0.725767
set_b   -1.365635
dtype: float64

set_a   -0.725767
set_b         NaN
dtype: float64
In [40]:
# in a similar fashion we may compute the cumulative sum of one of the columns
frame3['set_a'].cumsum()
Out[40]:
0    -0.880350
1    -1.124177
2    -0.613860
3    -1.592871
4    -1.568464
5    -1.510669
6    -1.819567
7    -2.164788
8    -0.325876
9    -0.725767
10   -0.725767
Name: set_a, dtype: float64
In [41]:
# the NaN entry stays NaN in the cumulative sum; with skipna=False everything after it would also become NaN
frame3['set_b'].cumsum()
Out[41]:
0     0.984865
1     1.558978
2     2.603249
3     1.673384
4     2.087247
5     0.500610
6     0.287508
7     0.069402
8     0.507893
9    -1.365635
10         NaN
Name: set_b, dtype: float64
In [42]:
frame3.mean()
Out[42]:
set_a   -0.065979
set_b   -0.136564
dtype: float64
In [43]:
frame3.idxmin()
Out[43]:
set_a    3
set_b    9
dtype: int64
In [44]:
# to get a general picture
frame3.describe()
Out[44]:
           set_a      set_b
count  11.000000  10.000000
mean   -0.065979  -0.136564
std     0.759310   1.030226
min    -0.979011  -1.873529
25%    -0.372556  -0.751925
50%    -0.243827   0.100380
75%     0.041101   0.540207
max     1.838912   1.044272
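describe bundles a fixed set of statistics; if we only want some of them, the agg method accepts a list of function names. A sketch on a freshly generated frame (the data here is random, so the numbers differ from the output above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((10, 2)),
                     columns=['set_a', 'set_b'])

# compute only the statistics we care about, one row per function
summary = frame.agg(['mean', 'std', 'min', 'max'])
print(summary)
```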

Online Data

There are several places where you can collect data online. The API depends on where you are getting the data. Here we will focus on a specific source, Alpha Vantage, and the package alpha-vantage, a wrapper for interacting with this database. To download the package, run

git clone https://github.com/RomelTorres/alpha_vantage.git

To install alpha_vantage in conda, run on the command line:

conda install alpha_vantage -c hoishing

Examples and documentation of the library may be found in AV Example by Romel Torres.

In order to run the examples below you need to get a free API key from the website of Alpha Vantage and store it as an environment variable by adding

export AV_API_KEY=yourkey

to your bash_profile. To find the ticker symbols use this link.
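Before creating the TimeSeries object it is worth checking that the key is actually visible to Python. A small defensive sketch using os.environ (the helper name get_api_key is our own):

```python
import os

def get_api_key(env=os.environ):
    # fail early with a clear message if the key was not exported
    key = env.get('AV_API_KEY')
    if key is None:
        raise RuntimeError('AV_API_KEY is not set; export it in your shell profile')
    return key
```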

In [45]:
# this creates a time series variable with the API key stored in the os and the output format.
ts = TimeSeries(key=os.environ['AV_API_KEY'], output_format = 'pandas')
In [46]:
# this creates an intraday lookup of the value of some stock
data, meta_data = ts.get_intraday(symbol='JPM',interval='1min', outputsize='full')
# We can describe it
data.describe()
Out[46]:
           1. open      2. high       3. low     4. close     5. volume
count  1659.000000  1659.000000  1659.000000  1659.000000  1.659000e+03
mean    109.407138   109.438627   109.375511   109.406862  2.969818e+04
std       1.132770     1.133921     1.133272     1.134359  6.921557e+04
min     107.520000   107.540000   107.500000   107.520000  0.000000e+00
25%     108.750000   108.780000   108.730000   108.747500  1.346050e+04
50%     109.110000   109.137000   109.080000   109.110000  2.107400e+04
75%     110.077500   110.095000   110.030000   110.070000  3.307300e+04
max     112.430000   112.465000   112.410000   112.430000  2.497278e+06
In [47]:
data['4. close'].plot()
plt.title('Intraday Times Series for the JP Morgan stock (1 min)')
plt.grid()
plt.show()
In [48]:
data2, meta_data2 = ts.get_daily(symbol='MSFT', outputsize='full')
# We can describe it
data2.describe()
Out[48]:
           1. open      2. high       3. low     4. close     5. volume
count  5249.000000  5249.000000  5249.000000  5249.000000  5.249000e+03
mean     48.599953    49.197694    48.006461    48.614349  4.756139e+07
std      29.113364    29.562008    28.646036    29.136533  2.890383e+07
min      15.200000    15.620000    14.870000    15.150000  3.458100e+06
25%      27.050000    27.270000    26.850000    27.070000  2.790060e+07
50%      34.450000    34.780000    34.090000    34.440000  4.282870e+07
75%      63.250000    64.070000    62.470000    63.280000  6.014840e+07
max     178.940000   180.380000   175.750000   179.940000  5.910522e+08
In [49]:
data2['4. close'].plot()
plt.grid()
plt.title('Historical closing prices of Microsoft.')
plt.show()
In [50]:
# we run the same commands for Apple
data3 , metadata3 = ts.get_daily(symbol='AAPL',outputsize='full')
In [51]:
data3.describe()
Out[51]:
           1. open      2. high       3. low     4. close     5. volume
count  5249.000000  5249.000000  5249.000000  5249.000000  5.249000e+03
mean    158.294142   160.067331   156.347925   158.245793  2.119058e+07
std     161.396496   162.675973   159.853490   161.283246  1.823115e+07
min      12.990000    13.190000    12.720000    13.120000  7.025000e+05
25%      37.930000    38.560000    37.130000    37.940000  6.686200e+06
50%     106.270000   107.809000   104.860000   106.030000  1.675100e+07
75%     187.600000   188.910000   185.500000   187.210000  2.972610e+07
max     702.410000   705.070000   699.570000   702.100000  1.895606e+08
In [52]:
data4 , metadata = ts.get_daily(symbol='GOOG',outputsize='full')
In [53]:
data_total = {'AAPL': data3['4. close'] , 'MSFT': data2['4. close'], 'GOOG': data4['4. close']}
df = pd.DataFrame(data_total)
In [54]:
df.columns
Out[54]:
Index(['AAPL', 'MSFT', 'GOOG'], dtype='object')
In [55]:
df.head()
Out[55]:
             AAPL    MSFT  GOOG
1998-01-02  16.25  131.13   NaN
1998-01-05  15.88  130.38   NaN
1998-01-06  18.94  131.13   NaN
1998-01-07  17.50  129.56   NaN
1998-01-08  18.19  130.50   NaN
In [56]:
df['AAPL'].plot(legend='AAPL')
df['MSFT'].plot(legend='MSFT')
df['GOOG'].plot(legend='GOOG')
plt.grid()
plt.show()
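A natural next step when comparing stocks is to look at returns, or at prices rebased to a common starting value, rather than raw levels. A sketch using the standard pct_change method, demonstrated on a tiny hypothetical price frame since the downloaded data depends on the API:

```python
import pandas as pd

prices = pd.DataFrame({'AAPL': [100.0, 110.0, 121.0],
                       'MSFT': [50.0, 55.0, 66.0]})

# daily simple returns: (p_t - p_{t-1}) / p_{t-1}
returns = prices.pct_change()

# rebase every series to 1 at the first observation for visual comparison
rebased = prices / prices.iloc[0]
print(returns)
print(rebased)
```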