Pandas Crash Course :: Part 1

Following the crash course on NumPy, we continue with another package used to manipulate data: pandas. It follows the same spirit as NumPy, avoiding for loops through vectorized functions. The big difference is that the DataFrame data structure will typically hold heterogeneous data: numbers, strings, etc. This file follows the presentation of the book Python for Data Analysis by Wes McKinney.

Series Data Structure

In [1]:
# the usual way to import pandas
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import os
from alpha_vantage.timeseries import TimeSeries
In [2]:
plt.rcParams['figure.figsize'] = (20.0, 10.0)
In [3]:
'''
There are two basic data structures in pandas: Series (a one-dimensional labeled
array, often used for time series) and DataFrame (a two-dimensional table, used
for panel data). We start with the Series.
'''

obj = pd.Series([1,2,4,8.9])
'''
Basically, it is a function that transforms a list into an indexed array.
'''
obj
Out[3]:
0    1.0
1    2.0
2    4.0
3    8.9
dtype: float64
In [4]:
# we can access the index and values of the Series separately
print(obj.values,end='\n\n')
print(obj.index)
[1.  2.  4.  8.9]

RangeIndex(start=0, stop=4, step=1)
In [5]:
# you can give explicit index values
obj2 = pd.Series([1,3.4,4],index = ['a','b','c'])
obj2
Out[5]:
a    1.0
b    3.4
c    4.0
dtype: float64
In [6]:
# you refer to elements of the Series in the same fashion as in NumPy
obj2['a']
Out[6]:
1.0
In [7]:
obj[[0,1]]
Out[7]:
0    1.0
1    2.0
dtype: float64
In [8]:
obj2[obj2>1.0]
Out[8]:
b    3.4
c    4.0
dtype: float64
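Boolean masks like the one above can also be combined. A small sketch (on a Series equivalent to obj2) showing that `&` and `|` require parentheses around each comparison, because they bind more tightly than the comparison operators:

```python
import pandas as pd

s = pd.Series([1, 3.4, 4], index=['a', 'b', 'c'])

# combine conditions with & (and) / | (or); the parentheses are mandatory
mask = (s > 1.0) & (s < 4.0)
print(s[mask])   # keeps only 'b'
```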
In [9]:
# make some vectorized operations
obj3 = obj2 *3 +1
obj3
Out[9]:
a     4.0
b    11.2
c    13.0
dtype: float64
In [10]:
# and apply numpy functions to it
print(np.log(obj3))
a    1.386294
b    2.415914
c    2.564949
dtype: float64
In [11]:
# another useful thing is to transform a Python dictionary into a series
sdata = {'Lisbon' : 10 , 'Porto' : 7 , 'Coimbra' : 2}
obj4 = pd.Series(sdata)
obj4
Out[11]:
Lisbon     10
Porto       7
Coimbra     2
dtype: int64
In [12]:
# it is possible to re-specify the index; labels without a matching key in the data get NaN
cities = ['Lisbon','Porto','Coimbra','Faro']
obj4 = pd.Series(sdata,index = cities)
obj4
Out[12]:
Lisbon     10.0
Porto       7.0
Coimbra     2.0
Faro        NaN
dtype: float64
In [13]:
# one particularly important task is to identify the entries that are NaN
pd.isnull(obj4)
Out[13]:
Lisbon     False
Porto      False
Coimbra    False
Faro        True
dtype: bool
In [14]:
# following object-oriented principles, obj4 is itself a Series, so
obj4.isnull()
Out[14]:
Lisbon     False
Porto      False
Coimbra    False
Faro        True
dtype: bool
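Besides identifying missing entries, we often want to fill or drop them. A short sketch using the standard fillna and dropna methods on the same data:

```python
import pandas as pd

sdata = {'Lisbon': 10, 'Porto': 7, 'Coimbra': 2}
obj4 = pd.Series(sdata, index=['Lisbon', 'Porto', 'Coimbra', 'Faro'])

filled = obj4.fillna(0)    # replace NaN by a constant
cleaned = obj4.dropna()    # or drop the missing entries altogether
print(filled)
print(cleaned)
```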
In [15]:
# index alignment is very useful in the treatment of data: operations match entries by label
newcities = ['Lisbon','Coimbra','Beja']
newsdata = {'Coimbra' : -1,'Lisbon' : 19}
obj5 = pd.Series(newsdata, index=newcities)
In [16]:
obj4+obj5
Out[16]:
Beja        NaN
Coimbra     1.0
Faro        NaN
Lisbon     29.0
Porto       NaN
dtype: float64
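If the NaN produced by alignment are unwanted, the arithmetic methods (add, sub, etc.) accept a fill_value that treats a label missing on one side as that value. A minimal sketch with two small series:

```python
import pandas as pd

s1 = pd.Series({'Lisbon': 10, 'Porto': 7})
s2 = pd.Series({'Lisbon': 19, 'Coimbra': -1})

# plain + would give NaN for Porto and Coimbra; fill_value=0 treats
# a label missing on one side as 0 before adding
total = s1.add(s2, fill_value=0)
print(total)
```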

DataFrame

In [17]:
# there are several alternative methods to create a DataFrame. As we will see later, reading from files is the most common.
# here we will do it from a dictionary of equal-length lists

data = {'city':['Lisbon','Porto','Boston'],'uppt':[4,3,10],'downpt':[-1,-2,-1]}
frame1 = pd.DataFrame(data)
frame1
Out[17]:
     city  uppt  downpt
0  Lisbon     4      -1
1   Porto     3      -2
2  Boston    10      -1
In [18]:
# we can choose whatever column order we like
pd.DataFrame(data,columns = ['uppt','downpt','city'])
Out[18]:
   uppt  downpt    city
0     4      -1  Lisbon
1     3      -2   Porto
2    10      -1  Boston
In [19]:
# no problem arises when creating a DataFrame with more columns than the data provides; the extra column is filled with NaN
frame2 = pd.DataFrame(data,columns=['uppt','downpt','city','visit'])
frame2
Out[19]:
   uppt  downpt    city visit
0     4      -1  Lisbon   NaN
1     3      -2   Porto   NaN
2    10      -1  Boston   NaN
In [20]:
# we can retrieve a column using dict-like notation
frame2['uppt']
Out[20]:
0     4
1     3
2    10
Name: uppt, dtype: int64
In [21]:
# or using the attribute
frame2.uppt
Out[21]:
0     4
1     3
2    10
Name: uppt, dtype: int64
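A caveat worth noting: attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation is the safe general form. A small illustration (the 'max temp' column is a made-up example):

```python
import pandas as pd

frame = pd.DataFrame({'uppt': [4, 3, 10], 'max temp': [20, 18, 25]})

print(frame.uppt.tolist())         # attribute access works for 'uppt'
print(frame['max temp'].tolist())  # a name with a space needs brackets
```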
In [22]:
# this is perhaps more useful when we think about the link between pandas and NumPy
mat1 = np.array([frame2.uppt,frame2.downpt])
print(mat1,end='\n\n')
mat1.mean(axis=0)
[[ 4  3 10]
 [-1 -2 -1]]

Out[22]:
array([1.5, 0.5, 4.5])
In [23]:
# information in the rows is accessed through the loc attribute
frame2.loc[1]
Out[23]:
uppt          3
downpt       -2
city      Porto
visit       NaN
Name: 1, dtype: object
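loc selects by label; its companion iloc selects by integer position. The difference only becomes visible once the labels no longer coincide with positions, as in this small sketch with a custom index:

```python
import pandas as pd

frame = pd.DataFrame({'city': ['Lisbon', 'Porto', 'Boston']},
                     index=[10, 20, 30])

print(frame.loc[20, 'city'])   # by label
print(frame.iloc[1]['city'])   # by position -- the same row here
```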
In [24]:
# we can create new columns just by assigning to a new name
frame2['bias']=1
In [25]:
# it is possible to swap and create new columns
frame2 = pd.DataFrame(frame2,columns = ['bias','city','uppt','downpt','visit','sports'])
frame2
Out[25]:
   bias    city  uppt  downpt visit sports
0     1  Lisbon     4      -1   NaN    NaN
1     1   Porto     3      -2   NaN    NaN
2     1  Boston    10      -1   NaN    NaN
In [26]:
# much more useful is to create new columns from series with specific indexes
tmp = pd.Series([-100],index=[0])
frame2['newcol'] = tmp
frame2
Out[26]:
   bias    city  uppt  downpt visit sports  newcol
0     1  Lisbon     4      -1   NaN    NaN  -100.0
1     1   Porto     3      -2   NaN    NaN     NaN
2     1  Boston    10      -1   NaN    NaN     NaN
In [27]:
# the method to delete columns
del frame2['sports']
frame2
Out[27]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
1     1   Porto     3      -2   NaN     NaN
2     1  Boston    10      -1   NaN     NaN
In [28]:
# the same thing could be achieved by using the drop method
frame2.drop(['visit'],axis=1)
Out[28]:
   bias    city  uppt  downpt  newcol
0     1  Lisbon     4      -1  -100.0
1     1   Porto     3      -2     NaN
2     1  Boston    10      -1     NaN
In [29]:
# we could also drop rows in similar fashion
print(frame2.drop([1],axis=0))

# notice the DataFrame did not change after this operation
frame2
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
2     1  Boston    10      -1   NaN     NaN
Out[29]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
1     1   Porto     3      -2   NaN     NaN
2     1  Boston    10      -1   NaN     NaN
In [30]:
# to make those changes effective we have to make an assignment
frame2 = frame2.drop([1],axis=0)
frame2
Out[30]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
2     1  Boston    10      -1   NaN     NaN
In [31]:
# the inverse process is done with the append method
frame2.append({'city':'NY'},ignore_index=True)

# notice that frame2 didn't change (see below)
frame2
Out[31]:
   bias    city  uppt  downpt visit  newcol
0     1  Lisbon     4      -1   NaN  -100.0
2     1  Boston    10      -1   NaN     NaN
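A note for readers on recent pandas versions: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current installs the same effect is obtained with pd.concat. A minimal sketch:

```python
import pandas as pd

frame = pd.DataFrame({'city': ['Lisbon', 'Boston']})

# append() is gone in pandas >= 2.0; concatenate a one-row frame instead
new_row = pd.DataFrame([{'city': 'NY'}])
frame = pd.concat([frame, new_row], ignore_index=True)
print(frame)
```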
In [32]:
# we can assign a constant value to a column or the values of a np.array
frame2 = frame2.append({'city':'NY'},ignore_index = True)
frame2.visit = 3
frame2
Out[32]:
   bias    city  uppt  downpt  visit  newcol
0   1.0  Lisbon   4.0    -1.0      3  -100.0
1   1.0  Boston  10.0    -1.0      3     NaN
2   NaN      NY   NaN     NaN      3     NaN
In [33]:
# a very useful method is dropna. Basically it returns the rows of a DataFrame without NaN entries.
frame2.dropna()
Out[33]:
   bias    city  uppt  downpt  visit  newcol
0   1.0  Lisbon   4.0    -1.0      3  -100.0
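dropna has some useful options: how='all' drops only rows where every entry is NaN, and subset restricts the check to specific columns. A short sketch on a toy frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 30.0]})

print(len(frame.dropna()))              # drop rows with any NaN
print(len(frame.dropna(how='all')))     # drop rows that are all-NaN
print(len(frame.dropna(subset=['a']))) # only check column 'a'
```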

Descriptive Statistics

In [34]:
# we generate a DataFrame from data in a np.array in the usual way
data = np.random.randn(20).reshape(10,2)
frame3 = pd.DataFrame(data,columns = ['set_a','set_b'])
frame3
Out[34]:
      set_a     set_b
0 -0.880350  0.984865
1 -0.243827  0.574113
2  0.510317  1.044272
3 -0.979011 -0.929865
4  0.024407  0.413863
5  0.057795 -1.586637
6 -0.308899 -0.213102
7 -0.345220 -0.218106
8  1.838912  0.438492
9 -0.399891 -1.873529
In [35]:
# we refer to the first 4 rows of one of the columns of a DataFrame by doing
frame3['set_a'][0:4]
Out[35]:
0   -0.880350
1   -0.243827
2    0.510317
3   -0.979011
Name: set_a, dtype: float64
In [36]:
# by default the sums are made along the columns
frame3.sum()
Out[36]:
set_a   -0.725767
set_b   -1.365635
dtype: float64
In [37]:
# the usual method to sum along the rows
frame3.sum(axis=1)
Out[37]:
0    0.104515
1    0.330285
2    1.554589
3   -1.908876
4    0.438270
5   -1.528842
6   -0.522001
7   -0.563326
8    2.277404
9   -2.273420
dtype: float64
In [38]:
# we run this instruction to create a NaN
frame3 = frame3.append({'set_a':0},ignore_index = True)
frame3
Out[38]:
       set_a     set_b
0  -0.880350  0.984865
1  -0.243827  0.574113
2   0.510317  1.044272
3  -0.979011 -0.929865
4   0.024407  0.413863
5   0.057795 -1.586637
6  -0.308899 -0.213102
7  -0.345220 -0.218106
8   1.838912  0.438492
9  -0.399891 -1.873529
10  0.000000       NaN
In [39]:
# by default the NaN are skipped.
# This makes it possible to return a value for the sum even in the case when there are NaN entries.
print(frame3.sum(),end='\n\n')

# otherwise the sum will not return a value.
print(frame3.sum(skipna=False))
set_a   -0.725767
set_b   -1.365635
dtype: float64

set_a   -0.725767
set_b         NaN
dtype: float64
In [40]:
# in a similar fashion we may compute the cumulative sum of one of the columns
frame3['set_a'].cumsum()
Out[40]:
0    -0.880350
1    -1.124177
2    -0.613860
3    -1.592871
4    -1.568464
5    -1.510669
6    -1.819567
7    -2.164788
8    -0.325876
9    -0.725767
10   -0.725767
Name: set_a, dtype: float64
In [41]:
# the NaN entry stays NaN in the cumulative sum; with skipna=False everything after it would also become NaN
frame3['set_b'].cumsum()
Out[41]:
0     0.984865
1     1.558978
2     2.603249
3     1.673384
4     2.087247
5     0.500610
6     0.287508
7     0.069402
8     0.507893
9    -1.365635
10         NaN
Name: set_b, dtype: float64
In [42]:
frame3.mean()
Out[42]:
set_a   -0.065979
set_b   -0.136564
dtype: float64
In [43]:
frame3.idxmin()
Out[43]:
set_a    3
set_b    9
dtype: int64
In [44]:
# to get a general picture
frame3.describe()
Out[44]:
           set_a      set_b
count  11.000000  10.000000
mean   -0.065979  -0.136564
std     0.759310   1.030226
min    -0.979011  -1.873529
25%    -0.372556  -0.751925
50%    -0.243827   0.100380
75%     0.041101   0.540207
max     1.838912   1.044272
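describe bundles a fixed set of statistics; if we only want some of them, the agg method accepts a list of function names. A sketch on a freshly generated frame (the data here is random, so the numbers differ from the output above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((10, 2)),
                     columns=['set_a', 'set_b'])

# compute only the statistics we care about, one row per function
summary = frame.agg(['mean', 'std', 'min', 'max'])
print(summary)
```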

Online Data

There are several places where you can collect data online. The API depends on where you are getting the data. Here we will focus on a specific source, Alpha Vantage, and the package alpha-vantage, a wrapper for interacting with this database. To download the package, run

git clone https://github.com/RomelTorres/alpha_vantage.git

To install alpha_vantage in conda, run on the command line:

conda install alpha_vantage -c hoishing

Examples and documentation of the library may be found in AV Example by Romel Torres.

In order to run the examples below you need to get a free API key from the website of Alpha Vantage and store it as an environment variable by adding

export AV_API_KEY=yourkey

to your bash_profile. To find the ticker symbols use this link.
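Before creating the TimeSeries object it is worth checking that the key is actually visible to Python. A small defensive sketch using os.environ (the helper name get_api_key is our own):

```python
import os

def get_api_key(env=os.environ):
    # fail early with a clear message if the key was not exported
    key = env.get('AV_API_KEY')
    if key is None:
        raise RuntimeError('AV_API_KEY is not set; export it in your shell profile')
    return key
```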

In [45]:
# this creates a time series variable with the API key stored in the os and the output format.
ts = TimeSeries(key=os.environ['AV_API_KEY'], output_format = 'pandas')
In [46]:
# this creates an intraday lookup of the value of some stock
data, meta_data = ts.get_intraday(symbol='JPM',interval='1min', outputsize='full')
# We can describe it
data.describe()
Out[46]:
           1. open      2. high       3. low     4. close     5. volume
count  1659.000000  1659.000000  1659.000000  1659.000000  1.659000e+03
mean    109.407138   109.438627   109.375511   109.406862  2.969818e+04
std       1.132770     1.133921     1.133272     1.134359  6.921557e+04
min     107.520000   107.540000   107.500000   107.520000  0.000000e+00
25%     108.750000   108.780000   108.730000   108.747500  1.346050e+04
50%     109.110000   109.137000   109.080000   109.110000  2.107400e+04
75%     110.077500   110.095000   110.030000   110.070000  3.307300e+04
max     112.430000   112.465000   112.410000   112.430000  2.497278e+06
In [47]:
data['4. close'].plot()
plt.title('Intraday Times Series for the JP Morgan stock (1 min)')
plt.grid()
plt.show()
In [48]:
data2, meta_data2 = ts.get_daily(symbol='MSFT', outputsize='full')
# We can describe it
data2.describe()
Out[48]:
           1. open      2. high       3. low     4. close     5. volume
count  5249.000000  5249.000000  5249.000000  5249.000000  5.249000e+03
mean     48.599953    49.197694    48.006461    48.614349  4.756139e+07
std      29.113364    29.562008    28.646036    29.136533  2.890383e+07
min      15.200000    15.620000    14.870000    15.150000  3.458100e+06
25%      27.050000    27.270000    26.850000    27.070000  2.790060e+07
50%      34.450000    34.780000    34.090000    34.440000  4.282870e+07
75%      63.250000    64.070000    62.470000    63.280000  6.014840e+07
max     178.940000   180.380000   175.750000   179.940000  5.910522e+08
In [49]:
data2['4. close'].plot()
plt.grid()
plt.title('Historical closing prices of Microsoft.')
plt.show()
In [50]:
# we run the same commands for Apple
data3 , metadata3 = ts.get_daily(symbol='AAPL',outputsize='full')
In [51]:
data3.describe()
Out[51]:
           1. open      2. high       3. low     4. close     5. volume
count  5249.000000  5249.000000  5249.000000  5249.000000  5.249000e+03
mean    158.294142   160.067331   156.347925   158.245793  2.119058e+07
std     161.396496   162.675973   159.853490   161.283246  1.823115e+07
min      12.990000    13.190000    12.720000    13.120000  7.025000e+05
25%      37.930000    38.560000    37.130000    37.940000  6.686200e+06
50%     106.270000   107.809000   104.860000   106.030000  1.675100e+07
75%     187.600000   188.910000   185.500000   187.210000  2.972610e+07
max     702.410000   705.070000   699.570000   702.100000  1.895606e+08
In [52]:
data4 , metadata = ts.get_daily(symbol='GOOG',outputsize='full')
In [53]:
data_total = {'AAPL': data3['4. close'] , 'MSFT': data2['4. close'], 'GOOG': data4['4. close']}
df = pd.DataFrame(data_total)
In [54]:
df.columns
Out[54]:
Index(['AAPL', 'MSFT', 'GOOG'], dtype='object')
In [55]:
df.head()
Out[55]:
             AAPL    MSFT  GOOG
1998-01-02  16.25  131.13   NaN
1998-01-05  15.88  130.38   NaN
1998-01-06  18.94  131.13   NaN
1998-01-07  17.50  129.56   NaN
1998-01-08  18.19  130.50   NaN
In [56]:
df['AAPL'].plot(legend='AAPL')
df['MSFT'].plot(legend='MSFT')
df['GOOG'].plot(legend='GOOG')
plt.grid()
plt.show()
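A natural next step when comparing stocks is to look at returns, or at prices rebased to a common starting value, rather than raw levels. A sketch using the standard pct_change method, demonstrated on a tiny hypothetical price frame since the downloaded data depends on the API:

```python
import pandas as pd

prices = pd.DataFrame({'AAPL': [100.0, 110.0, 121.0],
                       'MSFT': [50.0, 55.0, 66.0]})

# daily simple returns: (p_t - p_{t-1}) / p_{t-1}
returns = prices.pct_change()

# rebase every series to 1 at the first observation for visual comparison
rebased = prices / prices.iloc[0]
print(returns)
print(rebased)
```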