After the crash course on NumPy, we continue with another package used to manipulate data: pandas. It follows the same spirit as NumPy, avoiding explicit for loops through vectorized functions. The big difference is that its main data structure, the DataFrame, typically holds heterogeneous data such as numbers and strings. This file follows the presentation in the book \textbf{Python for Data Analysis} by Wes McKinney.
# the usual way to import pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from alpha_vantage.timeseries import TimeSeries
plt.rcParams['figure.figsize'] = (20.0, 10.0)
'''
There are two basic data structures in pandas: Series (a one-dimensional labeled array, often used for time series) and DataFrame (a two-dimensional labeled table, often used for panel data).
We start with the Series.
'''
obj = pd.Series([1,2,4,8.9])
'''
Basically, pd.Series turns a list into a one-dimensional array with an index attached.
'''
obj
# we can split index and values of the Series by
print(obj.values,end='\n\n')
print(obj.index)
# you can give explicit index values
obj2 = pd.Series([1,3.4,4],index = ['a','b','c'])
obj2
# elements are selected much as in NumPy: by label, by a list of labels, or by a boolean mask
obj2['a']
obj[[0,1]]
obj2[obj2>1.0]
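The selections above mix labels and positions. A minimal sketch of how the two can be kept apart with `.loc` (labels) and `.iloc` (positions); the variable names are only illustrative:

```python
import pandas as pd

s = pd.Series([1, 3.4, 4], index=['a', 'b', 'c'])

by_label = s.loc['b']          # lookup by index label
by_position = s.iloc[1]        # lookup by integer position (same element here)
subset = s.loc[['a', 'c']]     # a list of labels returns a Series
masked = s[s > 1.0]            # a boolean mask keeps entries where it is True
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index itself is made of integers.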
# make some vectorized operations
obj3 = obj2 *3 +1
obj3
# and apply numpy functions to it
print(np.log(obj3))
# another useful thing is to transform a Python dictionary into a series
sdata = {'Lisbon' : 10 , 'Porto' : 7 , 'Coimbra' : 2}
obj4 = pd.Series(sdata)
obj4
# we can pass an explicit index; labels that are missing from the dictionary get NaN values
cities = ['Lisbon','Porto','Coimbra','Faro']
obj4 = pd.Series(sdata,index = cities)
obj4
# it is particularly important to identify the entries that are NaN
pd.isnull(obj4)
# following the object-oriented principle, obj4 is itself a Series, so the same check is available as a method
obj4.isnull()
# index alignment is very useful in the treatment of data: entries are matched by label, and labels without a value on both sides give NaN
newcities = ['Lisbon','Coimbra','Beja']
newsdata = {'Coimbra' : -1,'Lisbon' : 19}
obj5 = pd.Series(newsdata, index=newcities)
obj4+obj5
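A small sketch of what label alignment does when two Series are added, using made-up values. The `add` method with `fill_value` is shown as one possible way to avoid NaN for labels present on only one side:

```python
import numpy as np
import pandas as pd

a = pd.Series({'Lisbon': 10, 'Porto': 7, 'Coimbra': 2})
b = pd.Series({'Coimbra': -1, 'Lisbon': 19, 'Beja': 5})

# Entries are matched by label, not by position.
total = a + b                    # 'Porto' and 'Beja' exist on one side only -> NaN
filled = a.add(b, fill_value=0)  # treat a missing label as 0 instead
```

The result is indexed by the union of the two indexes, so no label is silently dropped.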
# there are several alternative ways to create a DataFrame. As we will see later, reading from files is the most common
# here we will build one from a dictionary of equal-length lists
data = {'city':['Lisbon','Porto','Boston'],'uppt':[4,3,10],'downpt':[-1,-2,-1]}
frame1 = pd.DataFrame(data)
frame1
# we can choose the column order we prefer
pd.DataFrame(data,columns = ['uppt','downpt','city'])
# it is not a problem to request a column that is not in the data: it is created and filled with NaN
frame2 = pd.DataFrame(data,columns=['uppt','downpt','city','visit'])
frame2
# we can retrieve a column using dict-like notation
frame2['uppt']
# or using the attribute
frame2.uppt
# this is perhaps more useful when we think about the link between pandas and NumPy
mat1 = np.array([frame2.uppt,frame2.downpt])
print(mat1,end='\n\n')
mat1.mean(axis=0)
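Stacking the columns with `np.array([...])` as above gives one row per column, that is an array of shape (2, 3). As an alternative sketch, `to_numpy()` keeps the DataFrame layout (one row per observation), which makes the meaning of `axis` easier to follow:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'uppt': [4, 3, 10], 'downpt': [-1, -2, -1]})

mat = frame[['uppt', 'downpt']].to_numpy()  # shape (3, 2): rows are observations
col_means = mat.mean(axis=0)                # one mean per column
row_means = mat.mean(axis=1)                # one mean per row
```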
# rows are accessed by label through the loc attribute
frame2.loc[1]
# we can create a new column just by assigning to a new name
frame2['bias']=1
# it is possible to reorder columns and create new, empty ones
frame2 = pd.DataFrame(frame2,columns = ['bias','city','uppt','downpt','visit','sports'])
frame2
# more useful is to create a new column from a Series with its own index: rows without a matching label get NaN
tmp = pd.Series([-100],index=[0])
frame2['newcol'] = tmp
frame2
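To make the alignment rule for column assignment concrete, here is a minimal sketch with illustrative values: the Series only has labels 0 and 2, so row 1 is left as NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'city': ['Lisbon', 'Porto', 'Boston']})

# The Series is aligned to the DataFrame index before being stored.
partial = pd.Series([-100, 50], index=[0, 2])
frame['newcol'] = partial
```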
# the method to delete columns
del frame2['sports']
frame2
# the same thing can be achieved with the drop method, which returns a new DataFrame
frame2.drop(['visit'],axis=1)
# we can also drop rows in a similar fashion
print(frame2.drop([1],axis=0))
# notice the DataFrame did not change after this operation
frame2
# to make the change effective we have to assign the result
frame2 = frame2.drop([1],axis=0)
frame2
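A short sketch of the point that `drop` returns a new object and leaves the original untouched. The `columns=` and `index=` keywords are used here as a more explicit alternative to `axis`:

```python
import pandas as pd

frame = pd.DataFrame({'city': ['Lisbon', 'Porto', 'Boston'],
                      'uppt': [4, 3, 10]})

without_col = frame.drop(columns=['uppt'])  # same as drop(['uppt'], axis=1)
without_row = frame.drop(index=[1])         # same as drop([1], axis=0)
# frame itself still has three rows and both columns
```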
# the inverse operation, adding a row, is done with pd.concat (DataFrame.append was removed in pandas 2.0)
new_row = pd.DataFrame([{'city':'NY'}])
pd.concat([frame2,new_row],ignore_index=True)
# notice that frame2 didn't change (see below)
frame2
# we can assign a constant value or the values of a np.array to a column
frame2 = pd.concat([frame2,new_row],ignore_index = True)
frame2['visit'] = 3
frame2
# a very useful method is dropna. By default it returns only the rows of a DataFrame that have no NaN entries.
frame2.dropna()
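`dropna` has a few options worth knowing. The sketch below, on a small made-up table, shows the default behaviour next to `how='all'`, `axis=1`, and the related `fillna`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 6.0],
                      'c': [7.0, 8.0, 9.0]})

rows_complete = frame.dropna()             # keep rows with no NaN at all
rows_not_empty = frame.dropna(how='all')   # drop only rows that are entirely NaN
cols_complete = frame.dropna(axis=1)       # keep columns with no NaN
filled = frame.fillna(0)                   # replace NaN instead of dropping
```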
# we generate a DataFrame from data in a np.array in the usual way
data = np.random.randn(20).reshape(10,2)
frame3 = pd.DataFrame(data,columns = ['set_a','set_b'])
frame3
# we select the first 4 rows of one of the columns of a DataFrame with
frame3['set_a'][0:4]
# by default the sums are made along the columns
frame3.sum()
# the usual method to sum along the rows
frame3.sum(axis=1)
# we add a row with only set_a filled in, which creates a NaN in set_b
frame3 = pd.concat([frame3,pd.DataFrame([{'set_a':0}])],ignore_index = True)
frame3
# by default NaN entries are skipped,
# which makes it possible to return a value for the sum even when there are NaN entries.
print(frame3.sum(),end='\n\n')
# with skipna=False the sum of a column that contains a NaN is NaN.
print(frame3.sum(skipna=False))
print(frame3.sum(skipna=False))
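The effect of `skipna` is easier to see on a tiny Series with known values. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

total_skip = s.sum()                # NaN ignored
total_strict = s.sum(skipna=False)  # NaN propagates to the result
mean_skip = s.mean()                # mean of the two valid entries only
```

Note that the mean is computed over the non-missing entries, so it divides by 2 here, not by 3.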
# in a similar fashion we can compute the cumulative sum of one of the columns
frame3['set_a'].cumsum()
# cumsum also skips NaN by default: the NaN entry itself stays NaN in the result
frame3['set_b'].cumsum()
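For cumulative sums the difference between the two `skipna` settings only shows up after the first NaN. A sketch with a NaN placed in the middle of an illustrative Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

running = s.cumsum()                     # NaN stays NaN, the sum carries on after it
running_strict = s.cumsum(skipna=False)  # everything from the first NaN on is NaN
```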
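# the mean of each column (NaN skipped by default)
frame3.mean()
# the index label at which each column attains its minimum
frame3.idxmin()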
# to get a general picture
frame3.describe()
There are many places where you can collect data online. The API depends on the source you are downloading from. Here we focus on one specific source, Alpha Vantage, and on the package alpha_vantage, a wrapper for interacting with this database. To download the package run
\textit{git clone https://github.com/RomelTorres/alpha_vantage.git}
To install alpha_vantage with conda, run on the command line:
\textit{conda install alpha_vantage -c hoishing}
Examples and documentation for the library can be found in AV Example by Romel Torres.
In order to run the examples below you need to get a free API key from the Alpha Vantage website and store it as an environment variable with
\textit{export AV_API_KEY=yourkey}
in your bash_profile. To find the ticker symbols use this link.
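Reading the key with `os.environ['AV_API_KEY']` raises a bare `KeyError` when the variable is missing. A small sketch of a friendlier lookup; the helper name `get_api_key` is hypothetical and not part of alpha_vantage:

```python
import os

def get_api_key(name="AV_API_KEY"):
    """Return the API key stored in the environment, or fail with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            f"add 'export {name}=yourkey' to your shell profile."
        )
    return key
```

The result can then be passed on, for instance as `TimeSeries(key=get_api_key(), output_format='pandas')`.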
# this creates a TimeSeries object with the API key stored in the environment and pandas as the output format
ts = TimeSeries(key=os.environ['AV_API_KEY'], output_format = 'pandas')
# this requests the intraday (1 min) prices of a stock
data, meta_data = ts.get_intraday(symbol='JPM',interval='1min', outputsize='full')
# We can describe it
data.describe()
data['4. close'].plot()
plt.title('Intraday Time Series for the JP Morgan stock (1 min)')
plt.grid()
plt.show()
data2, meta_data2 = ts.get_daily(symbol='MSFT', outputsize='full')
# We can describe it
data2.describe()
data2['4. close'].plot()
plt.grid()
plt.title('Historical closing prices of Microsoft.')
plt.show()
# we run the same commands for Apple
data3 , metadata3 = ts.get_daily(symbol='AAPL',outputsize='full')
data3.describe()
data4 , metadata = ts.get_daily(symbol='GOOG',outputsize='full')
data_total = {'AAPL': data3['4. close'] , 'MSFT': data2['4. close'], 'GOOG': data4['4. close']}
df = pd.DataFrame(data_total)
df.columns
df.head()
# legend is a boolean flag; the label shown is the column name
df['AAPL'].plot(legend=True)
df['MSFT'].plot(legend=True)
df['GOOG'].plot(legend=True)
plt.grid()
plt.show()
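Building a DataFrame from several price Series aligns them on their dates, so stocks with different histories produce NaN where one of them has no data. The sketch below uses synthetic, made-up prices and hypothetical tickers (`AAA`, `BBB`) so that it runs without an API key:

```python
import pandas as pd

# Synthetic closing prices on overlapping, but not identical, date ranges.
dates_a = pd.date_range("2024-01-01", periods=5, freq="D")
dates_b = pd.date_range("2024-01-03", periods=5, freq="D")
close_a = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0], index=dates_a)
close_b = pd.Series([20.0, 19.5, 19.8, 20.4, 20.1], index=dates_b)

# The DataFrame index is the union of the dates; gaps become NaN.
prices = pd.DataFrame({"AAA": close_a, "BBB": close_b})

# Keep only the dates where every ticker has a price.
common = prices.dropna()

# Rebase each column to 1.0 on the first common date to compare on one scale.
rebased = common / common.iloc[0]
```

Calling `rebased.plot()` would then draw all columns on a single set of axes with a legend, which avoids one plot call per ticker.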