After an introduction to the basics of pandas, we will go deeper into the most common routines at which pandas really excels, such as reading and writing data from and to files. This file follows the presentation of the book \textbf{Python for Data Analysis} by Wes McKinney.
Among the most common tasks that we will perform in this course is reading data from csv files. Although there are other data file formats, some with efficiency enhancements, csv offers a direct transposition of the file contents to the pandas dataframe where they are stored. The read_csv function has a lot of options, which keep pace with the increasing complexity of the files that we have today. Again, our goal here is not to be exhaustive but to get things working. Please refer to the pandas documentation for a full understanding of the options and of the complexities that will not be considered here.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
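The snippets below assume a file example1.csv is present. As a reproducible stand-in, a minimal version can be written out like this (the values are illustrative; the header row matches the column names used later in this tutorial):

```python
import pandas as pd

# Illustrative stand-in for example1.csv: a header row plus three data rows.
# The real file used in the course may contain different values.
csv_text = """a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""
with open('example1.csv', 'w') as f:
    f.write(csv_text)

df = pd.read_csv('example1.csv')
print(df.columns.tolist())  # column names come from the header row
```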
# read the contents of a csv file from the command line
!cat example1.csv
Once we have a csv file we can read it with the function read_csv. This will transfer the contents of the file into a pandas dataframe.
df1 = pd.read_csv('example1.csv')
df1
Indeed you could achieve the same result by using the function read_table and specifying the separator.
garbage = pd.read_table('example1.csv',sep = ',')
garbage
We can also specify whether there is a header or not. In our case, this will read what before was the header as a regular data row and assign the default column names.
garbage = pd.read_csv('example1.csv',header = None)
garbage
We can correct that reading error by reading the file again, or by assigning the correct column names and dropping the first row of the dataframe. Do not forget to assign the return value of the function back to the variable; otherwise the dataframe will remain unchanged.
print(garbage.columns.values)
garbage.columns = ['a','b','c','d','message']
garbage = garbage.drop([0],axis = 0)
garbage
The assignment of the column names is also useful when the csv file has no header in the first place. This is one of the options, as mentioned before, of the read_csv function.
!cat example2.csv
mynames = ['a','b','c','d','message']
garbage = pd.read_csv('example2.csv',header = None , names = mynames)
garbage
If not stated otherwise, \textbf{read_csv} will create a default index for the imported data. You can use one of the columns as the index, even if there was no header in the original csv file.
garbage = pd.read_csv('example2.csv',header = None , names = mynames , index_col = 'message')
garbage
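An equivalent route is to read the file first and promote a column to the index afterwards with the standard set_index method. A sketch, using a small inline csv in place of example2.csv:

```python
import pandas as pd
from io import StringIO

# A small inline csv standing in for example2.csv (no header row)
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
mynames = ['a', 'b', 'c', 'd', 'message']

garbage = pd.read_csv(StringIO(raw), header=None, names=mynames)
# set_index moves the chosen column into the index in one step
garbage = garbage.set_index('message')
print(garbage.index.tolist())
```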
We can also build what is called a hierarchical index from a chosen set of columns. This is done by passing a list of columns to the \textbf{index_col} option.
!cat example3.csv
garbage = pd.read_csv('example3.csv',index_col = ['key1','key2'])
garbage
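Once a hierarchical index is in place, \textbf{loc} can select by the outer level alone or by the full key pair. A self-contained sketch (with made-up keys and values, since the contents of example3.csv are not shown here):

```python
import pandas as pd
from io import StringIO

# Made-up data with the same two-key layout as example3.csv
raw = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,5,6
two,b,7,8
"""
df = pd.read_csv(StringIO(raw), index_col=['key1', 'key2'])

print(df.loc['one'])         # all rows under the outer key 'one'
print(df.loc[('two', 'b')])  # a single row addressed by the full pair
```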
There are some csv files that are not so easy to read because they contain comments or empty rows. One way to avoid such reading errors is to use the \textbf{skiprows} option.
!cat example4.csv
garbage = pd.read_csv('example4.csv',index_col = ['key1','key2'],skiprows=[0,1])
garbage
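When the unwanted lines are marked with a comment character, the \textbf{comment} option of read_csv can drop them without counting line numbers by hand. A minimal sketch on an inline csv:

```python
import pandas as pd
from io import StringIO

raw = """# this line is a comment
a,b,message
# another comment in the middle
1,2,hello
3,4,world
"""
# lines that start with '#' are ignored altogether by the parser
df = pd.read_csv(StringIO(raw), comment='#')
print(len(df))
```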
One of the things that comes naturally after data processing is to be able to store the data in some \textit{physical} format. Because we are dealing with csv files, the natural choice is to record the data in a csv file. We use the \textbf{to_csv} function to perform this task. The process will be, sequentially, to generate data with numpy, store it in a pandas dataframe and write it to a csv file.
# one way to avoid big echo prints is to use the pandas option
pd.options.display.max_rows = 15
data = np.random.randn(300).reshape(60,5)
minutes = np.arange(60).reshape(60,1)
data = np.append(minutes,data,axis=1)
mynames = ['minute','col_a','col_b','col_c','col_d','col_e']
df2 = pd.DataFrame(data,columns = mynames)
# after, we set the column minute as the true index of the dataframe
df2.index = df2['minute']
# we no longer need it
df2 = df2.drop(['minute'], axis = 1)
df2
We are now in a position to save this dataframe to a csv file. We can keep the header and the index from being saved in the file using the options \textbf{index = False} and \textbf{header = False}. In this case we will use the first option; the index can always be reproduced afterwards. We will also specify which columns we want to keep.
df2.to_csv('out.csv', index = False, columns = ['col_a','col_b'])
!cat out.csv
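For completeness, a sketch of the \textbf{header = False} variant, which drops the column names from the output file as well. A small dataframe stands in for df2 here:

```python
import numpy as np
import pandas as pd

# a small stand-in dataframe
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['col_a', 'col_b'])

# neither the index nor the column names are written to the file
df.to_csv('out_noheader.csv', index=False, header=False)

with open('out_noheader.csv') as f:
    print(f.read())  # raw values only
```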
The last part of this tutorial is about the selection of rows in a dataframe. For that purpose, we load a csv file with all the games of the 2014/15 NHL regular season. The source of the data is Hockey Reference.
season = pd.read_csv('NHL1415RS.csv')
season
season.loc[season['Visitor']=='San Jose Sharks']
season.loc[(season['Visitor']=='Boston Bruins') | (season['Home']=='Boston Bruins')]
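When several teams are of interest at once, chaining | conditions gets verbose; the \textbf{isin} method expresses the same selection more compactly. A sketch on a toy schedule (made-up rows, same column names as the season dataframe):

```python
import pandas as pd

# toy schedule with the same column names as the season dataframe
toy = pd.DataFrame({
    'Visitor': ['Boston Bruins', 'San Jose Sharks', 'Montreal Canadiens'],
    'Home':    ['Montreal Canadiens', 'Boston Bruins', 'Ottawa Senators'],
})
teams = ['Boston Bruins', 'San Jose Sharks']

# rows where either side belongs to the chosen set of teams
mask = toy['Visitor'].isin(teams) | toy['Home'].isin(teams)
print(toy.loc[mask])
```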
One curiosity is to know how many games went to overtime or to a shootout. We start by renaming the column to a more indicative name.
season = season.rename(columns = {'Unnamed: 5':'Overtime'})
season.loc[(season['Overtime']=='OT')|(season['Overtime']=='SO')]
Among these, those which finished in a shootout were:
season.loc[season['Overtime']=='SO']