Pandas Crash Course :: Part 2

After an introduction of the basics of pandas, we will go deep into most common routines at which pandas really outstands, like reading and writing data from and to files. This file follows the presentation of the book \textbf{Python for Data Analysis} by Wes McKinney.

Among the most common tasks that we will perform in this course is to read data from csv files. Although there are other data files format, some with efficiency enhancements, there is a perfect transposition of the file contents to the pandas dataframe where it is stored. There are a lot of options in the use of the read_csv function. These goes along the increased complexity of the files that we have today. Again, our goal here it's not to be exhaustive but to put things working. Please refer to pandas documentation to full understanding of the options and the complexety of the things that will not be considered here.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
# read the contents of a csv file from the command line
!cat example1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,8,10,11,foo

Once we have a csv file we can read it with the function read_csv. This will transfer the contents of the file into a pandas dataframe.

In [3]:
df1 = pd.read_csv('example1.csv')
df1
Out[3]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 8 10 11 foo

Indeed you could achieve the same result by using the function read_table and specifying the separator.

In [4]:
garbage = pd.read_table('example1.csv',sep = ',')
garbage
Out[4]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 8 10 11 foo

We can also specify if there is a header or not. In our case, this will read, what before was a header, as a regular data row and assign the default header titles.

In [5]:
garbage = pd.read_csv('example1.csv',header = None)
garbage
Out[5]:
0 1 2 3 4
0 a b c d message
1 1 2 3 4 hello
2 5 6 7 8 world
3 9 8 10 11 foo

We can correct that reading error by reading the file again or by creating the correct columns values and drop the first row of the dataframe. Do not forget to assign the return of the function to the variable. Otherwise it will remain intact.

In [6]:
print(garbage.columns.values)

garbage.columns = ['a','b','c','d','message']
[0 1 2 3 4]
In [7]:
garbage = garbage.drop([0],axis = 0)
garbage
Out[7]:
a b c d message
1 1 2 3 4 hello
2 5 6 7 8 world
3 9 8 10 11 foo

The assignement of the columns values it is also useful in the case when the csv doesn't have header at first place. This is one of the options, as mentioned before, of the read_csv function.

In [8]:
!cat example2.csv
1,2,3,4,hello
5,6,7,8,world
9,8,10,11,foo
In [9]:
mynames = ['a','b','c','d','message']
garbage = pd.read_csv('example2.csv',header = None , names = mynames)
garbage
Out[9]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 8 10 11 foo

if not stated otherwise, the \textbf{read_csv} will create a default index for the data imported. You can use one of the column as index, even if there was no header in the original csv file.

In [10]:
garbage = pd.read_csv('example2.csv',header = None , names = mynames , index_col = 'message')
garbage
Out[10]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 8 10 11

We can do also what is called a hierarchical index from a chosen set of columns. This results in passing a list of columns to \textbf{index_col} option.

In [11]:
!cat example3.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
In [12]:
garbage = pd.read_csv('example3.csv',index_col = ['key1','key2'])
garbage
Out[12]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

There are some csv files that are not so easy to read because they contain comments or empty rows. One way to avoid latter mistakes is to use \textbf{skiprows} option.

In [13]:
!cat example4.csv
# this is just a comment row

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
In [14]:
garbage = pd.read_csv('example4.csv',index_col = ['key1','key2'],skiprows=[0,1])
garbage
Out[14]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

Writing Data to files

One of the things that comes naturally is, after data processing, to be able to store the data in some \textit{physical} format. Because we are dealing with csv files, the natural choice is to record the data in a csv file. We use the \textbf{to_csv} function to perform this task. The process will be, sequentially, generate data by numpy, store it in a pandas dataframe and send it to a csv file.

In [15]:
# one way to avoid big echo prints is to use the pandas option

pd.options.display.max_rows = 15
In [16]:
data = np.random.randn(300).reshape(60,5)
indix = np.arange(60).reshape(60,1)
data = np.append(indix,data,axis=1)
mynames = ['minute','col_a','col_b','col_c','col_d','col_e']
df2 = pd.DataFrame(data,columns = mynames)

# after, we set the column minute as the true index of the dataframe
df2.index = df2['minute']

# we no longer need it
df2 = df2.drop(['minute'], axis = 1)

df2
Out[16]:
col_a col_b col_c col_d col_e
minute
0.0 -0.229374 0.030345 -0.426607 -0.515123 2.230944
1.0 1.639544 -0.412593 -2.085287 -0.514243 1.124204
2.0 -1.429174 0.126368 -0.990036 0.452222 -0.268924
3.0 -0.056329 0.365542 -0.271714 -0.508785 0.687336
4.0 0.613422 0.947496 1.919967 -1.199429 -1.635038
5.0 -0.565272 1.796550 -1.657000 -1.038738 0.776442
6.0 -0.797080 0.521338 1.168330 1.211121 0.501081
... ... ... ... ... ...
53.0 0.853688 0.713043 1.520395 -0.626525 -1.885747
54.0 1.305621 0.949043 -0.224959 -1.487707 -0.035445
55.0 0.043903 -0.533535 0.514034 0.211439 0.301303
56.0 0.502391 -0.938721 1.242458 0.238170 1.179111
57.0 0.327024 0.947324 0.054283 0.341194 0.019672
58.0 0.892760 0.034955 0.023489 1.783330 -0.962055
59.0 -0.733943 0.305690 -0.530138 1.668690 -0.423688

60 rows × 5 columns

We are now in conditions to save this dataframe to a csv file. We can keep the header and the index from being saved in the file using the options \textbf{index = False} and \textbf{header = False}. In this case we will use the first option. The index can always be reproduced afterwards. In this case we will also specify which columns we need to keep.

In [17]:
df2.to_csv('out.csv', index = False, columns = ['col_a','col_b'])
In [18]:
!cat out.csv
col_a,col_b
-0.2293735515100776,0.030345251878961425
1.639543847540251,-0.4125928285573791
-1.4291742827670946,0.12636822708891512
-0.05632937636655008,0.3655419133103267
0.613421680426147,0.9474963958243164
-0.5652716562868394,1.7965502750601792
-0.7970796686141675,0.5213376941624629
-0.3617457345424345,-0.34632433663932205
0.33660990696853965,1.6827458991360393
-1.0788371253921236,0.18351088602899815
1.6871274227678532,0.39717584117776505
-0.5548281484416193,0.6673025320034086
0.4983098441594164,-0.381255004378754
1.027542956491078,0.5101705939340059
-2.211992650374271,-2.461130267947497
0.17943992588684396,0.1662296421838451
1.278070420193918,-0.9077681322077022
0.5061583810588663,1.714850268956829
0.274963447931571,-0.2845189734595696
-1.1533149171224044,0.8427720179695158
-0.46531889741025617,-0.23749080413264503
0.19271556786650226,0.8055704829943855
0.5325414103215268,-0.08744405483080504
0.5467802307362313,-1.583748431143847
-0.5866022642966356,-1.3289536868968455
1.5871484730054541,0.6878849880265416
0.725344247151721,1.5653121415659557
0.821809470635404,-1.272963239592468
-0.9213168501103384,-0.09239966599436579
0.8902213180957156,1.2013452609550999
0.6890479815497508,0.9371804109578689
-0.3195168821017588,1.1833632690065776
-0.40728435102554644,0.4543209947707928
-1.331773376854744,0.5257165829097902
0.3852958246734987,-0.058685140706960584
-0.5538811772183438,0.3259299341265914
0.7251436481425653,1.974982728059969
1.784962158197302,2.1336081438404793
-0.5176116502038574,1.0803944854374041
-0.3109895326815271,-1.1219629594785272
-1.9703827793992765,-1.29505910198167
-1.0008546016325612,1.285886814531565
-1.5831936176578612,0.9089431947210127
0.48952162358885704,1.724061352606316
-1.9135243634769161,-0.09894247600513466
1.4993133446644458,2.556916589273711
2.8289453058189102,-0.8108898989859673
0.011622643930629327,0.9951246396552607
0.2523683640036218,1.0635629762476744
-0.692723603079341,-0.27689297058598505
1.6990543084174217,1.1551242447687384
-1.2297635044168025,-2.714855070956765
-1.1952579985626663,-0.07949183553188152
0.8536880514265913,0.7130427975390355
1.305621101847628,0.9490427363393642
0.04390266104588277,-0.5335345349895751
0.5023908236008225,-0.9387207455967305
0.3270235949320637,0.947324047023311
0.8927597674815538,0.0349554557010172
-0.7339429912081663,0.30569014846426884

Selection in a DataFrame

The last part of this tutorial is about selection of rows in a dataframe. For that purpose we load a csv file with all the games of the regular season 14/15 of NHL. The source of the data is Hockey Reference

In [19]:
season = pd.read_csv('NHL1415RS.csv')
In [20]:
season
Out[20]:
Date Visitor G Home G.1 Unnamed: 5 Att. LOG Notes
0 2014-10-08 Philadelphia Flyers 1 Boston Bruins 2 NaN 17565 NaN NaN
1 2014-10-08 Vancouver Canucks 4 Calgary Flames 2 NaN 19289 NaN NaN
2 2014-10-08 San Jose Sharks 4 Los Angeles Kings 0 NaN 18514 NaN NaN
3 2014-10-08 Montreal Canadiens 4 Toronto Maple Leafs 3 NaN 19745 NaN NaN
4 2014-10-09 Winnipeg Jets 6 Arizona Coyotes 2 NaN 17125 NaN NaN
5 2014-10-09 Columbus Blue Jackets 3 Buffalo Sabres 1 NaN 18215 NaN NaN
6 2014-10-09 Chicago Blackhawks 3 Dallas Stars 2 SO 18768 NaN NaN
... ... ... ... ... ... ... ... ... ...
1223 2015-04-11 Ottawa Senators 3 Philadelphia Flyers 1 NaN 17027 NaN NaN
1224 2015-04-11 Minnesota Wild 2 St. Louis Blues 4 NaN 19155 NaN NaN
1225 2015-04-11 Boston Bruins 2 Tampa Bay Lightning 3 SO 19204 NaN NaN
1226 2015-04-11 Montreal Canadiens 4 Toronto Maple Leafs 3 SO 19308 NaN NaN
1227 2015-04-11 Edmonton Oilers 5 Vancouver Canucks 6 OT 18870 NaN NaN
1228 2015-04-11 Calgary Flames 1 Winnipeg Jets 5 NaN 15016 NaN NaN
1229 2015-04-11 New York Rangers 4 Washington Capitals 2 NaN 18506 NaN NaN

1230 rows × 9 columns

In [21]:
season.loc[season['Visitor']=='San Jose Sharks']
Out[21]:
Date Visitor G Home G.1 Unnamed: 5 Att. LOG Notes
2 2014-10-08 San Jose Sharks 4 Los Angeles Kings 0 NaN 18514 NaN NaN
46 2014-10-14 San Jose Sharks 6 Washington Capitals 5 SO 18506 NaN NaN
52 2014-10-16 San Jose Sharks 3 New York Islanders 4 SO 11248 NaN NaN
69 2014-10-18 San Jose Sharks 4 New Jersey Devils 2 NaN 16592 NaN NaN
76 2014-10-19 San Jose Sharks 0 New York Rangers 4 NaN 18006 NaN NaN
79 2014-10-21 San Jose Sharks 3 Boston Bruins 5 NaN 17565 NaN NaN
116 2014-10-26 San Jose Sharks 4 Anaheim Ducks 1 NaN 16954 NaN NaN
... ... ... ... ... ... ... ... ... ...
1088 2015-03-23 San Jose Sharks 2 Ottawa Senators 5 NaN 18193 NaN NaN
1104 2015-03-26 San Jose Sharks 6 Detroit Red Wings 4 NaN 20027 NaN NaN
1122 2015-03-28 San Jose Sharks 3 Philadelphia Flyers 2 SO 18783 NaN NaN
1134 2015-03-29 San Jose Sharks 2 Pittsburgh Penguins 3 SO 18620 NaN NaN
1167 2015-04-04 San Jose Sharks 3 Arizona Coyotes 5 NaN 14752 NaN NaN
1204 2015-04-09 San Jose Sharks 3 Edmonton Oilers 1 NaN 16839 NaN NaN
1221 2015-04-11 San Jose Sharks 1 Los Angeles Kings 4 NaN 18230 NaN NaN

41 rows × 9 columns

In [22]:
season.loc[(season['Visitor']=='Boston Bruins') | (season['Home']=='Boston Bruins')]
Out[22]:
Date Visitor G Home G.1 Unnamed: 5 Att. LOG Notes
0 2014-10-08 Philadelphia Flyers 1 Boston Bruins 2 NaN 17565 NaN NaN
7 2014-10-09 Boston Bruins 1 Detroit Red Wings 2 NaN 20027 NaN NaN
18 2014-10-11 Washington Capitals 4 Boston Bruins 0 NaN 17565 NaN NaN
34 2014-10-13 Colorado Avalanche 2 Boston Bruins 1 NaN 17565 NaN NaN
49 2014-10-15 Boston Bruins 3 Detroit Red Wings 2 SO 20027 NaN NaN
51 2014-10-16 Boston Bruins 4 Montreal Canadiens 6 NaN 21287 NaN NaN
64 2014-10-18 Boston Bruins 4 Buffalo Sabres 0 NaN 18685 NaN NaN
... ... ... ... ... ... ... ... ... ...
1128 2015-03-29 Boston Bruins 2 Carolina Hurricanes 1 OT 14275 NaN NaN
1142 2015-03-31 Florida Panthers 2 Boston Bruins 3 NaN 17565 NaN NaN
1155 2015-04-02 Boston Bruins 3 Detroit Red Wings 2 NaN 20027 NaN NaN
1168 2015-04-04 Toronto Maple Leafs 1 Boston Bruins 2 SO 17565 NaN NaN
1201 2015-04-08 Boston Bruins 0 Washington Capitals 3 NaN 18506 NaN NaN
1205 2015-04-09 Boston Bruins 2 Florida Panthers 4 NaN 11778 NaN NaN
1225 2015-04-11 Boston Bruins 2 Tampa Bay Lightning 3 SO 19204 NaN NaN

82 rows × 9 columns

One curiosity is to know how much games went into Overtime or Shutout. We start to change the name of the column to a more indicative name.

In [23]:
season = season.rename(columns = {'Unnamed: 5':'Overtime'})
In [24]:
season.loc[(season['Overtime']=='OT')|(season['Overtime']=='SO')]
Out[24]:
Date Visitor G Home G.1 Overtime Att. LOG Notes
6 2014-10-09 Chicago Blackhawks 3 Dallas Stars 2 SO 18768 NaN NaN
14 2014-10-09 Florida Panthers 2 Tampa Bay Lightning 3 OT 19204 NaN NaN
15 2014-10-09 Montreal Canadiens 2 Washington Capitals 1 SO 18506 NaN NaN
17 2014-10-11 Los Angeles Kings 2 Arizona Coyotes 3 OT 12859 NaN NaN
26 2014-10-11 Montreal Canadiens 4 Philadelphia Flyers 3 SO 19745 NaN NaN
29 2014-10-11 Ottawa Senators 3 Tampa Bay Lightning 2 SO 19204 NaN NaN
31 2014-10-11 Edmonton Oilers 4 Vancouver Canucks 5 SO 18870 NaN NaN
... ... ... ... ... ... ... ... ... ...
1203 2015-04-09 Winnipeg Jets 0 Colorado Avalanche 1 SO 14802 NaN NaN
1206 2015-04-09 Detroit Red Wings 3 Montreal Canadiens 4 OT 21287 NaN NaN
1211 2015-04-09 New Jersey Devils 3 Tampa Bay Lightning 4 OT 19204 NaN NaN
1222 2015-04-11 Columbus Blue Jackets 5 New York Islanders 4 SO 16170 NaN NaN
1225 2015-04-11 Boston Bruins 2 Tampa Bay Lightning 3 SO 19204 NaN NaN
1226 2015-04-11 Montreal Canadiens 4 Toronto Maple Leafs 3 SO 19308 NaN NaN
1227 2015-04-11 Edmonton Oilers 5 Vancouver Canucks 6 OT 18870 NaN NaN

306 rows × 9 columns

Among these, those which finished in Shutout were:

In [25]:
season.loc[season['Overtime']=='SO']
Out[25]:
Date Visitor G Home G.1 Overtime Att. LOG Notes
6 2014-10-09 Chicago Blackhawks 3 Dallas Stars 2 SO 18768 NaN NaN
15 2014-10-09 Montreal Canadiens 2 Washington Capitals 1 SO 18506 NaN NaN
26 2014-10-11 Montreal Canadiens 4 Philadelphia Flyers 3 SO 19745 NaN NaN
29 2014-10-11 Ottawa Senators 3 Tampa Bay Lightning 2 SO 19204 NaN NaN
31 2014-10-11 Edmonton Oilers 4 Vancouver Canucks 5 SO 18870 NaN NaN
38 2014-10-14 Buffalo Sabres 4 Carolina Hurricanes 3 SO 14930 NaN NaN
41 2014-10-14 Calgary Flames 3 Nashville Predators 2 SO 15654 NaN NaN
... ... ... ... ... ... ... ... ... ...
1174 2015-04-04 Detroit Red Wings 3 Minnesota Wild 2 SO 19246 NaN NaN
1184 2015-04-05 Ottawa Senators 2 Toronto Maple Leafs 3 SO 18919 NaN NaN
1189 2015-04-06 Los Angeles Kings 1 Vancouver Canucks 2 SO 18870 NaN NaN
1203 2015-04-09 Winnipeg Jets 0 Colorado Avalanche 1 SO 14802 NaN NaN
1222 2015-04-11 Columbus Blue Jackets 5 New York Islanders 4 SO 16170 NaN NaN
1225 2015-04-11 Boston Bruins 2 Tampa Bay Lightning 3 SO 19204 NaN NaN
1226 2015-04-11 Montreal Canadiens 4 Toronto Maple Leafs 3 SO 19308 NaN NaN

170 rows × 9 columns

In [ ]: