After an introduction to the basics of pandas, we will go deeper into the most common routines at which pandas really excels, such as reading and writing data from and to files. This file follows the presentation of the book \textbf{Python for Data Analysis} by Wes McKinney.
Among the most common tasks that we will perform in this course is reading data from csv files. Although there are other data file formats, some with efficiency enhancements, csv offers a direct transposition of the file contents to the pandas dataframe where they are stored. The read_csv function has a lot of options, which keep pace with the increasing complexity of the files that we have today. Again, our goal here is not to be exhaustive but to get things working. Please refer to the pandas documentation for a full understanding of the options and of the complexities that will not be considered here.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
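The snippets below assume a file example1.csv is present. As a reproducible stand-in, a minimal version can be written out like this (the values are illustrative; the header row matches the column names used later in this tutorial):

```python
import pandas as pd

# Illustrative stand-in for example1.csv: a header row plus three data rows.
# The real file used in the course may contain different values.
csv_text = """a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""
with open('example1.csv', 'w') as f:
    f.write(csv_text)

df = pd.read_csv('example1.csv')
print(df.columns.tolist())  # column names come from the header row
```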
# read the contents of a csv file from the command line
!cat example1.csv
Once we have a csv file we can read it with the function read_csv. This will transfer the contents of the file into a pandas dataframe.
df1 = pd.read_csv('example1.csv')
df1
Indeed you could achieve the same result by using the function read_table and specifying the separator.
garbage = pd.read_table('example1.csv',sep = ',')
garbage
We can also specify whether there is a header or not. In our case, this will read what before was the header as a regular data row and assign the default column names.
garbage = pd.read_csv('example1.csv',header = None)
garbage
We can correct that reading error by reading the file again, or by assigning the correct column names and dropping the first row of the dataframe. Do not forget to assign the return value of the function back to the variable; otherwise the dataframe will remain unchanged.
print(garbage.columns.values)
garbage.columns = ['a','b','c','d','message']
garbage = garbage.drop([0],axis = 0)
garbage
The assignment of the column names is also useful when the csv file has no header in the first place. This is one of the options, as mentioned before, of the read_csv function.
!cat example2.csv
mynames = ['a','b','c','d','message']
garbage = pd.read_csv('example2.csv',header = None , names = mynames)
garbage
If not stated otherwise, \textbf{read_csv} will create a default index for the imported data. You can use one of the columns as the index, even if there was no header in the original csv file.
garbage = pd.read_csv('example2.csv',header = None , names = mynames , index_col = 'message')
garbage
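An equivalent route is to read the file first and promote a column to the index afterwards with the standard set_index method. A sketch, using a small inline csv in place of example2.csv:

```python
import pandas as pd
from io import StringIO

# A small inline csv standing in for example2.csv (no header row)
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
mynames = ['a', 'b', 'c', 'd', 'message']

garbage = pd.read_csv(StringIO(raw), header=None, names=mynames)
# set_index moves the chosen column into the index in one step
garbage = garbage.set_index('message')
print(garbage.index.tolist())
```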
We can also build what is called a hierarchical index from a chosen set of columns. This is done by passing a list of columns to the \textbf{index_col} option.
!cat example3.csv
garbage = pd.read_csv('example3.csv',index_col = ['key1','key2'])
garbage
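Once a hierarchical index is in place, \textbf{loc} can select by the outer level alone or by the full key pair. A self-contained sketch (with made-up keys and values, since the contents of example3.csv are not shown here):

```python
import pandas as pd
from io import StringIO

# Made-up data with the same two-key layout as example3.csv
raw = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,5,6
two,b,7,8
"""
df = pd.read_csv(StringIO(raw), index_col=['key1', 'key2'])

print(df.loc['one'])         # all rows under the outer key 'one'
print(df.loc[('two', 'b')])  # a single row addressed by the full pair
```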
There are some csv files that are not so easy to read because they contain comments or empty rows. One way to avoid such reading errors is to use the \textbf{skiprows} option.
!cat example4.csv
garbage = pd.read_csv('example4.csv',index_col = ['key1','key2'],skiprows=[0,1])
garbage
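When the unwanted lines are marked with a comment character, the \textbf{comment} option of read_csv can drop them without counting line numbers by hand. A minimal sketch on an inline csv:

```python
import pandas as pd
from io import StringIO

raw = """# this line is a comment
a,b,message
# another comment in the middle
1,2,hello
3,4,world
"""
# lines that start with '#' are ignored altogether by the parser
df = pd.read_csv(StringIO(raw), comment='#')
print(len(df))
```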
One of the things that comes naturally after data processing is to be able to store the data in some \textit{physical} format. Because we are dealing with csv files, the natural choice is to record the data in a csv file. We use the \textbf{to_csv} function to perform this task. The process will be, sequentially, to generate data with numpy, store it in a pandas dataframe and write it to a csv file.
# one way to avoid big echo prints is to use the pandas option
pd.options.display.max_rows = 15
data = np.random.randn(300).reshape(60,5)
minutes = np.arange(60).reshape(60,1)
data = np.append(minutes,data,axis=1)
mynames = ['minute','col_a','col_b','col_c','col_d','col_e']
df2 = pd.DataFrame(data,columns = mynames)
# after, we set the column minute as the true index of the dataframe
df2.index = df2['minute']
# we no longer need it
df2 = df2.drop(['minute'], axis = 1)
df2
We are now in a position to save this dataframe to a csv file. We can keep the header and the index from being saved in the file using the options \textbf{index = False} and \textbf{header = False}. In this case we will use the first option; the index can always be reproduced afterwards. We will also specify which columns we want to keep.
df2.to_csv('out.csv', index = False, columns = ['col_a','col_b'])
!cat out.csv
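For completeness, a sketch of the \textbf{header = False} variant, which drops the column names from the output file as well. A small dataframe stands in for df2 here:

```python
import numpy as np
import pandas as pd

# a small stand-in dataframe
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['col_a', 'col_b'])

# neither the index nor the column names are written to the file
df.to_csv('out_noheader.csv', index=False, header=False)

with open('out_noheader.csv') as f:
    print(f.read())  # raw values only
```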
The last part of this tutorial is about the selection of rows in a dataframe. For that purpose, we load a csv file with all the games of the 2014/15 NHL regular season. The source of the data is Hockey Reference.
season = pd.read_csv('NHL1415RS.csv')
season
season.loc[season['Visitor']=='San Jose Sharks']
season.loc[(season['Visitor']=='Boston Bruins') | (season['Home']=='Boston Bruins')]
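When several teams are of interest at once, chaining | conditions gets verbose; the \textbf{isin} method expresses the same selection more compactly. A sketch on a toy schedule (made-up rows, same column names as the season dataframe):

```python
import pandas as pd

# toy schedule with the same column names as the season dataframe
toy = pd.DataFrame({
    'Visitor': ['Boston Bruins', 'San Jose Sharks', 'Montreal Canadiens'],
    'Home':    ['Montreal Canadiens', 'Boston Bruins', 'Ottawa Senators'],
})
teams = ['Boston Bruins', 'San Jose Sharks']

# rows where either side belongs to the chosen set of teams
mask = toy['Visitor'].isin(teams) | toy['Home'].isin(teams)
print(toy.loc[mask])
```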
One curiosity is to know how many games went to overtime or to a shootout. We start by renaming the column to a more indicative name.
season = season.rename(columns = {'Unnamed: 5':'Overtime'})
season.loc[(season['Overtime']=='OT')|(season['Overtime']=='SO')]
Among these, those which finished in a shootout were:
season.loc[season['Overtime']=='SO']