Reading, Writing CSV files & Basic Data Analysis without Pandas
Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.
- mpg : miles per gallon
- class : car classification
- cty : city mpg
- cyl : # of cylinders
- displ : engine displacement in liters
- drv : f = front-wheel drive, r = rear wheel drive, 4 = 4wd
- fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
- hwy : highway mpg
- manufacturer : automobile manufacturer
- model : model of car
- trans : type of transmission
- year : model year
Reading a CSV file ¶
The datafile mpg.csv is located in my GitLab public repository at https://gitlab.com/datasets_a/mpg-data. The URL for accessing the raw file is https://gitlab.com/datasets_a/mpg-data/-/raw/master/mpg.csv
In this example, we will be using Python's native csv module to read and write CSV files. The csv module offers the following functions to read/write files:
- Reading
- csv.reader
- csv.DictReader
- Writing
- csv.writer
- csv.DictWriter
If you are interested in different ways of reading CSV files from GitLab, especially using the python-gitlab package, I tried to illustrate them here
csv.reader¶
import csv, requests
from io import StringIO
url = "https://gitlab.com/datasets_a/mpg-data/-/raw/master/mpg.csv"
# Reading URL data
s = requests.get(url).content
# Decoding and formatting
csvdata = list(csv.reader(StringIO(s.decode('utf-8'))))
# Printing the first 5 rows
for row in csvdata[:5]:
print(row)
csv.DictReader¶
The data can also be read into a Python dictionary format using the csv.DictReader function.
dictdata = list(csv.DictReader(StringIO(s.decode('utf-8'))))
# Printing the first 2 rows
dictdata[:2]
[^top]
filepath='../../files/py/007/mpg.csv'
with open(filepath,'w',newline='') as f:
# Creating a csv writer object
fwriter=csv.writer(f)
# writing data
fwriter.writerows(csvdata)
Screenshot of the output file:

In case, the fields and records are not available on one place, they can be written seperately. Let's seperate the fields and records from csvdata and write them to a file.
fields=csvdata[0] # The first element of the list
print(fields)
records=csvdata[1:] # This is rest of the data
#printing the first 5 records
for record in records[:5]:
print(record)
# Writing to a csv file
with open(filepath,'w',newline='') as f:
# Creating a csv writer object
fwriter=csv.writer(f)
# Writing the fields
fwriter.writerow(fields)
# writing data (records)
fwriter.writerows(records)
Screenshot of the output file:

csv.DictWriter¶
csv.DictWriter operates like a regular writer but maps dictionaries onto output rows.
fields=list(dictdata[0].keys())
print(fields)
with open(filepath,'w',newline='') as f:
# Creating a csv DictWriter object
fwriter=csv.DictWriter(f,fieldnames=fields)
# Writing fields
fwriter.writeheader()
# writing records
fwriter.writerows(dictdata)
Screenshot of the output file:

[^top]
Basic Data Analysis ¶
import csv
with open('../../files/py/007/mpg.csv') as csvfile:
mpg = list(csv.DictReader(csvfile))
mpg[:2]
csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is comprised of 234 dictionaries.
len(mpg)
*keys* gives us the column names of our csv.
mpg[0].keys()
This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.
total_mpg=0
for record in mpg:
total_mpg+=float(record['cty'])
avg_mpg=total_mpg/len(mpg)
print(avg_mpg)
The same can be performed using list comprehension.
avg_mpg=sum(float(record['cty']) for record in mpg) / len(mpg)
print(avg_mpg)
The number of decimal points displayed in the output can be controlled by the magic function %precision
. Let's limit the precision to two decimals. Note that this magic function does not work if we use the print
function.
%precision 2
avg_mpg
Similarly this is how to find the average hwy fuel economy across all cars.
sum(float(d['hwy']) for d in mpg) / len(mpg)
Use `set` to return the unique values for the number of cylinders the cars in our dataset have.
cylinders = set(d['cyl'] for d in mpg)
print(cylinders)
Here's a more complex example where we are grouping the cars by number of cylinder, and finding the average cty mpg for each group.
CtyMpgByCyl = []
for c in cylinders: # iterate over all the cylinder levels
summpg = 0
cyltypecount = 0
for d in mpg: # iterate over all dictionaries
if d['cyl'] == c: # if the cylinder level type matches,
summpg += float(d['cty']) # add the cty mpg
cyltypecount += 1 # increment the count
CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')
CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl
Use `set` to return the unique values for the class types in our dataset.
vehicleclass = set(d['class'] for d in mpg) # what are the class types
vehicleclass
And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.
HwyMpgByClass = []
for t in vehicleclass: # iterate over all the vehicle classes
summpg = 0
vclasscount = 0
for d in mpg: # iterate over all dictionaries
if d['class'] == t: # if the cylinder amount type matches,
summpg += float(d['hwy']) # add the hwy mpg
vclasscount += 1 # increment the count
HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')
HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass
[^top]
Last updated 2020-12-07 18:23:55.089159 IST
Comments