Reading, Writing CSV files & Basic Data Analysis without Pandas

wordcloud

Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

  • mpg : miles per gallon
  • class : car classification
  • cty : city mpg
  • cyl : # of cylinders
  • displ : engine displacement in liters
  • drv : f = front-wheel drive, r = rear wheel drive, 4 = 4wd
  • fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
  • hwy : highway mpg
  • manufacturer : automobile manufacturer
  • model : model of car
  • trans : type of transmission
  • year : model year

Reading a CSV file

The datafile mpg.csv is located in my GitLab public repository at https://gitlab.com/datasets_a/mpg-data. The URL for accessing the raw file is https://gitlab.com/datasets_a/mpg-data/-/raw/master/mpg.csv

In this example, we will be using Python's native csv module to read and write CSV files. The csv module offers the following functions to read/write files:

  • Reading
    • csv.reader
    • csv.DictReader
  • Writing
    • csv.writer
    • csv.DictWriter

If you are interested in different ways of reading CSV files from GitLab, especially using the python-gitlab package, I tried to illustrate them here

csv.reader
In [1]:
import csv, requests
from io import StringIO

url = "https://gitlab.com/datasets_a/mpg-data/-/raw/master/mpg.csv"
# Reading URL data
s = requests.get(url).content
# Decoding and formatting
csvdata = list(csv.reader(StringIO(s.decode('utf-8'))))
# Printing the first 5 rows
for row in csvdata[:5]:
    print(row)
['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']
['1', 'audi', 'a4', '1.8', '1999', '4', 'auto(l5)', 'f', '18', '29', 'p', 'compact']
['2', 'audi', 'a4', '1.8', '1999', '4', 'manual(m5)', 'f', '21', '29', 'p', 'compact']
['3', 'audi', 'a4', '2', '2008', '4', 'manual(m6)', 'f', '20', '31', 'p', 'compact']
['4', 'audi', 'a4', '2', '2008', '4', 'auto(av)', 'f', '21', '30', 'p', 'compact']
csv.DictReader

The data can also be read into a Python dictionary format using the csv.DictReader function.

In [2]:
dictdata = list(csv.DictReader(StringIO(s.decode('utf-8'))))
# Printing the first 2 rows
dictdata[:2]
Out[2]:
[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'}]

[^top]

Writing a CSV file

csv.writer
In [3]:
filepath='../../files/py/007/mpg.csv'
In [4]:
with open(filepath,'w',newline='') as f:
    # Creating a csv writer object
    fwriter=csv.writer(f)
    # writing data
    fwriter.writerows(csvdata)


Screenshot of the output file:

output


In case, the fields and records are not available on one place, they can be written seperately. Let's seperate the fields and records from csvdata and write them to a file.

In [5]:
fields=csvdata[0] # The first element of the list
print(fields)
['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']
In [6]:
records=csvdata[1:] # This is rest of the data
#printing the first 5 records
for record in records[:5]:
    print(record)
['1', 'audi', 'a4', '1.8', '1999', '4', 'auto(l5)', 'f', '18', '29', 'p', 'compact']
['2', 'audi', 'a4', '1.8', '1999', '4', 'manual(m5)', 'f', '21', '29', 'p', 'compact']
['3', 'audi', 'a4', '2', '2008', '4', 'manual(m6)', 'f', '20', '31', 'p', 'compact']
['4', 'audi', 'a4', '2', '2008', '4', 'auto(av)', 'f', '21', '30', 'p', 'compact']
['5', 'audi', 'a4', '2.8', '1999', '6', 'auto(l5)', 'f', '16', '26', 'p', 'compact']
In [7]:
# Writing to a csv file
with open(filepath,'w',newline='') as f:
    # Creating a csv writer object
    fwriter=csv.writer(f)
    # Writing the fields
    fwriter.writerow(fields)
    # writing data (records)
    fwriter.writerows(records)


Screenshot of the output file:

output
csv.DictWriter

csv.DictWriter operates like a regular writer but maps dictionaries onto output rows.

In [8]:
fields=list(dictdata[0].keys())
print(fields)
['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']
In [9]:
with open(filepath,'w',newline='') as f:
    # Creating a csv DictWriter object
    fwriter=csv.DictWriter(f,fieldnames=fields)
    # Writing fields
    fwriter.writeheader()
    # writing records
    fwriter.writerows(dictdata)


Screenshot of the output file:

output

[^top]

Basic Data Analysis

In [10]:
import csv 

with open('../../files/py/007/mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))
    
mpg[:2]
Out[10]:
[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'}]


csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is comprised of 234 dictionaries.

In [11]:
len(mpg)
Out[11]:
234

*keys* gives us the column names of our csv.
In [12]:
mpg[0].keys()
Out[12]:
dict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.
In [13]:
total_mpg=0
for record in mpg:
    total_mpg+=float(record['cty'])
avg_mpg=total_mpg/len(mpg)
print(avg_mpg)
16.858974358974358


The same can be performed using list comprehension.

In [14]:
avg_mpg=sum(float(record['cty']) for record in mpg) / len(mpg)
print(avg_mpg)
16.858974358974358


The number of decimal points displayed in the output can be controlled by the magic function %precision. Let's limit the precision to two decimals. Note that this magic function does not work if we use the print function.

In [15]:
%precision 2
avg_mpg
Out[15]:
16.86

Similarly this is how to find the average hwy fuel economy across all cars.
In [16]:
sum(float(d['hwy']) for d in mpg) / len(mpg)
Out[16]:
23.44

Use `set` to return the unique values for the number of cylinders the cars in our dataset have.
In [17]:
cylinders = set(d['cyl'] for d in mpg)
print(cylinders)
{'5', '8', '4', '6'}

Here's a more complex example where we are grouping the cars by number of cylinder, and finding the average cty mpg for each group.
In [18]:
CtyMpgByCyl = []

for c in cylinders: # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg: # iterate over all dictionaries
        if d['cyl'] == c: # if the cylinder level type matches,
            summpg += float(d['cty']) # add the cty mpg
            cyltypecount += 1 # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')

CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl
Out[18]:
[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]

Use `set` to return the unique values for the class types in our dataset.
In [19]:
vehicleclass = set(d['class'] for d in mpg) # what are the class types
vehicleclass
Out[19]:
{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.
In [20]:
HwyMpgByClass = []

for t in vehicleclass: # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg: # iterate over all dictionaries
        if d['class'] == t: # if the cylinder amount type matches,
            summpg += float(d['hwy']) # add the hwy mpg
            vclasscount += 1 # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')

HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass
Out[20]:
[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]

[^top]

Last updated 2020-12-07 18:23:55.089159 IST

Comments