Reading, Writing CSV files & Basic Data Analysis without Pandas

Krishnakanth Allika

2020-12-07 18:13

Comments

wordcloud

Contents

Reading a CSV file
Writing a CSV file
Basic Data Analysis

Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

mpg : miles per gallon
class : car classification
cty : city mpg
cyl : # of cylinders
displ : engine displacement in liters
drv : f = front-wheel drive, r = rear wheel drive, 4 = 4wd
fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
hwy : highway mpg
manufacturer : automobile manufacturer
model : model of car
trans : type of transmission
year : model year

Reading a CSV file ¶

The datafile mpg.csv is located in my GitLab public repository at https://gitlab.com/datasets_a/mpg-data. The URL for accessing the raw file is https://gitlab.com/datasets_a/mpg-data/-/raw/master/mpg.csv

In this example, we will be using Python's native csv module to read and write CSV files. The csv module offers the following functions to read/write files:

Reading
- csv.reader
- csv.DictReader
Writing
- csv.writer
- csv.DictWriter

If you are interested in different ways of reading CSV files from GitLab, especially using the python-gitlab package, I tried to illustrate them here

csv.reader¶

In [1]:

import csv, requests
from io import StringIO

url = "https://gitlab.com/datasets_a/mpg-data/-/raw/master/mpg.csv"
# Reading URL data
s = requests.get(url).content
# Decoding and formatting
csvdata = list(csv.reader(StringIO(s.decode('utf-8'))))
# Printing the first 5 rows
for row in csvdata[:5]:
    print(row)

['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']
['1', 'audi', 'a4', '1.8', '1999', '4', 'auto(l5)', 'f', '18', '29', 'p', 'compact']
['2', 'audi', 'a4', '1.8', '1999', '4', 'manual(m5)', 'f', '21', '29', 'p', 'compact']
['3', 'audi', 'a4', '2', '2008', '4', 'manual(m6)', 'f', '20', '31', 'p', 'compact']
['4', 'audi', 'a4', '2', '2008', '4', 'auto(av)', 'f', '21', '30', 'p', 'compact']

csv.DictReader¶

The data can also be read into a Python dictionary format using the csv.DictReader function.

In [2]:

dictdata = list(csv.DictReader(StringIO(s.decode('utf-8'))))
# Printing the first 2 rows
dictdata[:2]

Out[2]:

[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'}]

[^top]

Writing a CSV file ¶

csv.writer¶

In [3]:

filepath='../../files/py/007/mpg.csv'

In [4]:

with open(filepath,'w',newline='') as f:
    # Creating a csv writer object
    fwriter=csv.writer(f)
    # writing data
    fwriter.writerows(csvdata)

Screenshot of the output file:

In case, the fields and records are not available on one place, they can be written seperately. Let's seperate the fields and records from csvdata and write them to a file.

In [5]:

fields=csvdata[0] # The first element of the list
print(fields)

['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']

In [6]:

records=csvdata[1:] # This is rest of the data
#printing the first 5 records
for record in records[:5]:
    print(record)

['1', 'audi', 'a4', '1.8', '1999', '4', 'auto(l5)', 'f', '18', '29', 'p', 'compact']
['2', 'audi', 'a4', '1.8', '1999', '4', 'manual(m5)', 'f', '21', '29', 'p', 'compact']
['3', 'audi', 'a4', '2', '2008', '4', 'manual(m6)', 'f', '20', '31', 'p', 'compact']
['4', 'audi', 'a4', '2', '2008', '4', 'auto(av)', 'f', '21', '30', 'p', 'compact']
['5', 'audi', 'a4', '2.8', '1999', '6', 'auto(l5)', 'f', '16', '26', 'p', 'compact']

In [7]:

# Writing to a csv file
with open(filepath,'w',newline='') as f:
    # Creating a csv writer object
    fwriter=csv.writer(f)
    # Writing the fields
    fwriter.writerow(fields)
    # writing data (records)
    fwriter.writerows(records)

Screenshot of the output file:

csv.DictWriter¶

csv.DictWriter operates like a regular writer but maps dictionaries onto output rows.

In [8]:

fields=list(dictdata[0].keys())
print(fields)

['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class']

In [9]:

with open(filepath,'w',newline='') as f:
    # Creating a csv DictWriter object
    fwriter=csv.DictWriter(f,fieldnames=fields)
    # Writing fields
    fwriter.writeheader()
    # writing records
    fwriter.writerows(dictdata)

Screenshot of the output file:

[^top]

Basic Data Analysis ¶

In [10]:

import csv 

with open('../../files/py/007/mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))
    
mpg[:2]

Out[10]:

[{'': '1',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'auto(l5)',
  'drv': 'f',
  'cty': '18',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'},
 {'': '2',
  'manufacturer': 'audi',
  'model': 'a4',
  'displ': '1.8',
  'year': '1999',
  'cyl': '4',
  'trans': 'manual(m5)',
  'drv': 'f',
  'cty': '21',
  'hwy': '29',
  'fl': 'p',
  'class': 'compact'}]

csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is comprised of 234 dictionaries.

In [11]:

len(mpg)

Out[11]:

*keys* gives us the column names of our csv.

In [12]:

mpg[0].keys()

Out[12]:

dict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.

In [13]:

total_mpg=0
for record in mpg:
    total_mpg+=float(record['cty'])
avg_mpg=total_mpg/len(mpg)
print(avg_mpg)

16.858974358974358

The same can be performed using list comprehension.

In [14]:

avg_mpg=sum(float(record['cty']) for record in mpg) / len(mpg)
print(avg_mpg)

16.858974358974358

The number of decimal points displayed in the output can be controlled by the magic function %precision. Let's limit the precision to two decimals. Note that this magic function does not work if we use the print function.

In [15]:

%precision 2
avg_mpg

Out[15]:

16.86

Similarly this is how to find the average hwy fuel economy across all cars.

In [16]:

sum(float(d['hwy']) for d in mpg) / len(mpg)

Out[16]:

23.44

Use `set` to return the unique values for the number of cylinders the cars in our dataset have.

In [17]:

cylinders = set(d['cyl'] for d in mpg)
print(cylinders)

{'5', '8', '4', '6'}

Here's a more complex example where we are grouping the cars by number of cylinder, and finding the average cty mpg for each group.

In [18]:

CtyMpgByCyl = []

for c in cylinders: # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg: # iterate over all dictionaries
        if d['cyl'] == c: # if the cylinder level type matches,
            summpg += float(d['cty']) # add the cty mpg
            cyltypecount += 1 # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')

CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl

Out[18]:

[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]

Use `set` to return the unique values for the class types in our dataset.

In [19]:

vehicleclass = set(d['class'] for d in mpg) # what are the class types
vehicleclass

Out[19]:

{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.

In [20]:

HwyMpgByClass = []

for t in vehicleclass: # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg: # iterate over all dictionaries
        if d['class'] == t: # if the cylinder amount type matches,
            summpg += float(d['hwy']) # add the hwy mpg
            vclasscount += 1 # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')

HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass

Out[20]:

[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]

[^top]

Last updated 2020-12-07 18:23:55.089159 IST