Pandas Series


Creating pandas Series

We are going to explore the pandas Series structure and learn how to store and manipulate single dimensional indexed data in the Series object.

The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in an order, and there are labels with which you can retrieve them. An easy way to visualize this is two columns of data: the first is the special index, a lot like keys in a dictionary, while the second is your actual data. It's important to note that the data column has a label of its own and can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.

In [1]:
# Importing pandas
import pandas as pd


We can create a Series by passing in a list of values. When we do this, pandas automatically assigns an index starting with zero and sets the name of the Series to None. One of the easiest ways to create a Series is to use an array-like object, like a list.

In [2]:
strings = ['a', 'b', 'c', 'd']
pd.Series(strings)
Out[2]:
0    a
1    b
2    c
3    d
dtype: object


The result is a Series object which is nicely rendered to the screen. We see here that pandas has automatically identified the type of data in this Series as "object" and set the dtype parameter accordingly. We also see that the values are indexed with integers, starting at zero.

We don't have to use strings. If we pass in a list of whole numbers, for instance, pandas sets the type to int64. Underneath, pandas stores Series values in a typed array using the NumPy library. This offers a significant speedup when processing data compared to traditional Python lists.

In [3]:
numbers = [4,2,3,9,0]
pd.Series(numbers)
Out[3]:
0    4
1    2
2    3
3    9
4    0
dtype: int64
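
As mentioned earlier, the data column of a Series can carry a label of its own, retrievable via the .name attribute. Here is a minimal sketch (the label 'scores' is just an illustrative choice):

# Giving the Series a name when creating it; the label can be read back via .name
scores = pd.Series(numbers, name='scores')
scores.name    # 'scores'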

[^top]

Missing data

In Python, we have the None type to indicate a lack of data. If we create a list of strings and one of the elements is None, pandas inserts it as a None and uses the type object for the underlying array.

In [4]:
strings = ['a', 'b', 'c', 'd', None]
pd.Series(strings)
Out[4]:
0       a
1       b
2       c
3       d
4    None
dtype: object


If we create a list of numbers, integers or floats, and put in the None type, pandas automatically converts this to a special floating point value designated as NaN, which stands for "Not a Number".

In [5]:
numbers = [4,2,3,9,0,None]
pd.Series(numbers)
Out[5]:
0    4.0
1    2.0
2    3.0
3    9.0
4    0.0
5    NaN
dtype: float64


pandas represents NaN as a floating point number, and because integers can be typecast to floats, pandas converted our integers to floats. So if you're wondering why the list of integers you put into a Series came back as floats, it's probably because there is some missing data.
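
If you need to keep integer values alongside missing data, newer pandas versions also offer a nullable integer dtype. This is a small sketch, assuming a pandas version that supports the 'Int64' extension dtype (0.24 or later):

# The nullable 'Int64' extension dtype keeps the values as integers
# and shows the missing entry as <NA> instead of converting everything to float
pd.Series([4, 2, 3, 9, 0, None], dtype='Int64')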

NaN is not the same as None, though both are used to denote missing data.

In [6]:
import numpy as np
# Comparing NaN to None
np.nan == None
Out[6]:
False


It is also worth noting that two NaN values cannot be equated, so it's not possible to check whether a value is NaN using a comparison operator like ==.

In [7]:
np.nan == np.nan
Out[7]:
False


Instead, we need to use special functions to test for the presence of not a number, such as NumPy's isnan().

In [8]:
np.isnan(np.nan)
Out[8]:
True
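
pandas also ships its own helpers for detecting missing values, which treat NaN and None the same way. A small sketch:

# pd.isna() works on scalars as well as whole Series
pd.isna(np.nan)             # True
pd.isna(None)               # True
pd.Series(numbers).isna()   # element-wise boolean Series, True where the value is missing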


A Series can be created directly from dictionary data. If you do this, the index is automatically built from the keys of the dictionary that you provided rather than incrementing integers.

In [9]:
# Here's a dictionary of fictional nations/kingdoms and their capital cities
capital_cities={'Azure':'Azure City',
               'Land of Oz':'Emerald City',
               'Discworld':'Ankh-Morpork',
               'Carja':'Meridian',
               'Kingdom of Loathing':'Seaside Town'}
cc=pd.Series(capital_cities)
cc
Out[9]:
Azure                    Azure City
Land of Oz             Emerald City
Discworld              Ankh-Morpork
Carja                      Meridian
Kingdom of Loathing    Seaside Town
dtype: object


We see that, since it was string data, pandas set the data type of the Series to "object". The index, the first column, is also a list of strings.

We can get the index object using the index attribute.

In [10]:
cc.index
Out[10]:
Index(['Azure', 'Land of Oz', 'Discworld', 'Carja', 'Kingdom of Loathing'], dtype='object')


We can also separate the index creation from the data by explicitly passing in the index as a list to the Series constructor.

In [11]:
cc=pd.Series(['Azure City', 'Emerald City', 'Ankh-Morpork', 'Meridian', 'Seaside Town'],
            index=['Azure', 'Land of Oz', 'Discworld', 'Carja', 'Kingdom of Loathing'])
cc
Out[11]:
Azure                    Azure City
Land of Oz             Emerald City
Discworld              Ankh-Morpork
Carja                      Meridian
Kingdom of Loathing    Seaside Town
dtype: object


We can create a pandas Series from a dictionary and also provide an index list. In this case, pandas will create a Series with the index from the index list instead of the dictionary keys. It then matches the values in the index list with the dictionary keys. If there is a match, the key-value pair is included in the Series. If there is no match, the value for that index is assigned NaN.

In [12]:
pd.Series(capital_cities,index=['Land of Oz', 'Discworld', 'Carja', 'Coruscant'])
Out[12]:
Land of Oz    Emerald City
Discworld     Ankh-Morpork
Carja             Meridian
Coruscant              NaN
dtype: object

[^top]

Querying pandas Series

A pandas Series can be queried either by the index position or by the index label. If you don't give an index to the Series when creating it, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc[] attribute. To query by the index label, use the loc[] attribute.

If we wanted to see the fourth entry we would use the iloc[] attribute with the parameter 3.

In [13]:
cc.iloc[3]
Out[13]:
'Meridian'


We can query the same element using loc[] by specifying the index.

In [14]:
cc.loc['Carja']
Out[14]:
'Meridian'
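
Besides single positions or labels, iloc[] and loc[] also accept slices and lists. This is a small sketch; note that label-based slicing with loc[] includes both endpoints:

cc.iloc[1:3]                    # positions 1 and 2: 'Land of Oz' and 'Discworld'
cc.loc['Land of Oz':'Carja']    # labels from 'Land of Oz' through 'Carja', endpoints included
cc.loc[['Azure', 'Carja']]      # an explicit list of labels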


Note that the indexing operators iloc and loc are not methods, but attributes, so square brackets are used to query them instead of parentheses.

In some cases, the usage of iloc[] and loc[] is optional, as pandas automatically understands whether we are querying by index position or by index label based on the input. This works when the index values are not integers.

In [15]:
cc[3]
Out[15]:
'Meridian'
In [16]:
cc['Carja']
Out[16]:
'Meridian'

However, if the indices are numbers, omitting iloc[] or loc[] could be problematic.

In [17]:
# Prime numbers and their squares
prime_squares={2:4,3:9,5:25,7:49}
ps=pd.Series(prime_squares)
ps
Out[17]:
2     4
3     9
5    25
7    49
dtype: int64


Executing ps[0] throws a KeyError because there's no item in the Series with an index label of zero.

In [18]:
ps[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\miniconda3\envs\coursera\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-18-f6da4e651934> in <module>
----> 1 ps[0]

~\miniconda3\envs\coursera\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if is_hashable(key):

~\miniconda3\envs\coursera\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    987 
    988         # Similar to Index.get_value, but we do not fall back to positional
--> 989         loc = self.index.get_loc(label)
    990         return self.index._get_values_for_loc(self, loc, label)
    991 

~\miniconda3\envs\coursera\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 0

It's a good practice to use iloc[] or loc[] explicitly to avoid such errors.

In [19]:
ps.iloc[0]
Out[19]:
4
In [20]:
ps.loc[2]
Out[20]:
4
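
Another defensive option is the Series get() method, which behaves like dict.get(): it looks up by label and returns a default value instead of raising a KeyError when the label is missing. A small sketch:

ps.get(2)             # 4: label-based lookup
ps.get(0)             # label 0 is absent, so None is returned instead of a KeyError
ps.get(0, 'missing')  # 'missing' is returned as the fallback value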

[^top]

Pandas is fast and concise

When working with data, we usually deal with performing operations on large sets of numbers. This could be trying to find a certain number, summarizing the data, or transforming it in some way.

Let's generate a list of 10,000 random numbers and perform some basic operations on them using lists, then with NumPy and pandas, and compare the time taken to perform these operations.

  1. We will first generate 10,000 random numbers.
  2. Perform a basic arithmetic operation of adding 5 to each.
  3. Calculate the average of the numbers.

We will time these operations using the IPython magic function %%timeit. It runs the code in a cell a specified number of times (1,000 in this case) and calculates the average execution time. Cell magic functions are preceded by two percent signs and must appear on the first line of the cell.

In [21]:
import random
import numpy as np
import pandas as pd

Lists

In [22]:
# Generating a list of 10000 random numbers
rand_list=[]
for i in range(10000):
    rand_list.append(random.random())
# Adding 5 to each number
for i in range(len(rand_list)):
    rand_list[i]+=5
# Calculating the mean
rand_sum=0
for i in rand_list:
    rand_sum+=i
rand_mean=rand_sum/len(rand_list)
print(rand_sum,rand_mean)
55009.73485016286 5.500973485016286


Now let's time it for an average of 1000 runs.

In [23]:
%%timeit -n 1000
# Generating a list of 10000 random numbers
rand_list=[]
for i in range(10000):
    rand_list.append(random.random())
# Adding 5 to each number
for i in range(len(rand_list)):
    rand_list[i]+=5
# Calculating the mean
rand_sum=0
for i in rand_list:
    rand_sum+=i
rand_mean=rand_sum/len(rand_list)
15.2 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It took about 15 milliseconds to perform the above operations using lists.

NumPy

Let's try doing the same with NumPy.

In [24]:
# Generating an array of 10000 random numbers
rand_array=np.random.random(10000)
# Adding 5 to each number
rand_array=rand_array+5
# Calculating the mean
rand_mean=rand_array.mean()
rand_mean
Out[24]:
5.502927873805677


Let's time it for an average of 1000 runs.

In [25]:
%%timeit -n 1000
# Generating an array of 10000 random numbers
rand_array=np.random.random(10000)
# Adding 5 to each number
rand_array=rand_array+5
# Calculating the mean
rand_mean=rand_array.mean()
661 µs ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It took only about 0.7 milliseconds with NumPy, which is roughly 20 times faster than with lists.

pandas

Let's try the same with pandas and see how fast it performs.

In [26]:
# Generating an array of 10000 random numbers
rand_series=pd.Series(np.random.random(10000))
# Adding 5 to each number
rand_series=rand_series+5
# Calculating the mean
rand_mean=rand_series.mean()
rand_mean
Out[26]:
5.497772539213378


Timing it for 1000 runs

In [27]:
%%timeit -n 1000
# Generating an array of 10000 random numbers
rand_series=pd.Series(np.random.random(10000))
# Adding 5 to each number
rand_series=rand_series+5
# Calculating the mean
rand_mean=rand_series.mean()
2.36 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It took about 2 milliseconds to perform these operations using a pandas Series. This is significantly faster than using lists. This is because pandas and the underlying NumPy library support a method of computation called vectorization. Vectorization works with most of the functions in the NumPy library, including the sum function. Put more simply, vectorization is the ability of a computer to execute multiple instructions at once, and with high-performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of instructions in parallel. We should start thinking in terms of functional programming (as opposed to object-oriented programming) to utilise the power of parallel processing.
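
As a small illustration of what vectorization means in practice, the operations above never loop explicitly in Python: the scalar 5 is broadcast to every element, and the sum runs inside NumPy's compiled code. A minimal sketch:

# Broadcasting and a vectorized reduction: no explicit Python loop is needed
shifted = rand_series + 5      # 5 is broadcast to all 10,000 elements at once
total = np.sum(rand_series)    # equivalent to rand_series.sum(), computed in compiled code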

I was wondering why the pandas Series was slower than the NumPy array. Here is an explanation. While creating a pandas Series, pandas still depends on Python and makes calls to several Python functions. Below is a visualization of pandas indexing a Series; each coloured arc is a different function call in Python.

[Figure: pandas — Python function calls made while indexing a pandas Series]

In contrast, NumPy performs all the array indexing by itself and nothing is visible to Python. As you can see below, there are no Python function calls when using NumPy.

[Figure: NumPy — no Python function calls when indexing a NumPy array]

Another detailed comparison of execution times between pandas and NumPy is available here. According to that comparison, though NumPy is faster than pandas on smaller sets of data, pandas is equally fast and may be faster than NumPy as the size of the dataset grows. pandas also offers a 2-D DataFrame structure and better usability than NumPy for data science operations.

[^top]

Manipulating pandas Series data

The loc[] and iloc[] attributes let us not only view data but also modify it in place or add new data. If the value you pass in as the index doesn't exist, a new entry is added.

In [28]:
# A series of 5 random numbers between 0 and 9 (inclusive)
x=pd.Series(np.random.randint(0,10,5))
x
Out[28]:
0    9
1    5
2    9
3    6
4    0
dtype: int32


Let's modify the third element (at position 2) and change its value to 15.

In [29]:
x.iloc[2]=15
x
Out[29]:
0     9
1     5
2    15
3     6
4     0
dtype: int32


We can add a new element to the series using loc[].

In [30]:
x.loc['nine']=9
x
Out[30]:
0        9
1        5
2       15
3        6
4        0
nine     9
dtype: int64


The indices need not be of the same data type, but the Series data must be of a single type. pandas will typecast the data to an appropriate type if needed. In this case, adding a floating point number changed all the other integers to floating point numbers.

In [31]:
x.loc['decimal']=20.5
x
Out[31]:
0           9.0
1           5.0
2          15.0
3           6.0
4           0.0
nine        9.0
decimal    20.5
dtype: float64


Adding a string converted the data type of the series to object.

In [32]:
x.loc['string']='zero'
x
Out[32]:
0             9
1             5
2            15
3             6
4             0
nine          9
decimal    20.5
string     zero
dtype: object


Unlike the keys of a relational database, the indices of a pandas Series do not have to be unique.

In [33]:
y=pd.Series([0.9,9,9.9,99],index=['nine','nine','nine','nine'])
y
Out[33]:
nine     0.9
nine     9.0
nine     9.9
nine    99.0
dtype: float64


We can combine pandas Series into a new Series using append().

In [34]:
z=x.append(y)
z
Out[34]:
0             9
1             5
2            15
3             6
4             0
nine          9
decimal    20.5
string     zero
nine        0.9
nine          9
nine        9.9
nine         99
dtype: object
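
Note that Series.append() has since been deprecated and was removed in pandas 2.0; pd.concat() produces the same result. A small sketch assuming a recent pandas version:

# Equivalent to x.append(y) on newer pandas versions
z = pd.concat([x, y])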


When we query a pandas Series by an index label, the result is all the values that match that label.

In [35]:
z.loc['nine']
Out[35]:
nine      9
nine    0.9
nine      9
nine    9.9
nine     99
dtype: object

[^top]

Last updated 2020-12-20 17:49:22.201266 IST
