Hierarchical Clustering
R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
setwd("C:/images")
library(swirl)
| Hi! Type swirl() when you are ready to begin.
swirl()
| Welcome to swirl! Please sign in. If you've been here before, use the same name as
| you did then. If you are new, call yourself something unique.
What shall I call you? Krishnakanth Allika
| Please choose a course, or type 0 to exit swirl.
1: Exploratory Data Analysis
2: Take me to the swirl course repository!
Selection: 1
| Please choose a lesson, or type 0 to return to course menu.
1: Principles of Analytic Graphs 2: Exploratory Graphs
3: Graphics Devices in R 4: Plotting Systems
5: Base Plotting System 6: Lattice Plotting System
7: Working with Colors 8: GGPlot2 Part1
9: GGPlot2 Part2 10: GGPlot2 Extras
11: Hierarchical Clustering 12: K Means Clustering
13: Dimension Reduction 14: Clustering Example
15: CaseStudy
Selection: 11
| Attempting to load lesson dependencies...
| Package ‘ggplot2’ loaded correctly!
| This lesson requires the ‘fields’ package. Would you like me to install it for you
| now?
1: Yes
2: No
Selection: 1
| Trying to install package ‘fields’ now...
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
also installing the dependencies ‘dotCall64’, ‘spam’, ‘maps’
package ‘dotCall64’ successfully unpacked and MD5 sums checked
package ‘spam’ successfully unpacked and MD5 sums checked
package ‘maps’ successfully unpacked and MD5 sums checked
package ‘fields’ successfully unpacked and MD5 sums checked
| Package ‘fields’ loaded correctly!
| Package ‘jpeg’ loaded correctly!
| Package ‘datasets’ loaded correctly!
| | 0%
| Hierarchical_Clustering. (Slides for this and other Data Science courses may be
| found at github https://github.com/DataScienceSpecialization/courses/. If you care
| to use them, they must be downloaded as a zip file and viewed locally. This lesson
| corresponds to 04_ExploratoryAnalysis/hierarchicalClustering.)
...
|= | 2%
| In this lesson we'll learn about hierarchical clustering, a simple way of quickly
| examining and displaying multi-dimensional data. This technique is usually most
| useful in the early stages of analysis when you're trying to get an understanding of
| the data, e.g., finding some pattern or relationship between different factors or
| variables. As the name suggests hierarchical clustering creates a hierarchy of
| clusters.
...
|== | 3%
| Clustering organizes data points that are close into groups. So obvious questions
| are "How do we define close?", "How do we group things?", and "How do we interpret
| the grouping?" Cluster analysis is a very important topic in data analysis.
...
|==== | 5%
| To give you an idea of what we're talking about, consider these random points we
| generated. We'll use them to demonstrate hierarchical clustering in this lesson.
| We'll do this in several steps, but first we have to clarify our terms and concepts.
...
|===== | 6%
| Hierarchical clustering is an agglomerative, or bottom-up, approach. From Wikipedia
| (http://en.wikipedia.org/wiki/Hierarchical_clustering), we learn that in this
| method, "each observation starts in its own cluster, and pairs of clusters are
| merged as one moves up the hierarchy." This means that we'll find the closest two
| points and put them together in one cluster, then find the next closest pair in the
| updated picture, and so forth. We'll repeat this process until we reach a reasonable
| stopping place.
...
|====== | 8%
| Note the word "reasonable". There's a lot of flexibility in this field and how you
| perform your analysis depends on your problem. Again, Wikipedia tells us, "one can
| decide to stop clustering either when the clusters are too far apart to be merged
| (distance criterion) or when there is a sufficiently small number of clusters
| (number criterion)."
...
|======= | 10%
| First, how do we define close? This is the most important step and there are several
| possibilities depending on the questions you're trying to answer and the data you
| have. Distance or similarity are usually the metrics used.
...
|========= | 11%
| In the given plot, which pair of points would you cluster first? Use distance as the
| metric.
1: 5 and 6
2: 1 and 4
3: 10 and 12
4: 7 and 8
Selection: 1
| Perseverance, that's the answer.
|========== | 13%
| It's pretty obvious that out of the 4 choices, the pair 5 and 6 were the closest
| together. However, there are several ways to measure distance or similarity.
| Euclidean distance and correlation similarity are continuous measures, while
| Manhattan (city block) distance suits grid-constrained data. In this lesson we'll just briefly discuss
| the first and last of these. It's important that you use a measure of distance that
| fits your problem.
...
|=========== | 15%
| Euclidean distance is what you learned about in high school algebra. Given two
| points on a plane, (x1,y1) and (x2,y2), the Euclidean distance is the square root of
| the sum of the squares of the differences between the two x-coordinates (x1-x2) and
| the two y-coordinates (y1-y2). You probably recognize this as an application of the
| Pythagorean theorem which yields the length of the hypotenuse of a right triangle.
...
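To make the formula concrete, here is a minimal sketch (not part of the swirl lesson) using two made-up points, p1 and p2:

# Euclidean distance between two points, by hand and with dist()
p1 <- c(0, 0)
p2 <- c(3, 4)
sqrt(sum((p1 - p2)^2))   # 5, straight from the Pythagorean theorem
dist(rbind(p1, p2))      # 5, using R's built-in distance function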
|============ | 16%
| It shouldn't be hard to believe that this generalizes to more than two dimensions as
| shown in the formula at the bottom of the picture shown here.
...
|============== | 18%
| Euclidean distance is distance "as the crow flies". Many applications, however,
| can't realistically use crow-flying distance. Cars, for instance, have to follow
| roads.
...
|=============== | 19%
| In this case, we can use Manhattan or city block distance (also known as a taxicab
| metric). This picture, copied from http://en.wikipedia.org/wiki/Taxicab_geometry,
| shows what this means.
...
|================ | 21%
| You want to travel from the point at the lower left to the one on the top right. The
| shortest distance is the Euclidean (the green line), but you're limited to the grid,
| so you have to follow a path similar to those shown in red, blue, or yellow. These
| all have the same length (12) which is the number of small gray segments covered by
| their paths.
...
|================= | 23%
| More formally, Manhattan distance is the sum of the absolute values of the differences
| between each coordinate, so the distance between the points (x1,y1) and (x2,y2) is
| |x1-x2|+|y1-y2|. As with Euclidean distance, this too generalizes to more than 2
| dimensions.
...
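Continuing the sketch above with the same made-up points p1 and p2:

# Manhattan (city block) distance between the same two points
sum(abs(p1 - p2))                          # 7 = |0 - 3| + |0 - 4|
dist(rbind(p1, p2), method = "manhattan")  # 7, via dist()'s manhattan metric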
|=================== | 24%
| Now we'll go back to our random points. You might have noticed that these points
| don't really look randomly positioned, and in fact, they're not. They were actually
| generated as 3 distinct clusters. We've put the coordinates of these points in a
| data frame for you, called dataFrame.
...
|==================== | 26%
| We'll use this dataFrame to demonstrate an agglomerative (bottom-up) technique of
| hierarchical clustering and create a dendrogram. This is an abstract picture (or
| graph) which shows how the 12 points in our dataset cluster together. Two clusters
| (initially, these are points) that are close are connected with a line. We'll use
| Euclidean distance as our metric of closeness.
...
|===================== | 27%
| Run the R command dist with the argument dataFrame to compute the distances between
| all pairs of these points. By default dist uses Euclidean distance as its metric,
| but other metrics, such as Manhattan, are available. Just use the default.
dist(dataFrame)
1 2 3 4 5 6 7
2 0.34120511
3 0.57493739 0.24102750
4 0.26381786 0.52578819 0.71861759
5 1.69424700 1.35818182 1.11952883 1.80666768
6 1.65812902 1.31960442 1.08338841 1.78081321 0.08150268
7 1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
8 1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931 0.65062566
9 2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116 1.28582631
10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201 1.32063059
11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373 1.37369662
12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231 1.17740375
8 9 10 11
2
3
4
5
6
7
8
9 1.76460709
10 1.83517785 0.14090406
11 1.86999431 0.11624471 0.08317570
12 1.66223814 0.10848966 0.19128645 0.20802789
| Great job!
|====================== | 29%
| You see that the output is a lower triangular matrix with rows numbered from 2 to 12
| and columns numbered from 1 to 11. Entry (i,j) indicates the distance between points
| i and j. Clearly you need only a lower triangular matrix since the distance between
| points i and j equals that between j and i.
...
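If you'd rather not scan the triangle by eye, a short sketch (using the lesson's dataFrame) pulls out the smallest pairwise distance and the pair that achieves it:

# Find the minimum pairwise distance programmatically
d <- as.matrix(dist(dataFrame))
diag(d) <- NA                                     # ignore the zero self-distances
min(d, na.rm = TRUE)                              # the smallest distance
which(d == min(d, na.rm = TRUE), arr.ind = TRUE)  # the pair of points involved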
|======================== | 31%
| From the output of dist, what is the minimum distance between two points?
1: 0.08317
2: -0.0700
3: 0.1085
4: 0.0815
Selection: 4
| Excellent job!
|========================= | 32%
| So 0.0815 (units are unspecified) between points 5 and 6 is the shortest distance.
| We can put these points in a single cluster and look for another close pair of
| points.
...
|========================== | 34%
| Looking at the picture, what would be another good pair of points to put in another
| cluster given that 5 and 6 are already clustered?
1: 7 and the cluster containing 5 and 6
2: 10 and 11
3: 7 and 8
4: 1 and 4
Selection: 2
| You are amazing!
|=========================== | 35%
| So 10 and 11 are another pair of points that would be in a second cluster. We'll
| start creating our dendrogram now. Here are the original plot and two beginning
| pieces of the dendrogram.
...
|============================= | 37%
| We can keep going like this in the obvious way and pair up individual points, but as
| luck would have it, R provides a simple function which you can call which creates a
| dendrogram for you. It's called hclust() and takes as an argument the pairwise
| distance matrix which we looked at before. We've stored this matrix for you in a
| variable called distxy. Run hclust now with distxy as its argument and put the
| result in the variable hc.
hc<-hclust(distxy)
| Perseverance, that's the answer.
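If you are replaying this outside swirl, distxy is just the distance matrix computed earlier, so the whole pipeline is two lines (a sketch):

distxy <- dist(dataFrame)  # pairwise Euclidean distances from before
hc <- hclust(distxy)       # agglomerative clustering (complete linkage by default)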
|============================== | 39%
| You're probably curious and want to see hc.
...
|=============================== | 40%
| Call the R function plot with one argument, hc.
plot(hc)
| That's correct!
|================================ | 42%
| Nice plot, right? R's plot conveniently labeled everything for you. The points we
| saw are the leaves at the bottom of the graph, 5 and 6 are connected, as are 10 and
| 11. Moreover, we see that the original 3 groupings of points are closest together as
| leaves on the picture. That's reassuring. Now call plot again, this time with the
| argument as.dendrogram(hc).
plot(as.dendrogram(hc))
| Keep up the great work!
|================================== | 44%
| The essentials are the same, but the labels are missing and the leaves (original
| points) are all printed at the same level. Notice that the vertical heights of the
| lines and labeling of the scale on the left edge give some indication of distance.
| Use the R command abline to draw a horizontal blue line at 1.5 on this plot. Recall
| that this requires 2 arguments, h=1.5 and col="blue".
abline(h=1.5,col="blue")
| Keep working like that and you'll get there!
|=================================== | 45%
| We see that this blue line intersects 3 vertical lines and this tells us that using
| the distance 1.5 (unspecified units) gives us 3 clusters (1 through 4), (9 through
| 12), and (5 through 8). We call this a "cut" of our dendrogram. Now cut the
| dendrogram by drawing a red horizontal line at .4.
abline(h=0.4,col="red")
| You are really on a roll!
|==================================== | 47%
| How many clusters are there with a cut at this distance?
5
[1] 5
| Keep up the great work!
|===================================== | 48%
| We see that by cutting at .4 we have 5 clusters, indicating that this distance is
| small enough to break up our original grouping of points. If we drew a horizontal
| line at .05, how many clusters would we get?
5
[1] 5
| That's not exactly what I'm looking for. Try again. Or, type info() for more
| options.
| Recall that our shortest distance was around .08, so a distance smaller than that
| would make all the points their own private clusters.
12
[1] 12
| Perseverance, that's the answer.
|====================================== | 50%
| Try it now (draw a horizontal line at .05) and make the line green.
abline(h=0.05,col="green")
| Your dedication is inspiring!
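Rather than counting crossed lines by eye, cutree() returns the cluster membership at any cut height (a sketch, not part of the lesson):

# Number of clusters at each of the three cut heights we drew
length(unique(cutree(hc, h = 1.5)))   # 3 clusters
length(unique(cutree(hc, h = 0.4)))   # 5 clusters
length(unique(cutree(hc, h = 0.05)))  # 12 clusters: every point on its own
table(cutree(hc, h = 1.5))            # sizes of the 3 clusters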
|======================================== | 52%
| So the number of clusters in your data depends on where you draw the line! (We said
| there's a lot of flexibility here.) Now that we've seen the practice, let's go back
| to some "theory". Notice that the two original groupings, 5 through 8, and 9 through
| 12, are connected with a horizontal line near the top of the display. You're
| probably wondering how distances between clusters of points are measured.
...
|========================================= | 53%
| There are several ways to do this. We'll just mention two. The first is called
| complete linkage and it says that if you're trying to measure a distance between two
| clusters, take the greatest distance between the pairs of points in those two
| clusters. Obviously such pairs contain one point from each cluster.
...
|========================================== | 55%
| So if we were measuring the distance between the two clusters of points (1 through
| 4) and (5 through 8), using complete linkage as the metric we would use the distance
| between points 4 and 8 as the measure since this is the largest distance between the
| pairs of those groups.
...
|=========================================== | 56%
| The distance between the two clusters of points (9 through 12) and (5 through 8),
| using complete linkage as the metric, is the distance between points 11 and 8 since
| this is the largest distance between the pairs of those groups.
...
|============================================= | 58%
| As luck would have it, the distance between the two clusters of points (9 through
| 12) and (1 through 4), using complete linkage as the metric, is the distance between
| points 11 and 4.
...
|============================================== | 60%
| We've created the dataframe dFsm for you containing these 3 points, 4, 8, and 11.
| Run dist on dFsm to see what the smallest distance between these 3 points is.
dist(dFsm)
1 2
2 2.028495
3 2.374620 1.869994
| Keep up the great work!
|=============================================== | 61%
| We see that the smallest distance is between points 2 and 3 in this reduced set,
| (these are actually points 8 and 11 in the original set), indicating that the two
| clusters these points represent ((5 through 8) and (9 through 12) respectively)
| would be joined (at a distance of 1.869) before being connected with the third
| cluster (1 through 4). This is consistent with the dendrogram we plotted.
...
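As a quick check (assuming hc from earlier is still in the workspace), the merge heights stored in hc should mirror these complete-linkage distances:

tail(hc$height, 2)  # roughly 1.87 (joining 5-8 with 9-12) and 2.37 (joining that group with 1-4)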
|================================================ | 63%
| The second way to measure a distance between two clusters that we'll just mention is
| called average linkage. First you compute an "average" point in each cluster (think
| of it as the cluster's center of gravity). You do this by computing the mean
| (average) x and y coordinates of the points in the cluster.
...
|================================================== | 65%
| Then you compute the distance between these cluster averages to get the
| intercluster distance.
...
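A sketch of the idea, using the lesson's dataFrame of point coordinates (center1 and center2 are just illustrative names):

# "Average" points (centers of gravity) of two of the clusters
center1 <- colMeans(dataFrame[5:8, ])
center2 <- colMeans(dataFrame[9:12, ])
sqrt(sum((center1 - center2)^2))  # the intercluster distance under this scheme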
|=================================================== | 66%
| Now look at the hierarchical cluster we created before, hc.
hc
Call:
hclust(d = distxy)
Cluster method : complete
Distance : euclidean
Number of objects: 12
| Excellent work!
|==================================================== | 68%
| Which type of linkage did hclust() use to agglomerate clusters?
1: complete
2: average
Selection: 1
| You nailed it! Good job!
|===================================================== | 69%
| In our simple set of data, the average and complete linkages aren't that different,
| but in more complicated datasets the type of linkage you use could affect how your
| data clusters. It is a good idea to experiment with different methods of linkage to
| see the varying ways your data groups. This will help you determine the best way to
| continue with your analysis.
...
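A sketch of that experiment on our small data set (hclust supports several linkages through its method argument):

# Compare three common linkage methods side by side
par(mfrow = c(1, 3))
plot(hclust(distxy, method = "complete"), main = "complete")
plot(hclust(distxy, method = "average"),  main = "average")
plot(hclust(distxy, method = "single"),   main = "single")
par(mfrow = c(1, 1))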
|======================================================= | 71%
| The last method of visualizing data we'll mention in this lesson concerns heat maps.
| Wikipedia (http://en.wikipedia.org/wiki/Heat_map) tells us a heat map is "a
| graphical representation of data where the individual values contained in a matrix
| are represented as colors. ... Heat maps originated in 2D displays of the values in
| a data matrix. Larger values were represented by small dark gray or black squares
| (pixels) and smaller values by lighter squares."
...
|======================================================== | 73%
| You've probably seen many examples of heat maps, for instance weather radar and
| displays of ocean salinity. From Wikipedia (http://en.wikipedia.org/wiki/Heat_map)
| we learn that heat maps are often used in molecular biology "to represent the level
| of expression of many genes across a number of comparable samples (e.g. cells in
| different states, samples from different patients) as they are obtained from DNA
| microarrays."
...
|========================================================= | 74%
| We won't say too much on this topic, but a very nice concise tutorial on creating
| heatmaps in R exists at
| http://sebastianraschka.com/Articles/heatmaps_in_r.html#clustering. Here's an image
| from the tutorial to start you thinking about the topic. It shows a sample heat map
| with a dendrogram on the left edge mapping the relationship between the rows. The
| legend at the top shows how colors relate to values.
...
|========================================================== | 76%
| R provides a handy function to produce heat maps. It's called heatmap. We've put the
| point data we've been using throughout this lesson in a matrix. Call heatmap now
| with 2 arguments. The first is dataMatrix and the second is col set equal to
| cm.colors(25). This last is optional, but we like the colors better than the default
| ones.
heatmap(dataMatrix,col=cm.colors(25))
| That's a job well done!
|============================================================ | 77%
| We see an interesting display of sorts. This is a very simple heat map - simple
| because the data isn't very complex. The rows and columns are grouped together as
| shown by colors. The top rows (labeled 5, 6, and 7) seem to be in the same group
| (same colors) while 8 is next to them but colored differently. This matches the
| dendrogram shown on the left edge. Similarly, 9, 12, 11, and 10 are grouped together
| (row-wise) along with 3 and 2. These are followed by 1 and 4 which are in a separate
| group. Column data is treated independently of rows but is also grouped.
...
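Under the hood, heatmap() builds those row and column dendrograms with the same dist/hclust machinery we have been using; a sketch with the lesson's dataMatrix:

# The row dendrogram heatmap() drew is essentially this clustering of the rows
plot(hclust(dist(dataMatrix)))
# ...and the column dendrogram comes from clustering the transposed matrix
plot(hclust(dist(t(dataMatrix))))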
|============================================================= | 79%
| We've subsetted some vehicle data from mtcars, the Motor Trend Car Road Tests data,
| which is part of the datasets package. The data is in the matrix mt and contains 6
| variables for 11 cars. Run heatmap now with mt as its only argument.
heatmap(mt)
| Keep working like that and you'll get there!
|============================================================== | 81%
| This looks slightly more interesting than the heatmap for the point data. It shows a
| little better how the rows and columns are treated (clustered and colored)
| independently of one another. To understand the disparity in color (between the left
| 4 columns and the right 2) look at mt now.
mt
mpg cyl disp hp drat wt
Dodge Challenger 15.5 8 318.0 150 2.76 3.520
AMC Javelin 15.2 8 304.0 150 3.15 3.435
Camaro Z28 13.3 8 350.0 245 3.73 3.840
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
Fiat X1-9 27.3 4 79.0 66 4.08 1.935
Porsche 914-2 26.0 4 120.3 91 4.43 2.140
Lotus Europa 30.4 4 95.1 113 3.77 1.513
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
Ferrari Dino 19.7 6 145.0 175 3.62 2.770
Maserati Bora 15.0 8 301.0 335 3.54 3.570
Volvo 142E 21.4 4 121.0 109 4.11 2.780
| You're the best!
|=============================================================== | 82%
| See how four of the columns are all relatively small numbers and only two (disp and
| hp) are large? That explains the big difference in color between the column groups. Now to
| understand the grouping of the rows, call plot with one argument, the dendrogram object denmt
| we've created for you.
plot(denmt)
| Excellent work!
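As an aside: because disp and hp sit on much larger numeric scales than the other columns, they dominate the raw distances and hence the clustering. A sketch of what changes if the columns are standardized first (mtScaled is an illustrative name):

mtScaled <- scale(mt)          # center and scale each column
heatmap(mtScaled)              # every variable now carries comparable weight
plot(hclust(dist(mtScaled)))   # compare with the raw-scale dendrogram denmt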
|================================================================= | 84%
| We see that this dendrogram is the one displayed at the side of the heat map. How
| was this created? Recall that we generalized the distance formula for more than 2
| dimensions. We've created a distance matrix for you, distmt. Look at it now.
distmt
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
AMC Javelin 14.00890
Camaro Z28 100.27404 105.57041
Pontiac Firebird 85.80733 99.28330 86.22779
Fiat X1-9 253.64640 240.51305 325.11191 339.12867
Porsche 914-2 206.63309 193.29419 276.87318 292.15588 48.29642
Lotus Europa 226.48724 212.74240 287.59666 311.37656 49.78046
Ford Pantera L 118.69012 123.31494 19.20778 101.66275 336.65679
Ferrari Dino 174.86264 161.03078 216.72821 255.01117 127.67016
Maserati Bora 185.78176 185.02489 102.48902 188.19917 349.02042
Volvo 142E 201.35337 187.68535 266.49555 286.74036 60.40302
Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
AMC Javelin
Camaro Z28
Pontiac Firebird
Fiat X1-9
Porsche 914-2
Lotus Europa 33.75246
Ford Pantera L 288.56998 297.51961
Ferrari Dino 87.81135 80.33743 224.44761
Maserati Bora 303.85577 303.20992 86.84620 223.52346
Volvo 142E 18.60543 27.74042 277.43923 70.27895 289.02233
| You nailed it! Good job!
|================================================================== | 85%
| See how these distances match those in the dendrogram? So hclust really works!
| Let's review now.
...
|=================================================================== | 87%
| What is the purpose of hierarchical clustering?
1: Inspire other researchers
2: Give an idea of the relationships between variables or observations
3: None of the others
4: Present a finished picture
Selection:
Enter an item from the menu, or 0 to exit
Selection: 2
| You are amazing!
|==================================================================== | 89%
| True or False? When you're doing hierarchical clustering there are strict rules that
| you MUST follow.
1: False
2: True
Selection: 1
| You got it!
|====================================================================== | 90%
| True or False? There's only one way to measure distance.
1: False
2: True
Selection: 1
| Excellent work!
|======================================================================= | 92%
| True or False? Complete linkage is a method of computing distances between clusters.
1: True
2: False
Selection: 1
| That's a job well done!
|======================================================================== | 94%
| True or False? Average linkage uses the maximum distance between points of two
| clusters as the distance between those clusters.
1: True
2: False
Selection: 2
| You nailed it! Good job!
|========================================================================= | 95%
| True or False? The number of clusters you derive from your data depends on the
| distance at which you choose to cut it.
1: True
2: False
Selection: 1
| Your dedication is inspiring!
|=========================================================================== | 97%
| True or False? Once you decide basics, such as defining a distance metric and
| linkage method, hierarchical clustering is deterministic.
1: True
2: False
Selection: 1
| You nailed it! Good job!
|============================================================================ | 98%
| Congratulations! We hope this lesson didn't fluster you or get you too heated!
...
|=============================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?
1: No
2: Yes
Selection: 2
What is your email address? xxxxxx@xxxxxxxxxxxx
What is your assignment token? xXxXxxXXxXxxXXXx
Grade submission succeeded!
| That's the answer I was looking for.
| You've reached the end of this lesson! Returning to the main menu...
| Please choose a course, or type 0 to exit swirl.
1: Exploratory Data Analysis
2: Take me to the swirl course repository!
Selection: 0
| Leaving swirl now. Type swirl() to resume.
rm(list=ls())
Last updated 2020-10-02 01:19:36.088624 IST