Exploratory Graphs

Krishnakanth Allika

2020-10-02 00:50

library(swirl)
swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you did
| then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

1: Principles of Analytic Graphs 2: Exploratory Graphs
3: Graphics Devices in R 4: Plotting Systems
5: Base Plotting System 6: Lattice Plotting System
7: Working with Colors 8: GGPlot2 Part1
9: GGPlot2 Part2 10: GGPlot2 Extras
11: Hierarchical Clustering 12: K Means Clustering
13: Dimension Reduction 14: Clustering Example
15: CaseStudy

Selection: 2
| | 0%

| Exploratory_Graphs. (Slides for this and other Data Science courses may be found at github
| https://github.com/DataScienceSpecialization/courses/. If you care to use them, they must
| be downloaded as a zip file and viewed locally. This lesson corresponds to
| 04_ExploratoryAnalysis/exploratoryGraphs.)

Error in (function (srcref) : unimplemented type (29) in 'eval'
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Error in readline("...") :
INTEGER() can only be applied to a 'integer', not a 'unknown type #29'
In addition: Warning message:
In readline("...") : type 29 is unimplemented in 'type2char'

| Leaving swirl now. Type swirl() to resume.

swirl()

| Welcome to swirl! Please sign in. If you've been here before, use the same name as you did
| then. If you are new, call yourself something unique.

What shall I call you? Krishnakanth Allika

| Would you like to continue with one of these lessons?

1: Exploratory Data Analysis Exploratory Graphs
2: No. Let me start something new.

Selection: 1

...

|= | 1%
| In this lesson, we'll discuss why graphics are an important tool for data scientists and
| the special role that exploratory graphs play in the field.

...

|== | 3%
| Which of the following would NOT be a good reason to use graphics in data science?

1: To understand data properties
2: To find a color that best matches the shirt you're wearing
3: To find patterns in data
4: To suggest modeling strategies

Selection: 2

| All that practice is paying off!

...

|==== | 5%
| Which of the following cliches captures the essence of graphics?

1: To err is human, to forgive divine
2: A rose by any other name smells as sweet
3: A picture is worth a 1000 words
4: The apple doesn't fall far from the tree

Selection: 3

| Excellent work!

|====== | 7%
| Exploratory graphs serve mostly the same functions as graphs. They help us find patterns
| in data and understand its properties. They suggest modeling strategies and help to debug
| analyses. We DON'T use exploratory graphs to communicate results.

...

|======= | 8%
| Instead, exploratory graphs are the initial step in an investigation, the "quick and
| dirty" tool used to point the data scientist in a fruitful direction. A scientist might
| need to make a lot of exploratory graphs in order to develop a personal understanding of
| the problem being studied. Plot details such as axes, legends, color and size are cleaned
| up later to convey more information in an aesthetically pleasing way.

...

|======== | 9%
| To demonstrate these ideas, we've copied some data for you from the U.S. Environmental
| Protection Agency (EPA) which sets national ambient air quality standards for outdoor air
| pollution. These Standards say that for fine particle pollution (PM2.5), the "annual mean,
| averaged over 3 years" cannot exceed 12 micro grams per cubic meter. We stored the data
| from the U.S. EPA web site in the data frame pollution. Use the R function head to see the
| first few entries of pollution.

head(pollution)
pm25 fips region longitude latitude
1 9.771185 01003 east -87.74826 30.59278
2 9.993817 01027 east -85.84286 33.26581
3 10.688618 01033 east -87.72596 34.73148
4 11.337424 01049 east -85.79892 34.45913
5 12.119764 01055 east -86.03212 34.01860
6 10.827805 01069 east -85.35039 31.18973

| You nailed it! Good job!

|========= | 11%
| We see right away that there's at least one county exceeding the EPA's standard of 12
| micrograms per cubic meter. What else do we see?

...

|========== | 12%
| We see 5 columns of data. The pollution count is in the first column labeled pm25. We'll
| work mostly with that. The other 4 columns are a fips code indicating the state (first 2
| digits) and county (last 3 digits) with that count, the associated region (east or west),
| and the longitude and latitude of the area. Now run the R command dim with pollution as an
| argument to see how long the table is.

dim(pollution)
[1] 576 5

| Great job!

|=========== | 13%
| So there are 576 entries in pollution. We'd like to investigate the question "Are there
| any counties in the U.S. that exceed that national standard (12 micro grams per cubic
| meter) for fine particle pollution?" We'll look at several one dimensional summaries of
| the data to investigate this question.

...

|============ | 15%
| The first technique uses the R command summary, a 5-number summary which returns 6
| numbers. Run it now with the pm25 column of pollution as its argument. Recall that the
| construct for this is pollution$pm25.

summary(pollution$pm25)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.383 8.549 10.047 9.836 11.356 18.441

| You are doing so well!

|============= | 16%
| This shows us basic info about the pm25 data, namely its Minimum (0 percentile) and
| Maximum (100 percentile) values, and three Quartiles of the data. These last indicate the
| pollution measures at which 25%, 50%, and 75% of the counties fall below. In addition to
| these 5 numbers we see the Mean or average measure of particulate pollution across the 576
| counties.

...

|============== | 17%
| Half the measured counties have a pollution level less than or equal to what number of
| micrograms per cubic meter?

1: 10.050
2: 9.836
3: 8.549
4: 11.360

Selection: 1

| You're the best!

|=============== | 19%
| To save you a lot of typing we've saved off pollution$pm25 for you in the variable ppm.
| You can use ppm now in place of the longer expression. Try it now as the argument of the R
| command quantile. See how the results look a lot like the results of the output of the
| summary command.

quantile(ppm)
0% 25% 50% 75% 100%
3.382626 8.548799 10.046697 11.356012 18.440731

| All that hard work is paying off!

1: the maximum value
2: the median
3: the minimum value
4: the mean

Selection: 4

| Excellent work!

|================== | 21%
| Now we'll plot a picture, specifically a boxplot. Run the R command boxplot with ppm as an
| input. Also specify the color parameter col equal to "blue".

boxplot(ppm,col="blue")

graph

| That's a job well done!

|=================== | 23%
| The boxplot shows us the same quartile data that summary and quantile did. The lower and
| upper edges of the blue box respectively show the values of the 25% and 75% quantiles.

...

|==================== | 24%
| What do you think the horizontal line inside the box represents?

1: the maximum value
2: the mean
3: the minimum value
4: the median

Selection: 4

| Nice work!

|===================== | 25%
| The "whiskers" of the box (the vertical lines extending above and below the box) relate to
| the range parameter of boxplot, which we let default to the value 1.5 used by R. The
| height of the box is the interquartile range, the difference between the 75th and 25th
| quantiles. In this case that difference is 2.8. The whiskers are drawn to be a length of
| range2.8 or 1.52.8. This shows us roughly how many, if any, data points are outliers,
| that is, beyond this range of values.

...

|====================== | 27%
| Note that boxplot is part of R's base plotting package. A nice feature that this package
| provides is its ability to overlay features. That is, you can add to (annotate) an
| existing plot.

...

|======================= | 28%
| To see this, run the R command abline with the argument h equal to 12. Recall that 12 is
| the EPA standard for air quality.

abline(h=12)

graph

| That's a job well done!

|======================== | 29%
| What do you think this command did?

1: drew a horizontal line at 12
2: hid 12 random data points
3: drew a vertical line at 12
4: nothing

Selection: 1

| Keep up the great work!

|========================= | 31%
| So abline "adds one or more straight lines through the current plot." We see from the plot
| that the bulk of the measured counties comply with the standard since they fall under the
| line marking that standard.

...

|=========================== | 32%
| Now use the R command hist (another function from the base package) with the argument ppm.
| Specify the color parameter col equal to "green". This will plot a histogram of the data.

hist(ppm,col="green")

graph

| You nailed it! Good job!

|============================ | 33%
| The histogram gives us a little more detailed information about our data, specifically the
| distribution of the pollution counts, or how many counties fall into each bucket of
| measurements.

...

|============================= | 35%
| What are the most frequent pollution counts?

1: between 9 and 12
2: between 12 and 14
3: between 6 and 8
4: under 5

Selection: 1

| You're the best!

|============================== | 36%
| Now run the R command rug with the argument ppm.

rug(ppm)

graph

| That's correct!

|=============================== | 37%
| This one-dimensional plot, with its grayscale representation, gives you a little more
| detailed information about how many data points are in each bucket and where they lie
| within the bucket. It shows (through density of tick marks) that the greatest
| concentration of counties has between 9 and 12 micrograms per cubic meter just as the
| histogram did.

...

|================================ | 39%
| To illustrate this a little more, we've defined for you two vectors, high and low,
| containing pollution data of high (greater than 15) and low (less than 5) values
| respectively. Look at low now and see how it relates to the output of rug.

low
[1] 3.494351 4.186090 4.917140 4.504539 4.793644 4.601408 4.195688 4.625279 4.460193
[10] 4.978397 4.324736 4.175901 3.382626 4.132739 4.955570 4.565808

| All that hard work is paying off!

|================================= | 40%
| It confirms that there are two data points between 3 and 4 and many between 4 and 5. Now
| look at high.

high
[1] 16.19452 15.80378 18.44073 16.66180 15.01573 17.42905 16.25190 16.18358

| Excellent job!

|================================== | 41%
| Again, we see one data point greater than 18, one between 17 and 18, several between 16
| and 17 and two between 15 and 16, verifying what rug indicated.

...

|=================================== | 43%
| Now rerun hist with 3 arguments, ppm as its first, col equal to "green", and the argument
| breaks equal to 100.

hist(ppm,col="green",breaks=100)

graph

| All that practice is paying off!

|===================================== | 44%
| What do you think the breaks argument specifies in this case?

1: the number of data points to graph
2: the number of counties exceeding the EPA standard
3: the number of buckets to split the data into
4: the number of stars in the sky

Selection: 3

| You are amazing!

|====================================== | 45%
| So this histogram with more buckets is not nearly as smooth as the preceding one. In fact,
| it's a little too noisy to see the distribution clearly. When you're plotting histograms
| you might have to experiment with the argument breaks to get a good idea of your data's
| distribution. For fun now, rerun the R command rug with the argument ppm.

rug(ppm)

graph

| That's a job well done!

|======================================= | 47%
| See how rug works with the existing plot? It automatically adjusted its pocket size to
| that of the last plot plotted.

...

|======================================== | 48%
| Now rerun hist with ppm as the data and col equal to "green".

hist(ppm,col="green")

graph

| Great job!

|========================================= | 49%
| Now run the command abline with the argument v equal to 12 and the argument lwd equal to
| 2.

abline(v=12,lwd=2)

graph

| You are doing so well!

|========================================== | 51%
| See the vertical line at 12? Not very visible, is it, even though you specified a line
| width of 2? Run abline with the argument v equal to median(ppm), the argument col equal to
| "magenta", and the argument lwd equal to 4.

abline(v=median(ppm),col="magenta",lwd=4)

graph

| You are quite good my friend!

|=========================================== | 52%
| Better, right? Thicker and more of a contrast in color. This shows that although the
| median (50%) is below the standard, there are a fair number of counties in the U.S that
| have pollution levels higher than the standard.

...

|============================================ | 53%
| Now recall that our pollution data had 5 columns of information. So far we've only looked
| at the pm25 column. We can also look at other information. To remind yourself what's there
| run the R command names with pollution as the argument.

names(pollution)
[1] "pm25" "fips" "region" "longitude" "latitude"

| Keep up the great work!

|============================================= | 55%
| Longitude and latitude don't sound interesting, and each fips is unique since it
| identifies states (first 2 digits) and counties (last 3 digits). Let's look at the region
| column to see what's there. Run the R command table on this column. Use the construct
| pollution$region. Store the result in the variable reg.

reg<-table(pollution$region)

| All that practice is paying off!

|============================================== | 56%
| Look at reg now.

reg

east west
442 134

| Nice work!

|================================================ | 57%
| Lot more counties in the east than west. We'll use the R command barplot (another type of
| one-dimensional summary) to plot this information. Call barplot with reg as its first
| argument, the argument col equal to "wheat", and the argument main equal to the string
| "Number of Counties in Each Region".

barplot(reg,col="wheat",main="Number of Counties in Each Region")

graph

| You are quite good my friend!

|================================================= | 59%
| What do you think the argument main specifies?

1: the y axis label
2: the title of the graph
3: the x axis label
4: I can't tell

Selection: 2

| You are doing so well!

|================================================== | 60%
| So we've seen several examples of one-dimensional graphs that summarize data. Two
| dimensional graphs include scatterplots, multiple graphs which we'll see more examples of,
| and overlayed one-dimensional plots which the R packages such as lattice and ggplot2
| provide.

...

|=================================================== | 61%
| Some graphs have more than two-dimensions. These include overlayed or multiple
| two-dimensional plots and spinning plots. Some three-dimensional plots are tricky to
| understand so have limited applications. We'll see some examples now of more complicated
| graphs, in particular, we'll show two graphs together.

...

|==================================================== | 63%
| First we'll show how R, in one line and using base plotting, can display multiple
| boxplots. We simply specify that we want to see the pollution data as a function of
| region. We know that our pollution data characterized each of the 576 entries as belonging
| to one of two regions (east and west).

...

|===================================================== | 64%
| We use the R formula y ~ x to show that y (in this case pm25) depends on x (region). Since
| both come from the same data frame (pollution) we can specify a data argument set equal to
| pollution. By doing this, we don't have to type pollution$pm25 (or ppm) and | pollution$region. We can just specify the formula pm25~region. Call boxplot now with this
| formula as its argument, data equal to pollution, and col equal to "red".

boxplot(pm25~region,data=pollution,col="red")

graph

| Perseverance, that's the answer.

|====================================================== | 65%
| Two for the price of one! Similarly we can plot multiple histograms in one plot, though to
| do this we have to use more than one R command. First we have to set up the plot window
| with the R command par which specifies how we want to lay out the plots, say one above the
| other. We also use par to specify margins, a 4-long vector which indicates the number of
| lines for the bottom, left, top and right. Type the R command
| par(mfrow=c(2,1),mar=c(4,4,2,1)) now. Don't expect to see any new result.

par(mfrow=c(2,1),mar=c(4,4,2,1))

| That's a job well done!

|======================================================= | 67%
| So we set up the plot window for two rows and one column with the mfrow argument. The mar
| argument set up the margins. Before we plot the histograms let's explore the R command
| subset which, not surprisingly, "returns subsets of vectors, matrices or data frames which
| meet conditions". We'll use subset to pull off the data we want to plot. Call subset now
| with pollution as its first argument and a boolean expression testing region for equality
| with the string "east". Put the result in the variable east.

east<-subset(pollution,region=="east")

| Keep working like that and you'll get there!

|======================================================== | 68%
| Use head to look at the first few entries of east.

head(east)
pm25 fips region longitude latitude
1 9.771185 01003 east -87.74826 30.59278
2 9.993817 01027 east -85.84286 33.26581
3 10.688618 01033 east -87.72596 34.73148
4 11.337424 01049 east -85.79892 34.45913
5 12.119764 01055 east -86.03212 34.01860
6 10.827805 01069 east -85.35039 31.18973

| Excellent work!

|========================================================== | 69%
| So east holds more information than we need. We just want to plot a histogram with the
| pm25 portion. Call hist now with the pm25 portion of east as its first argument and col
| equal to "green" as its second.

hist(east$pm25,col="green")

graph

| You got it!

|=========================================================== | 71%
| See? The command par told R we were going to have one column with 2 rows, so it placed
| this histogram in the top position.

...

|============================================================ | 72%
| Now, here's a challenge for you. Plot the histogram of the counties from the west using
| just one R command. Let the appropriate subset command (with the pm25 portion specified)
| be the first argument and col (equal to "green") the second. To cut down on your typing,
| use the up arrow key to get your last command and replace "east" with the subset command.
| Make sure the boolean argument checks for equality between region and "west".

hist(subset(pollution,region=="west")$pm25,col="green")

graph

| You are really on a roll!

|============================================================= | 73%
| See how R does all the labeling for you? Notice that the titles are different since we
| used different commands for the two plots. Let's look at some scatter plots now.

...

|============================================================== | 75%
| Scatter plots are two-dimensional plots which show the relationship between two variables,
| usually x and y. Let's look at a scatterplot showing the relationship between latitude and
| the pm25 data. We'll use plot, a function from R's base plotting package.

...

|=============================================================== | 76%
| We've seen that we can use a function call as an argument when calling another function.
| We'll do this again when we call plot with the arguments latitude and pm25 which are both
| from our data frame pollution. We'll call plot from inside the R command with which
| evaluates "an R expression in an environment constructed from data". We'll use pollution
| as the first argument to with and the call to plot as the second. This allows us to avoid
| typing "pollution$" before the arguments to plot, so it saves us some typing and adds to
| your base of R knowledge. Try this now.

with(pollution,plot(latitude,pm25))

graph

| You nailed it! Good job!

|================================================================ | 77%
| Note that the first argument is plotted along the x-axis and the second along the y. Now
| use abline to add a horizontal line at 12. Use two additional arguments, lwd equal to 2
| and lty also equal to 2. See what happens.

abline(h=12,lwd=2,lty=2)

graph

| That's correct!

|================================================================= | 79%
| See how lty=2 made the line dashed? Now let's replot the scatterplot. This time, instead
| of using with, call plot directly with 3 arguments. The first 2 are pollution$latitude and | ppm. The third argument, col, we'll use to add color and more information to our plot. Set | this argument (col) equal to pollution$region and see what happens.

plot(pollution$latitude,ppm,col=pollution$region)

graph

| Perseverance, that's the answer.

|================================================================== | 80%
| We've got two colors on the map to distinguish between counties in the east and those in
| the west. Can we figure out which color is east and which west? See that the high (greater
| than 50) and low (less than 25) latitudes are both red. Latitudes indicate distance from
| the equator, so which half of the U.S. (east or west) has counties at the extreme north
| and south?

1: east
2: west

Selection: 2

| That's a job well done!

|==================================================================== | 81%
| As before, use abline to add a horizontal line at 12. Use two additional arguments, lwd
| equal to 2 and lty also equal to 2.

abline(h=12,lwd=2,lty=2)

graph

| You are amazing!

|===================================================================== | 83%
| We see many counties are above the healthy standard set by the EPA, but it's hard to tell
| overall, which region, east or west, is worse.

...

|====================================================================== | 84%
| Let's plot two scatterplots distinguished by region.

...

|======================================================================= | 85%
| As we did with multiple histograms, we first have to set up the plot window with the R
| command par. This time, let's plot the scatterplots side by side (one row and two
| columns). We also need to use different margins. Type the R command par(mfrow = c(1, 2),
| mar = c(5, 4, 2, 1)) now. Don't expect to see any new result.

par(mfrow = c(1, 2),mar = c(5, 4, 2, 1))

| You are quite good my friend!

|======================================================================== | 87%
| For the first scatterplot, on the left, we'll plot the latitudes and pm25 counts from the
| west. We already pulled out the information for the counties in the east. Let's now get
| the information for the counties from the west. Create the variable west by using the
| subset command with pollution as the first argument and the appropriate boolean as the
| second.

west<-subset(pollution,region="west")

| Keep trying! Or, type info() for more options.

| Type west <- subset(pollution,region=="west") at the command prompt.

west<-subset(pollution,region=="west")

| That's the answer I was looking for.

|========================================================================= | 88%
| Now call plot with three arguments. These are west$latitude (x-axis), west$pm25 (y-axis),
| and the argument main equal to the string "West" (title). Do this now.

plot(west$latitude,west$pm25,main="West")

graph

| You are really on a roll!

|========================================================================== | 89%
| For the second scatterplot, on the right, we'll plot the latitudes and pm25 counts from
| the east.

...

|=========================================================================== | 91%
| As before, use the up arrow key and change the 3 "West" strings to "East".

plot(east$latitude,east$pm25,main="East")

graph

| You're the best!

|============================================================================ | 92%
| See how R took care of all the details for you? Nice, right? It looks like there are more
| dirty counties in the east but the extreme dirt (greater than 15) is in the west.

...

|============================================================================= | 93%
| Let's summarize and review.

...

|=============================================================================== | 95%
| Which of the following characterizes exploratory plots?

1: quick and dead
2: slow and clean
3: quick and dirty
4: slow and steady

Selection: 3

| That's the answer I was looking for.

|================================================================================ | 96%
| True or false? Plots let you summarize the data (usually graphically) and highlight any
| broad features

1: False
2: True

Selection: 2

| You are amazing!

|================================================================================= | 97%
| Which of the following do plots NOT do?

1: Explore basic questions and hypotheses (and perhaps rule them out)
2: Conclude that you are ALWAYS right
3: Suggest modeling strategies for the "next step"
4: Summarize the data (usually graphically) and highlight any broad features

Selection: 2

| That's a job well done!

|================================================================================== | 99%
| Congrats! You've concluded exploring this lesson on graphics. We hope you didn't find it
| too quick or dirty.

...

|===================================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: Yes
2: No

Selection: 2

| Excellent job!

| You've reached the end of this lesson! Returning to the main menu...

| Please choose a course, or type 0 to exit swirl.

1: Exploratory Data Analysis
2: Take me to the swirl course repository!

Selection: 0

| Leaving swirl now. Type swirl() to resume.

rm(list=ls())

Last updated 2020-10-02 00:50:50.719002 IST

Comments