Selecting subsets of a data.frame is easy in R if you only need a handful of manually defined predicates.
But if you need to apply many conditions, the standard slicing and subsetting methods become unwieldy.
For this illustration I want to pick a large number of numerical ranges and label
all of the rows that match any of them.
The key is using outer to test every value against every predicate and then checking whether any of them was satisfied.
library(plyr)

# centers of the ranges (here: the peaks of cos(x))
peaks <- pi*c(0, 2, 4, 6, 8, 10)
low <- peaks - pi/4
high <- peaks + pi/4
ranges <- data.frame(low = low, high = high)

# example data (assumed for this illustration; any x/y vectors will do)
x <- seq(0, 11*pi, by = 0.05)
y <- cos(x)
df <- data.frame(x = x, y = y)

# given a vector x, which elements are contained in one of the ranges
# defined by the low and high columns of the ranges data.frame?
inranges <- function(x, ranges) {
  a <- outer(x, ranges$low, ">=")   # is x[i] at or above each low?
  b <- outer(x, ranges$high, "<=")  # is x[i] at or below each high?
  hits <- a & b
  aaply(hits, 1, any)               # TRUE if any range matched
}
# I can now add a new column that indicates which rows matched
df$peaks <- inranges(df$x, ranges)
p <- ggplot(df, aes(x = x, y = y))
p <- p + geom_point(aes(color = peaks))
# or I can subset the data to only the matching rows:
df.peaks <- subset(df, inranges(x, ranges))
p <- ggplot(df.peaks, aes(x = x, y = y))
p <- p + geom_point()
This is an illustration of representing point count in a graphic using transparency. This is easy to do in ggplot2 with the bar-chart family of geoms, but I think there are other situations where applying aesthetics based on point count would be useful.
Since Hadley did many of his canonical examples with the diamonds data set, I thought it would be helpful for comparing and contrasting.
This chart shows the distribution of the price/carat of diamonds, segmented by carat quartile and clarity. The transparency shows how many diamonds each bar represents, which makes it easy to see where the action is.
library(ggplot2)

# create a copy of diamonds
df <- diamonds
# compute the quartiles of carat
df$carat.qtiles <- cut(df$carat, unlist(quantile(df$carat)), include.lowest = TRUE)
# plot the probability distribution of price/carat, faceted by clarity and carat quartile.
# key point: using the count per bar to set the alpha level. This lets you see how much
# data is represented by each bar (it would be nice to be able to do this
# anytime an aggregate is done...boxplots, bins, etc.)
p <- ggplot(data = df, aes(x = price/carat, y = ..count../sum(..count..)))
p <- p + geom_histogram(aes(alpha = ..count..), binwidth = 1000) + facet_grid(clarity ~ carat.qtiles)
Currently in ggplot2 this method only works where the ..output.. variables related to count are available. There are a number of areas that could benefit from this capability, and it should also be straightforward to add more output variables to the elements of ggplot for which this behavior would be natural.
- geom_boxplot: geoms that aggregate multiple points are good candidates for this
- facet_*: It would be interesting to be able to add a visual cue to each facet to show how many points are in each.
- The most appealing idea on this so far is to enable scaling of the facet area by point count (or other things).
- Ordering of the facets by point count would also be extremely useful.
- Thresholding by count. This would be great to easily chop low-signal facets and keep the visualization clean.
- Other half-baked ideas include background color, alpha box border…
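A couple of the ideas above (ordering facets by point count, and thresholding low-count facets) can already be approximated with a little data preparation. This is a minimal sketch against the diamonds data; the 5000-diamond cutoff is an arbitrary choice for illustration:

```r
library(ggplot2)
library(plyr)

# count points per clarity level
counts <- ddply(diamonds, .(clarity), summarise, n = length(clarity))

# order the facets by point count by reordering the factor levels
diamonds$clarity <- factor(diamonds$clarity,
                           levels = counts$clarity[order(counts$n, decreasing = TRUE)])

# threshold: keep only clarity levels with at least 5000 diamonds
keep <- counts$clarity[counts$n >= 5000]
df.big <- subset(diamonds, clarity %in% keep)

p <- ggplot(df.big, aes(x = price/carat)) +
  geom_histogram(binwidth = 1000) +
  facet_wrap(~clarity)
```

Reordering the factor levels is what controls facet order in ggplot2, so the plotting code itself does not change.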
- psych::pairs.panels and corrgram::corrgram using the mtcars data
- multivariate modeling is challenging
- pair plots make it easy to get a quick understanding of each variable and the relationships between them
Multivariate analysis and modeling can be really challenging. Getting the job done well requires you to know your data really well. People often use the metaphor that you know something well if you “know it like the back of your hand”. However, we look at our hands every day but probably do not recall the details of where each freckle or wrinkle is. You want to know your data in a much more detailed way.
One very valuable first step when working with a new multivariate data set is to look at the relationships between each pair of variables. There are a number of ways to do this in R and I often prefer to use two different scatter plot matrix methods to get a feel for the relationships between the variables.
Here is an example using the mtcars dataset in R.
- getting to know your numerical data
- predictive modeling (feature selection, technique choice,…)
why use psych::pairs.panels?
- you can see points with an ellipse superimposed in the lower region
- you can see the data distribution on the diagonal for each variable
- you can see the correlation values in the upper region
- works with categorical data
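The plot described above comes from a single call; this is a minimal sketch (pairs.panels draws the ellipses, diagonal distributions, and correlation values by default):

```r
library(psych)

# scatter plots with ellipses below the diagonal, distributions on it,
# and correlation values above it
pairs.panels(mtcars)
```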
why use corrgram::corrgram?
- pie chart in the lower region gives a quick visual view of correlations
- min/max values of each variable on the diagonal
- correlation confidence intervals in the upper region (in parens below the value)
- only works with numerical variables
library(corrgram)
corrgram(mtcars, lower.panel = panel.pie, upper.panel = panel.conf, diag.panel = panel.minmax)
Based on these plots it is easy to see some important high-level relationships between the variables.
- mpg is strongly negatively correlated with:
- cyl: number of cylinders
- disp: engine displacement
- hp: horsepower
- wt: vehicle weight
- mpg is positively correlated with:
- drat: rear axle ratio
- qsec: time to drive a quarter mile
- Rear axle ratio and weight do not have a strong relationship with the quarter-mile time. This means that if you want to predict quarter-mile time, you would not want to use these as unconditional predictors. In fact, it might prompt you to start looking for interactions between the variables so you can do conditional modeling.
- Rear axle ratio is negatively correlated with wt, hp, disp, and cyl. I know nothing about cars, but now I know that heavier, more powerful cars tend to have a smaller rear axle ratio.
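These visual impressions can be checked numerically with nothing but base R; a quick sketch:

```r
# correlation of every mtcars variable with mpg, rounded for readability
data(mtcars)
mpg.cor <- round(cor(mtcars)[, "mpg"], 2)
print(sort(mpg.cor))
```

The strong negative correlations with cyl, disp, hp, and wt, and the weaker positive ones with drat and qsec, line up with what the plots suggest.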
There is also a lot of great basic summary info here:
- A distribution plot for each variable
- The min and max of each variable
This still provides only a superficial understanding of the data, but it is a good start. There are lots of different options and ways to use both packages, so you can adapt how you use these functions to your own style and preferences.
I’ve been a big fan of ggplot2 for a long time, but plyr has been in my toolkit for less than a year and it is already one of my most-used R packages. It is how aggregate/*apply would have been if they were awesome.
In just a few lines, this code computes the cumulative distribution function of every variable in the iris data set and creates a colored, faceted plot to visualize the data.
library(plyr); library(reshape2); library(ggplot2)

# cumulative distribution of iris data, in 0.1% increments
probs <- seq(0, 1, by = 0.001)
# compute quantiles by species for each variable
qtiles <- ddply(iris, .(Species), function(d)
  data.frame(pct = probs, numcolwise(quantile)(d, probs = probs)))
# melt the data.frame for easy ggplot faceting on variable
m <- melt(qtiles, id.vars = c("Species", "pct"))
# plot using color for species and facets for variables
p <- ggplot(m, aes(x = value, y = pct))
p + geom_point(aes(color = factor(Species))) + facet_wrap(~variable)