Jesse S.A. Bridgewater

User Centric Design with Big Data

A word cloud where the x and y axes mean something

Posted by bridgewater on April 18, 2012
Posted in: Uncategorized. Tagged: ggplot2, plyr, rstats, wordcloud. 17 Comments
[Figure: latest version of the word cloud, a word cloud with meaningful spatial properties.]

OK, so I have now done two iterations on a better way to visualize term frequencies using R, ggplot2 and plyr. The first was OK but ugly; the second was better but still ugly.

How to read it:

  • Frequency is segmented into 20% quantile bins (quintiles)
  • Frequency is on the y axis
  • Word size is proportional to frequency
  • Words with similar frequency are in approximately alphabetical order from left to right
  • Color is still random (this could be better)

This one is now good enough that I will start using it in my own presentations and announce my retirement from the prestigious and highly-paid world of word-cloud improvement!

Here’s the code.

library(languageR)
library(plyr)   # for ddply/summarize below

# get english word freq data
data(english)
df <- english[,c("Word","WrittenFrequency")]
df <- df[sample.int(NROW(df),200),]
df <- unique(df)
df$freq <- df$WrittenFrequency/sum(df$WrittenFrequency)
qtiles <- quantile(df$freq, seq(0,1,.2))
twotiles <- quantile(df$WrittenFrequency, seq(0,1,10/NROW(df)))
qdf <- data.frame(cut = qtiles, quantile = as.numeric(sub("%", "", names(qtiles))))
df$qtilerange <- cut(df$freq,breaks=qtiles,labels=F)
df$twotiles <- as.factor(cut(df$WrittenFrequency,breaks=twotiles,labels=F))
df$quantile <- qdf[(df$qtilerange+1),"quantile"]
df$quantilecut <- qdf[df$qtilerange,"cut"]
df <- df[order(df$quantile),]
df$quantile <- as.factor(df$quantile)
df$quantile <- reorder(df$quantile,NROW(df):1)
df$WordColor <- factor(sample.int(4,NROW(df),replace=T))
df <- df[!is.na(df$quantile),]

ticks <- ddply(df,c("quantile"),summarize,ticks=quantile(WrittenFrequency,c(.2,.8)))$ticks
ticks <- round(unique(c(max(df$WrittenFrequency),ticks)),2)

df <- ddply(df,c("twotiles"),summarize,
            Word=sort(Word),
            WordColor=WordColor, 
            WrittenFrequency=WrittenFrequency, 
            quantile=quantile,
            x=seq(-min(WrittenFrequency)/mean(WrittenFrequency),max(WrittenFrequency)/mean(WrittenFrequency),length.out=length(WrittenFrequency))
            )

library(ggplot2)
# y axis: WrittenFrequency (log data in this example)
# facet label: frequency quantile; word size scales with frequency
# x axis: just spreads the words out within each quantile band (labels are hidden)
p <- ggplot(df,aes(x=x,y=WrittenFrequency))
p <- p + geom_text(aes(label=Word,size=WrittenFrequency,color=WordColor),family="Courier",fontface="bold")
p <- p + opts(axis.text.x=theme_blank(), axis.title.x=theme_blank(),panel.grid.major=theme_blank()) 
p <- p + scale_y_continuous(breaks=ticks)
p <- p +  facet_grid(quantile~.,scales="free_y",space="free",labeller = label_both)
p + opts(strip.text.y = theme_text(angle = 0, size = 15, hjust = 0.5, vjust = 0.5),
         axis.text.y = theme_text(angle = 0, size = 15, hjust = 0.5, vjust = 0.5),
         axis.title.y = theme_blank(),
         legend.text=theme_blank(),legend.position = "none",
         title="Word Frequency")

Word cloud alternatives

Posted by bridgewater on April 16, 2012
Posted in: Uncategorized. Tagged: ggplot2, rstats, wordcloud. 6 Comments
[Figure: an alternative to word clouds. This is an attempt to make word clouds more quantitative; it still needs more work to be an aesthetic competitor to the classic word cloud.]

Here is an alternative to word clouds that makes it easier to get insights, but also has some of the aesthetic appeal of the traditional word cloud.
My first attempt at this looked pretty bad, and this one is not much better, but hopefully someone else will help improve it.

library(languageR)
library(plyr)   # for ddply/summarize below
library(zoo)    # for rollapply below

# get english word freq data
data(english)
df <- english[,c("Word","WrittenFrequency")]
df <- df[sample.int(NROW(df),500),]
df <- unique(df)
df$freq <- df$WrittenFrequency/sum(df$WrittenFrequency)
qtiles <- quantile(df$freq, seq(0,1,.2))
qdf <- data.frame(cut = qtiles, quantile = as.numeric(sub("%", "", names(qtiles))))
df$qtilerange <- cut(df$freq,breaks=qtiles,labels=F)
df$quantile <- qdf[(df$qtilerange+1),"quantile"]
df$quantilecut <- qdf[df$qtilerange,"cut"]
df <- df[order(df$quantile),]
df$quantile <- as.factor(df$quantile)
df$quantile <- reorder(df$quantile,NROW(df):1)
df$WordColor <- factor(sample.int(5,NROW(df),replace=T))
df <- df[!is.na(df$quantile),]

ticks <- ddply(df,c("quantile"),summarize,ticks=quantile(WrittenFrequency,c(.2,.8)))$ticks
ticks <- round(unique(c(max(df$WrittenFrequency),ticks)),2)

rollfun <- function(x) {
  numb <- 10
  scale <- mean(x)/(max(x)-min(x)+0.01)
  tmp <- rnorm(1,0,scale)
  tmp <- ifelse( tmp < -numb, -numb,tmp)
  tmp <- ifelse( tmp > numb, numb,tmp)
  tmp
}

roll <- rollapply(df$WrittenFrequency,5,rollfun)
df$x <- c(roll,roll[1:(NROW(df)-NROW(roll))] )

library(ggplot2)
# y axis: WrittenFrequency (log data in this example)
# facet label: frequency quantile; word size scales with frequency
# x axis: random jitter to spread the words out within each band (labels are hidden)
p <- ggplot(df,aes(x=x,y=WrittenFrequency))
p <- p + geom_text(aes(label=Word,size=sqrt(WrittenFrequency),color=WordColor),family="Courier",fontface="bold")
p <- p + opts(axis.text.x=theme_blank(), axis.title.x=theme_blank(),panel.grid.major=theme_blank()) 
p <- p + scale_y_continuous(breaks=ticks)
p <- p +  facet_grid(quantile~.,scales="free",space="free",labeller = label_both)
p + opts(strip.text.y = theme_text(angle = 0, size = 15, hjust = 0.5, vjust = 0.5),
         axis.text.y = theme_text(angle = 0, size = 15, hjust = 0.5, vjust = 0.5),
         axis.title.y = theme_blank(),
         legend.text=theme_blank(),legend.position = "none",
         title="Word Frequency")

Stop squinting at word clouds in the hope of getting insights

Posted by bridgewater on April 11, 2012
Posted in: Uncategorized. Tagged: rstats, text, visualization. 6 Comments
Someone recently asked on Twitter about people's preferences for word cloud generators in R.
I replied that I thought the "null" word cloud generator was best. By this I mean that I think the word cloud is a bad visualization method.
Why? Here is one article with a good perspective, but you can search for examples and see what insights you can get from word clouds; I think they usually obscure the insights. If you are trying to understand raw text then you really want to do proper text mining rather than just count word frequencies. And if you only want to look at term frequencies, the word cloud is a very fuzzy way to go about it.

So the natural followup question is how to plot phrase/word frequency data.

Here is an example of the kind of thing that I usually do.  This is only for raw term frequency data (you will need to tabulate it yourself first, which is easy).  For real text mining analysis you can always use packages from the CRAN Task View. 
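If you are starting from raw text rather than a frequency table, the tabulation step really is just a few lines. Here is a minimal sketch (my own toy example, not part of the original post; the cleaning is deliberately crude) that produces a data.frame shaped like the one used below:

# toy corpus; in practice this would come from readLines() or similar
docs <- c("the cat sat on the mat", "the dog sat on the log")
words <- unlist(strsplit(tolower(docs), "[^a-z]+"))  # crude tokenization
words <- words[words != ""]
tab <- sort(table(words), decreasing = TRUE)         # raw term frequencies
df <- data.frame(Word = names(tab), WrittenFrequency = as.numeric(tab))
head(df)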

library(languageR)

# get english word freq data
data(english)
df <- english[,c("Word","WrittenFrequency")]
#reorder by freq for plotting
df <- df[order(-df$WrittenFrequency),]
df$Word <- reorder(df$Word,1:NROW(df))
#get the top 75 words
df <- head(df,75)

library(ggplot2)

# y axis: frequency label
# x axis: frequency scale (log data in this example)
# facet label: the word itself
p <- ggplot(df,aes(x=WrittenFrequency,y=WrittenFrequency))
p <- p + geom_point(size=5)
p + facet_grid(Word~.,scales="free") +  opts(strip.text.y = theme_text(),axis.title.y= theme_blank())

There are lots of things you can do to make it fancier and prettier.  Does anyone have something better?

Stupid R tricks: using outer to create many data.frame subsets

Posted by bridgewater on February 11, 2012
Posted in: data, Programming, visualization. Tagged: rstats. 5 Comments

Selecting subsets of a data.frame is easy in R if you define the predicates manually.
But if you need to define many conditions, the standard slicing and subsetting methods
become cumbersome.

For this illustration I want to pick some large number of numerical ranges and label
all of the rows that match any of the predicates.

The key is using outer to test every element against every predicate at once, and then checking whether any of the predicates was satisfied.
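For intuition, here is a toy run (mine, not from the original example) showing the intermediate logical matrix that outer builds: one row per element of x, one column per range, and a row containing any TRUE means that element falls inside at least one range.

x     <- c(0.5, 3, 7)
lows  <- c(0, 5)
highs <- c(1, 6)
# 3 x 2 logical matrix: element [i, j] asks "is x[i] inside range j?"
outer(x, lows, ">") & outer(x, highs, "<")
# only the first row contains a TRUE, because 0.5 is the only value inside a range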

peaks <- pi*c(0,2,4,6,8,10)
low <- peaks - pi/4
high <- peaks + pi/4
ranges <- data.frame(low=low,high=high)

x<- seq(0,10*pi,0.01)
y<- cos(x)
df <- data.frame(x=x,y=y)

# given a vector x
# which elements are contained in one of the ranges
# defined by the high and low columns of the ranges data.frame
library(plyr)
inranges <- function(x, ranges)
{
  above <- outer(x, ranges$low, ">")
  below <- outer(x, ranges$high, "<")
  hits  <- above & below
  aaply(hits, 1, any)
}

# I can now add a new column that indicates which rows matched
df$peaks <- inranges(df$x, ranges)

library(ggplot2)
p <- ggplot(df,aes(x=x,y=y))
p <- p + geom_point(aes(color=peaks))
p

#or I can subset the data to only the matching rows:

df.peaks <- subset(df,inranges(x,ranges))

p <- ggplot(df.peaks,aes(x=x,y=y))
p <- p + geom_point()
p

Data scientists: don’t let your company’s infrastructure make you soft

Posted by bridgewater on January 27, 2012
Posted in: data, GNU/Linux, hadoop. 46 Comments

There has been a lot written about the skills needed to be a Data Scientist.  Not only should you be able to do these standard things:

  1. Wrangle data (get, transform, persist)
  2. Model (explore, explain and predict)
  3. Take action (visualize, summarize, prototype)

…but I would argue that you should also be able to start with a bare machine (or cluster) and bootstrap a scalable infrastructure for analysis in short order. This does not mean you need to be able to administer a 1000-node Hadoop cluster, but you should be able to set up a small cluster that can process TBs of log data into something that has business value.

For people who work for a big company it is easy to fall into the habit of using whatever infrastructure is available. Your IT department may have set up a Hadoop cluster, there may be pre-configured databases, and there are probably a lot of nice productivity tools that make it easier to analyze data at work. It makes perfect sense for companies to provide these conveniences and it probably makes your job easier. But it is also easy to get too cozy with this tool chain and come to rely on it.

In this series of posts I am going to talk about the analysis stack on my personal computers that helps me do those things.

  1. R (and RStudio)
  2. MySQL
  3. Hadoop (Scala, Cascading, scalding, scoobi)
  4. …

It took me a while to get this set up but I have a goal of being able to start from scratch and install a complete working data science setup in 6 hours or less.

Simple enough to comprehend?

Posted by bridgewater on October 10, 2011
Posted in: politics. Tagged: Politics. Leave a comment

The Tea Party, Occupy Wall Street and many other movements (that are not about human rights) share the same problem: in order to gain a big following they have to have very, very simple ideas at their core. Charles Stross wrote a nice blog post about a totally different topic, but I am going to shamelessly quote him out of context.

I think these ideas are mostly delusional because they rely on a fundamental misapprehension about the world around us — namely that we live in a society that can be made simple enough to comprehend.
Stross: insufficient data

Governing gets harder as the world gets more complex, because our ability and desire to understand complexity are not growing exponentially.

Using transparency for data count intuition

Posted by bridgewater on September 27, 2011
Posted in: data, Programming, visualization. Tagged: analysis, rstats. 1 Comment

This is an illustration of representing point count in a graphic using transparency. This is easy to do in ggplot2 if you use one of the bar-chart-style geoms. However, I think there are other situations where it would be useful to apply aesthetics based on point count.

Since Hadley did a lot of his canonical examples using this data (the diamonds dataset), I thought it would be helpful for comparing and contrasting.

This chart shows the distribution of price/carat for diamonds, segmented by carat quartile and clarity. The transparency shows how many diamonds each bar represents, which makes it easy to see where the action is.

 library(ggplot2)
# create copy of diamonds
 df <- diamonds
# compute the quartiles of carat
 df$carat.qtiles <- cut(df$carat,unlist(quantile(df$carat)),include.lowest=T)
# plot the probability distribution of price/carat, faceted by clarity and carat quartile
 # key point: using the count per bar to set the alpha level. This lets you see how much
 # data is represented by each bar (it would be nice to be able to do this
 # anytime an aggregate is done...boxplots, bins, etc.)
 p <- ggplot(data=df, aes(x=price/carat,y=..count../sum(..count..)))
 p <- p + geom_histogram(aes(alpha=..count..),binwidth=1000) +facet_grid(clarity~carat.qtiles)
 p

Currently in ggplot2 this method only works where the ..count..-style output variables are available. There are a number of areas that could benefit from this capability, and it should be easy to add more output variables to the ggplot elements for which this behavior would be natural.

  1. geom_boxplot: geoms that aggregate multiple points are good candidates for this
  2. facet_*: it would be interesting to be able to add a visual cue to each facet to show how many points are in each.
    1. The most appealing idea on this so far is to enable scaling of the facet area by point count (or other things).
    2. Ordering of the facets by point count would also be extremely useful (a rough sketch of this follows the list).
    3. Thresholding by count. This would be great to easily chop low-signal facets and keep the visualization clean.
    4. Other half-baked ideas include background color, alpha box border…
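Some of these can be approximated today with ordinary data manipulation rather than new ggplot2 features. For example, here is a rough sketch (my own workaround, not a built-in capability) of ordering facets by point count, done by reordering the factor before plotting:

library(ggplot2)
df <- diamonds
# put the clarity levels with the most diamonds first by reordering the factor
df$clarity <- reorder(df$clarity, df$clarity, FUN = length)
df$clarity <- factor(df$clarity, levels = rev(levels(df$clarity)))
p <- ggplot(df, aes(x = price/carat))
p <- p + geom_histogram(binwidth = 1000)
p <- p + facet_grid(clarity ~ .)
p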

Your metrics are broken

Posted by bridgewater on September 14, 2011
Posted in: data. Tagged: analysis, goals, metrics, progress. Leave a comment

How do I know? Well, it is simple: almost everyone evaluates situations in the world using metrics that do not represent their goals with high fidelity.

For example, Peter Thiel is a great businessman and a smart guy, but like everyone else, his metrics are broken. His thesis is that “innovation is dead”.

“If you look outside the computer and the internet, there has been 40 years of stagnation,” said Thiel, who pointed to one of his favorite examples: the dearth of innovation in transportation. “We are no longer moving faster,” Thiel noted. Transportation speeds, which accelerated across history, peaked with the debut of the Concorde in 1976. One decade after 9/11, Thiel says, we are back to the travel speeds of the 1960s.

http://www.forbes.com/sites/nicoleperlroth/2011/09/12/paypal-founders-innovation-is-dead/

Is going faster and faster a good measure of progress? Is there a point where transportation is fast enough? It is clear that there is no technological barrier to having faster planes, but society has made it clear that it does not care to invest in that area to gain that extra speed. Maybe another metric, like passenger-miles per joule of energy used, is more relevant. Or maybe it is bad too.

Progress != Growth:

Most people associate progress with growth, but GDP growth by itself is not a good long-term goal because it cannot go on forever. If growth is not sustainable then we should not go after it past a certain point.  I do not know the right metric to tell how sustainable a unit of GDP growth is, but I do know that a sustainability component is required to fix the metric.

Why this matters a lot

Creating metrics that reflect your goals (as a person, company, country, …) is important because people and organizations optimize their activity to metrics. If you are a politician who is judged by whether GDP goes up, you will pursue policies that try to increase GDP. If you are a public company that is judged by short-term earnings growth then you will put a lot of energy into optimizing that.

Fixing metrics is simple but hard

Fixing metrics is very hard in practice but it is conceptually simple because the reason for broken metrics is usually easy to identify.

Top three reasons why most metrics are broken:

  1. The metric is venerable.  It used to make sense but the world changed and it is not hi-fi anymore.
  2. The metric is too simple.  The world is complicated and goals are similarly complex. Simple metrics usually leave out important factors. People like simple metrics so they get popular and gain momentum.
  3. The metric looks for keys under the lamp post… rather than down the street in the dark where you dropped them. This is related to being too simple, but complex metrics can also have this failing. Some goals are hard to represent with high-fidelity metrics, but that does not stop people from creating metrics to measure them. Those metrics are usually chosen for convenience rather than fidelity. An imperfect metric is fine as long as people are aware of the problems and use the metric accordingly.

Even after you figure out that your metrics are broken, it is really hard to fix them. Building a hi-fi metric requires real insight into the world, and that is always a challenge. You may even conclude in some cases that there is no simple collection of metrics for a given goal. But fixing your metrics (or your understanding of your metrics) is crucial, because failure follows a bad metric around diligently.

map-reduce == assembly

Posted by bridgewater on August 17, 2011
Posted in: data, hadoop, Programming. 2 Comments

Map-reduce is great. It has made it possible to process insane amounts of data on commodity hardware. However, it is a very low-level programming abstraction, too low for most of the problems that analysts and “data scientists” encounter.

M-R is the assembly programming of big data. It is vital as the base level of the stack. But just as assembly is unproductive for general programming compared to Python, Ruby or <your-favorite-high-level-language>, M-R is too low level for doing significant analysis work.

Pig and Cascading (and other languages that build on top of M-R) provide language constructs that match what analysts actually need to do (a rough R sketch of these operations follows the list):

  • load complex data
  • join multiple data sets
  • filter rows
  • project out columns
  • aggregate based on columns
  • apply functions to aggregates
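
For concreteness, here is roughly what that list looks like in plain R on toy data. This is just an analogy for the level of abstraction Pig and Cascading provide; it is not code that runs on Hadoop, and the data here is made up.

# load: in real life these would come from read.csv(), a database, or HDFS
users  <- data.frame(id = 1:4, country = c("US", "US", "DE", "FR"))
visits <- data.frame(id = c(1, 1, 2, 3, 4, 4), pages = c(3, 5, 2, 7, 1, 6))

joined    <- merge(users, visits, by = "id")          # join multiple data sets
filtered  <- subset(joined, pages > 1)                # filter rows
projected <- filtered[, c("country", "pages")]        # project out columns
agg <- aggregate(pages ~ country, projected, sum)     # aggregate based on columns
agg$log_pages <- log(agg$pages)                       # apply a function to the aggregate
agg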

Very few non-trivial analysis problems map effortlessly onto the map-reduce model. Most problems will require many M-R stages, which can make for brittle code that is hard to maintain. It might seem like you are saving effort by keeping the stack simple and using raw M-R or streaming through Python, but productivity will usually suffer.

Getting to know multivariate data

Posted by bridgewater on July 25, 2011
Posted in: data, visualization. Tagged: analysis, rstats. 4 Comments
[Figure: psych::pairs.panels and corrgram::corrgram using the mtcars data]

Core Ideas:

  • multivariate modeling is challenging
  • pair plots make it easy to get a quick understanding of each variable and the relationships between them

Multivariate analysis and modeling can be really challenging. Getting the job done well requires you to know your data really well. People often use the metaphor that you know something well if you “know it like the back of your hand”. However, we look at our hands every day but probably do not recall where each freckle or wrinkle is. You want to know your data in a much more detailed way than that.

One very valuable first step when working with a new multivariate data set is to look at the relationships between each pair of variables. There are a number of ways to do this in R and I often prefer to use two different scatter plot matrix methods to get a feel for the relationships between the variables.

Here is an example using the mtcars dataset in R.

df <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]

Scenario(s):

  1. getting to know your numerical data
  2. predictive modeling (feature selection, technique choice,…)

psych::pairs.panels

why use it?

  1. you can see points with an ellipse superimposed in the lower region
  2. you can see the data distribution on the diagonal for each variable
  3. you can see the correlation values in the upper region
  4. works with categorical data
library(psych) 
pairs.panels(df)

[Figure: psych::pairs.panels on mtcars]

corrgram::corrgram

why use it?

  1. pie chart in the lower region gives a quick visual view of correlations
  2. min/max values of each variable on the diagonal
  3. correlation confidence intervals in the upper region (in parens below the value)

gotchas:

  1. only works with numerical variables
library(corrgram) 
corrgram(df,lower.panel=panel.pie, upper.panel=panel.conf, diag.panel=panel.minmax)

[Figure: corrgram::corrgram on mtcars]

Based on these plots it is easy to see some important high-level relationships between the variables (a quick numeric check with cor() is sketched after the list).

  1. mpg is strongly inversely correlated with:
    1. cyl: number of cylinders
    2. disp: engine displacement
    3. hp: horsepower
    4. wt: vehicle weight
  2. mpg is positively correlated with:
    1. drat: rear axle ratio
    2. qsec: time to drive a 1/4 mile
  3. rear axle ratio and weight do not have a strong relationship with the 1/4-mile time. This means that if you want to predict the 1/4-mile time, you would not want to use these as unconditional predictors. In fact it might cause you to start looking for interactions between the variables so you can do conditional modeling.
  4. rear axle ratio is inversely correlated with wt, hp, disp and cyl. I know nothing about cars, but now I know that heavier, more powerful cars tend to have a smaller rear axle ratio.
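
If you want a quick numeric check on those readings, the raw correlation matrix is one line of standard R (nothing here depends on either package):

round(cor(df), 2)                       # pairwise Pearson correlations
round(cor(df, method = "spearman"), 2)  # rank-based version, more robust to nonlinearity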

The plots also pack in a lot of great basic summary info:

  1. A distribution plot for each variable
  2. The min and max of each variable

This still only provides a very superficial understanding of the data, but this is a good start. There are lots of different options and ways to use both packages, so you can adapt how you use these functions for your own style and preferences.
