Data scientists: don’t let your company’s infrastructure make you soft

Posted on January 27, 2012. Filed under: data, GNU/Linux, hadoop

A lot has been written about the skills needed to be a data scientist. Not only should you be able to do these standard things:

  1. Wrangle data (get, transform, persist)
  2. Model (explore, explain and predict)
  3. Take action (visualize, summarize, prototype)

…but I would argue that you should also be able to start with a bare machine (or cluster) and bootstrap a scalable infrastructure for analysis in short order. This does not mean you need to be able to administer a 1000-node Hadoop cluster, but you should be able to set up a small cluster that can process terabytes of log data into something that has business value.
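To make the three standard tasks listed above a little more concrete, here is a minimal sketch of wrangle, model and take action on a single machine, using only the Python standard library. Everything in it is an assumption made for illustration: the sample data is made up, the table name `traffic` and the function names are hypothetical, and SQLite stands in for whatever database you actually use. It is not the stack described in this post, just a small picture of the workflow.

```python
import csv
import io
import sqlite3
from statistics import mean

# Hypothetical sample data (made up for illustration): daily visits and signups.
RAW_CSV = """day,visits,signups
1,100,10
2,200,20
3,300,30
4,400,40
"""


def wrangle(raw_text, conn):
    """Get, transform, persist: parse CSV rows, cast to int, store in SQLite."""
    rows = [
        (int(r["day"]), int(r["visits"]), int(r["signups"]))
        for r in csv.DictReader(io.StringIO(raw_text))
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS traffic "
        "(day INTEGER, visits INTEGER, signups INTEGER)"
    )
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def model(conn):
    """Explain and predict: least-squares fit of signups on visits."""
    data = conn.execute("SELECT visits, signups FROM traffic").fetchall()
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def summarize(slope, intercept, visits):
    """Take action: turn the fit into a one-line summary someone can use."""
    predicted = slope * visits + intercept
    return f"At {visits} visits we would expect about {predicted:.0f} signups."


conn = sqlite3.connect(":memory:")  # in-memory database, nothing to install
n_rows = wrangle(RAW_CSV, conn)
slope, intercept = model(conn)
report = summarize(slope, intercept, 500)
print(report)
```

The point of the sketch is only that each step can be done end to end on a bare machine; at real scale the same three steps would be spread across a database and a cluster.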

If you work for a big company, it is easy to fall into the habit of using whatever infrastructure is available. Your IT department may have set up a Hadoop cluster, there may be pre-configured databases, and there are probably a lot of nice productivity tools that make it easier to analyze data at work. It makes perfect sense for companies to provide these conveniences, and they probably make your job easier. But it is also easy to get too cozy with this toolchain and come to rely on it.

In this series of posts I am going to talk about the analysis stack on my personal computers that helps me do those things:

  1. R (and RStudio)
  2. MySQL
  3. Hadoop (Scala, Cascading, Scalding, Scoobi)

It took me a while to get this set up, but my goal is to be able to start from scratch and install a complete working data science setup in 6 hours or less.



