A lot has been written about the skills needed to be a data scientist. Not only should you be able to do these standard things:
- Wrangle data (get, transform, persist)
- Model (explore, explain and predict)
- Take action (visualize, summarize, prototype)
…but I would argue that you should also be able to start with a bare machine (or cluster) and bootstrap a scalable analysis infrastructure in short order. This does not mean you need to be able to administer a 1000-node Hadoop cluster, but you should be able to set up a small cluster that can turn terabytes of log data into something that has business value.
If you work for a big company, it is easy to fall into the habit of using whatever infrastructure is available. Your IT department may have set up a Hadoop cluster, there may be pre-configured databases, and there are probably a lot of nice productivity tools that make it easier to analyze data at work. It makes perfect sense for companies to provide these conveniences, and they probably make your job easier. But it is also easy to get too cozy with this toolchain and come to rely on it.
In this series of posts I am going to talk about the analysis stack on my personal computers that helps me do all of the above without relying on an employer's infrastructure:
- R (and RStudio)
- MySQL
- Hadoop (Scala, Cascading, Scalding, Scoobi)
- …
It took me a while to get this set up, but my goal is to be able to start from scratch and install a complete, working data science setup in 6 hours or less.