I'm on a panel at Data Week, and the topic is: "Data science: People, algorithms, or data?"
Your thoughts?
Well, The Unreasonable Effectiveness of Data argues that tons of data with simple algorithms beats fancy algorithms without lots of data. And at best, great data scientists will only produce great algorithms (admittedly, they might also produce great insights into the data). So I'd rank data > data scientists > algorithms, because having great data removes the need for great data scientists or great algorithms, and data scientists can create the algorithms when they need to. Professionally, I've had some success improving Google local search by applying simple algorithms to huge data sets.
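As a toy illustration of the simple-algorithm-plus-data point (not from the post above), here is a sketch in the spirit of Norvig's well-known frequency-based spelling corrector: the "model" is nothing but word counts from a corpus, so its quality depends almost entirely on how much text you feed it. The corpus file name corpus.txt is a placeholder assumption.

```python
# Toy sketch of "simple algorithm + lots of data": a frequency-based
# spelling corrector in the spirit of Norvig's well-known example.
# "corpus.txt" is a placeholder; any large plain-text file works.
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies learned purely from data -- this table IS the model.
WORD_COUNTS = Counter(words(open("corpus.txt").read()))

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Pick the candidate the corpus has seen most often."""
    candidates = ({word} & WORD_COUNTS.keys()) or \
                 (edits1(word) & WORD_COUNTS.keys()) or {word}
    return max(candidates, key=WORD_COUNTS.get)

print(correct("speling"))  # -> "spelling", if the corpus is large enough
```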
My order would be data scientists > algorithms > data. The basis of my ranking is a wonderful article on the Metamarkets blog; it is a well-articulated rebuttal of Norvig's unreasonable-effectiveness-of-data thesis. I believe that good scientists craft good algorithms that can run even on insufficient data.
This is a tough question, and if the three are considered orthogonal, I would have to say DATA > PEOPLE > ALGORITHMS. People will probably argue that "people" are what take a bunch of data and turn it into gold, but there are plenty of companies that do not collect the right data, or do not collect it in the right form (say, aggregates instead of the raw data, where the raw data is more powerful), such that no matter how much of a rockstar someone is, they can't do anything with it. Once you have rich "data", it is up to the "people" to make sense of it, and you need good people. These people implement simple or complex "algorithms" that provide great insights. I list "algorithms" last because, if treated as independent of the other two, all we have are canned implementations that don't help much. Realistically, it is something like: DATA > PEOPLE * ALGORITHMS
People definitely get #1. Why? They multiply all the other factors. Good people will lead to good data (feature generation, including other data sources, intuition) and will be able to code or adapt existing algorithms. Also, much more importantly: people will be able to tell which algorithms to use and how to model things. So the tough question is whether algorithms or data come second. In my opinion, it depends on the problem. :) We know that very simple models can solve simple problems extremely well, often better than complicated models. This is not only due to overfitting, but also due to simple models being easier to work with: you can train them much faster, and you don't need as much experience with them to get good results. You will never be able to model text at the character level with a simple model like a Markov chain. But you will also have a much tougher time training an RNN instead of a Markov chain to model something as simple as Newtonian dynamics. And in that last case, a Kalman filter with (people-generated!) parameters will win anyway.
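To make that last point concrete, here is a minimal sketch (not from the post above) of a 1-D constant-velocity Kalman filter whose dynamics model and noise parameters are simply chosen by a person rather than learned; the specific noise values below are illustrative assumptions.

```python
# Minimal sketch: a 1-D constant-velocity Kalman filter whose structure and
# noise parameters are hand-chosen ("people-generated"), not learned.
# The noise magnitudes below are illustrative assumptions, not tuned values.
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # Newtonian dynamics: position += velocity*dt
H = np.array([[1.0, 0.0]])              # we only observe position
Q = np.diag([1e-4, 1e-4])               # process noise (hand-picked)
R = np.array([[0.25]])                  # measurement noise (hand-picked)

x = np.array([[0.0], [0.0]])            # state estimate [position, velocity]
P = np.eye(2)                           # estimate covariance

rng = np.random.default_rng(0)
true_pos, true_vel = 0.0, 1.0
for step in range(50):
    # Simulate the true system and a noisy position measurement.
    true_pos += true_vel * dt
    z = np.array([[true_pos + rng.normal(scale=0.5)]])

    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q

    # Update.
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P

print("estimated position/velocity:", x.ravel(), "true:", (true_pos, true_vel))
```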
http://gigaom.com/data/forget-your-fancy-data-science-try-overkill-analytics/ This recent article argues that simply throwing more computing power at the problem is the magic formula.