(This is slightly off-topic for this site, but I suspect it will be interesting to many people who follow the QA forums.)

There are many steps required to build and deploy a system with a significant machine learning or data mining component. I am interested in how much time is spent on each step in developing real systems, and in the community's opinion of the importance of the various steps. To study this question I am conducting a short survey (5-10 minutes) of the community's experience with developing real systems. The survey can be found at http://www.surveymonkey.com/s/39YCVRX

Thank you in advance to anyone who can spare a few minutes to take the survey.

Update 2012/04/09: Thank you to everyone who participated in the survey. The results have (finally!) been published as "A Study of the Importance of and Time Spent on Different Modeling Steps" in SIGKDD Explorations.

Sincerely,
Art Munson
In my experience the most difficult part of any modeling exercise is finding the right features. Extracting features is an art, and in the real world the data is never what you think it is.

One of the most significant steps in determining success, and one considerably more subtle than most of the mathematical steps, is the user experience design. You can make or break any modeling effort here with less effort than anything you can do with math or algorithms. The same algorithm and the same results can be completely, hopelessly bad or fantastically good depending on presentation.

Take, for instance, the framing of the results. Suppose you build a web page classifier that is 50% correct in putting web pages into categories and present the results to users as a complete and correct classification of all of the web. The nearly guaranteed result is that users will complain bitterly that you didn't produce very good results. Now take the same engine and show the results as "you might possibly like these web pages" in the context of pages recently viewed. At this point the user's expectations will be lower, and the result is likely to be a "glass half full" appraisal of the results.

One recent result that I saw involved an analysis of a search engine. A graph of click-through rate versus search result position showed a very strong drop at position 11, due to users' reluctance to click through to a second page of results. In such a situation, simply increasing the number of results per page to 25 is likely to nearly double the number of relevant results found by users with no change in the algorithm. I don't know of many search algorithm changes that could double the quality of search results, much less double the rate at which users benefit from them. (A sketch of this kind of positional analysis appears at the end of this answer.)

Interestingly, neither feature extraction nor presentation receives much focus in machine learning research. Much research is devoted to the iota of difference you might find between the results of a logistic regression and an SVM, but little has to do with how to extract useful information from real data. The KDD contests and the recent trend toward very wide, sparse datasets are a refreshing change in this regard.
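To make the click-through example concrete, here is a minimal sketch of that kind of positional analysis in Python. The log format (one (position, was_clicked) pair per impression) and the toy numbers are assumptions for illustration only; they are not the data from the analysis described above.

    # Compute click-through rate by result position from a simplified impression log.
    from collections import defaultdict

    def ctr_by_position(impressions):
        """impressions: iterable of (position, was_clicked) pairs."""
        shown = defaultdict(int)
        clicked = defaultdict(int)
        for position, was_clicked in impressions:
            shown[position] += 1
            if was_clicked:
                clicked[position] += 1
        return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}

    if __name__ == "__main__":
        import random
        random.seed(0)
        # Toy impressions: positions 1-10 sit on page one; 11+ require a click to page two,
        # so their simulated click probability drops sharply.
        toy_log = [(pos, random.random() < (0.30 / pos if pos <= 10 else 0.01))
                   for _ in range(10_000)
                   for pos in range(1, 21)]
        for pos, rate in ctr_by_position(toy_log).items():
            print(f"position {pos:2d}: CTR = {rate:.3f}")

Plotting the resulting rates against position is what exposes the cliff at the page boundary; the analysis itself is just counting.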
quite good suggestion!
(Sep 06 '11 at 23:27)
lpvoid
Please note that I posted this same survey in the spring of 2010 to a couple mailing lists, and the results from the roughly 20 responses were very interesting. This time I hope to get enough responses to be confident in the results and communicate them to the community. There is no need to complete the survey a second time if you responded in the spring.
This looks interesting. I would be curious to hear any high-level conclusions from the survey results.
I'm okay with this as long as you post an answer summarizing your results.
@Joseph: I will be sure to post a summary of the results, in answer to this question.
The link http://www.surveymonkey.com/s/39YCVRX is no longer valid!