I'm processing Wikipedia into a semantic model, and there are lots of different options to play with: do I weight terms based on part of speech? Do I put an extra logarithm around tf-idf values? Where do I cut off small articles? Where do I cut off word-article links? It's a high-dimensional feature space, and there are going to be lots of dependencies between features (e.g., the more aggressively I prune articles, the smaller the window in which I'll be looking for term tf-idf drop-offs).

Two additional difficulties: building a model is a multi-day process, and while I can put together test cases, they're less rigorous than I'd like (semantics is a bit hard to quantify: I learn a lot more by looking closely at why the model comes to its results than by just checking how closely its classifications match a set of human-classified documents).

What do other people do in cases like this, where model building is expensive and human evaluation is likely superior to anything automated? Optimize one dimension at a time? Do an ad-hoc beam search (sketched below)? Use some automated optimization technique? If you use something automated to select feature sets, is there a way to encode your human intuitions (e.g., if you turn feature X on, you're going to want to turn features Y and Z off)?
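To make the kind of search I'm imagining concrete, here is a rough sketch of an ad-hoc beam search over build configurations. Everything in it is made up for illustration: `build_model` and `score_model` stand in for my actual multi-day build and my (partly manual) evaluation, and the option names and the `is_sensible` constraint are just examples of how I'd like to encode intuitions like "if X is on, turn Y and Z off".

```python
def build_model(cfg):
    """Stand-in for the real multi-day Wikipedia build."""
    return cfg  # placeholder

def score_model(model):
    """Stand-in for my evaluation (automated test cases plus manual inspection)."""
    return 0.0  # placeholder

# The knobs the pipeline exposes, with the values I'd consider for each.
OPTIONS = {
    "pos_weighting":      [False, True],      # weight terms by part of speech?
    "log_tfidf":          [False, True],      # extra logarithm around tf-idf?
    "min_article_tokens": [50, 100, 250],     # where to cut off small articles
    "min_link_tfidf":     [0.01, 0.05, 0.1],  # where to cut off word-article links
}

def is_sensible(cfg):
    """Human intuitions as hard constraints on the search space (made-up example)."""
    # e.g. aggressive article pruning only seems worth a build with log-scaled tf-idf
    if cfg["min_article_tokens"] >= 250 and not cfg["log_tfidf"]:
        return False
    return True

def beam_search(start, beam_width=3, steps=2):
    """Greedy beam search: change one option at a time, keep the best few configs."""
    beam = [(score_model(build_model(start)), start)]
    for _ in range(steps):
        candidates = list(beam)
        for _, cfg in beam:
            for opt, values in OPTIONS.items():
                for value in values:
                    if value == cfg[opt]:
                        continue
                    new_cfg = {**cfg, opt: value}
                    if is_sensible(new_cfg):
                        candidates.append((score_model(build_model(new_cfg)), new_cfg))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam

best = beam_search({"pos_weighting": False, "log_tfidf": True,
                    "min_article_tokens": 100, "min_link_tfidf": 0.05})
```

The obvious problem is that every `score_model(build_model(...))` call in that loop is a multi-day build followed by an evaluation I only half trust, so I can't afford to run anything like this naively; hence the question.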
There is no general solution to this problem. The best you can do is first to gain as much prior knowledge as possible by looking at what others have done in similar settings. Then I would divide the process into macro-optimization and micro-optimization.

By the micro-level, I mean feature weighting, normalization, balancing, hyper-parameter tuning, and so on. You can attack these issues with heuristic search using some form of cross-validation or development data. These settings are generally important, but they are not transferable to other problems or to other choices at the macro-level, so you need to tune them for your specific data and model. If this tuning takes too much time, you could consider doing it on a subset of the data. Some things, such as tf-idf vs. log(tf-idf), could simply be fixed and the remaining choices tuned.

By the macro-level, I mean the selection of the model family, the feature types to include, the annotation scheme, the evaluation criteria, and so on. This is the interesting part, and it is where the real improvements come from. For each choice you make at this level, you generally have to redo the tuning at the micro-level.

The best strategy for improving the system, I think, is to form a hypothesis at the macro-level, do the tuning at the micro-level, train the full model, and then check the results both automatically and manually. Only by actually looking at the data and the errors the system makes can you learn something and form new hypotheses. Even by looking at the data alone, you can form hypotheses about what the model should look like, what features should be included, and so on, but there are always interactions between these choices that you cannot foresee.
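As a rough sketch of what the micro-level tuning on a subset might look like, assuming you can set up a small labelled proxy task, something like the following lets you cross-validate the cheap choices before committing to a full multi-day build. The toy data, the scikit-learn pipeline, and the parameter grid here are all placeholders, not recommendations of specific values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# A tiny labelled proxy task standing in for a subset of the real data,
# e.g. a few thousand articles with human category labels.
docs = [
    "anarchism is a political philosophy",
    "the cat sat on the mat",
    "political movements and the philosophy of anarchism",
    "cats and dogs are common household pets",
    "a political philosophy of anarchism and its movements",
    "the dogs chased the cats onto the mat",
]
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Micro-level choices expressed as a grid: raw vs log-scaled term frequency,
# where to cut off rare terms, and whether to length-normalise the vectors.
param_grid = {
    "tfidf__sublinear_tf": [False, True],  # tf vs 1 + log(tf)
    "tfidf__min_df": [1, 2],               # rare-term cut-off
    "tfidf__norm": ["l2", None],           # vector normalisation
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

The settings that win here become the defaults for the next full build; the macro-level choices (model family, feature types, annotation scheme, evaluation criteria) still have to be judged with full builds and the kind of manual error analysis described above.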