I have been working on a project in which I try to predict the exact price, or a price interval, of a service. The data consist of the indices of the options that customers click (e.g. the customer clicks the 7th option, meaning the customer wants option 7, and so on). However, the data are highly imperfect. The algorithms I have been running for months indicate that the distinguishing attributes of the options, the ones that specify the key features of the service, are buried in dark corners of the database, and the IT guys are reluctant to retrieve them, claiming it is almost impossible. So I am cut off from a large share of the important data.

My problem starts here. Since the distinguishing data are not available, I don’t have much to predict from. Imagine trying to predict the price of a car when all you know is “I want a car with a steering wheel and 4 doors”. There are 146 categories of jobs, and I was able to extract features for only 3 of them. If I had perfect data, my predictions would likely yield a MAPE of at most 5%, which is still well above the maximum error rate for these 3 categories. For the rest of the categories, however, the MAPE is around 65% at best. Here are my approaches:

• I used the categories as classes.

• I clustered the input (options) to see if there are any links within each input cluster.

• I generated output clusters (price intervals) and tried to derive what specifications each price range has.

I constructed a data matrix in which rows are demands and columns are option indices, so each cell is a binary flag indicating whether that option was selected for that demand. The matrix is highly sparse. Here are the approaches I tried on it:

• Tried to find the cluster to which a new service belongs. I then computed the distance from the new service to the other services in that cluster, tried to create optimal distance ranges, and tried to predict the price using each distance.

• Gave up on the distance ranges and tried to predict the price using each

• Tried to estimate the weight of each option on the price within each cluster.

• Tried to create association rules.

• Fuzzified the distances and then predicted the price.
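To make the setup above concrete, here is a minimal Python sketch of the sparse binary demand/option representation and a distance-based price prediction. It is only an illustration under stated assumptions: the toy `demands` data, the choice of Jaccard distance, the median of the `k` nearest demands, and the names `jaccard_distance` and `predict_price` are all hypothetical and not taken from the actual project. Each row of the binary matrix is stored as the set of selected option indices, which is a compact way to hold a highly sparse 0/1 row.

```python
from statistics import median

# Hypothetical toy data: each demand is (set of clicked option indices, price).
# A set of indices is a sparse encoding of one binary row of the data matrix.
demands = [
    ({1, 7, 12}, 100.0),
    ({1, 7, 13}, 110.0),
    ({2, 7, 12}, 120.0),
    ({3, 40, 41}, 900.0),
    ({3, 40, 42}, 950.0),
]


def jaccard_distance(a, b):
    """Return 1 - |intersection| / |union| for two sets of option indices."""
    union = a | b
    if not union:
        # Two empty selections are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def predict_price(options, history, k=3):
    """Predict a price as the median price of the k nearest past demands."""
    ranked = sorted(history, key=lambda row: jaccard_distance(options, row[0]))
    nearest = ranked[:k]
    return median(price for _, price in nearest)


if __name__ == "__main__":
    # A new demand that shares options 1 and 7 with the cheaper services.
    print(predict_price({1, 7, 14}, demands))
```

With binary rows this sparse, a set-overlap distance such as Jaccard ignores the many shared zeros, whereas Euclidean distance on the full matrix would be dominated by them. Whether that helps on the real data is an open question.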

Here are the methods I have used for prediction and optimization:

• As might be expected, several variations of neural networks

• Random Forests

• First hierarchical clustering, then K-Means or Self-Organizing Maps, tuned by the Silhouette and C indices

• Simulated Annealing/Particle Swarm Optimization and 5 other metaheuristics favoring local or global search

• Apriori and fuzzy APACS
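As a rough illustration of the association-rule idea in the list above, the sketch below computes support and confidence for rules of the form "option → price band". This is only the first level of an Apriori-style search, not the full Apriori algorithm and not fuzzy APACS. The toy `transactions`, the band labels, the thresholds, and the function name `option_band_rules` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical transactions: (set of clicked option indices, price-band label).
transactions = [
    ({1, 7}, "low"),
    ({1, 7, 9}, "low"),
    ({1, 8}, "high"),
    ({7, 9}, "low"),
    ({8, 9}, "high"),
]


def option_band_rules(rows, min_support=0.2, min_confidence=0.8):
    """Return sorted (option, band, support, confidence) rules above both thresholds."""
    n = len(rows)
    option_counts = Counter()  # how often each option is clicked
    pair_counts = Counter()    # how often each option co-occurs with each band
    for options, band in rows:
        for opt in options:
            option_counts[opt] += 1
            pair_counts[(opt, band)] += 1

    rules = []
    for (opt, band), count in pair_counts.items():
        support = count / n                      # P(option and band)
        confidence = count / option_counts[opt]  # P(band | option)
        if support >= min_support and confidence >= min_confidence:
            rules.append((opt, band, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    for opt, band, support, confidence in option_band_rules(transactions):
        print(f"option {opt} -> {band}: support={support:.2f}, confidence={confidence:.2f}")
```

If no single option reaches a useful confidence for any price band, that would be consistent with the options alone carrying little price information.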

To be specific, there is only one constraint: the predicted price interval must contain the actual price (no exceptions). I work with two objectives. First, the predicted exact price should have the minimum error rate. Second, the price interval should be as narrow as possible while remaining reasonable (I cannot tell a customer that the service will cost 50 to 5000 dollars, etc.).
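The constraint and the two objectives can be written down as three small evaluation functions. The sketch below is one possible formulation, not necessarily the exact measures used in the project: MAPE for the point prediction, the fraction of actual prices that fall inside the predicted interval (the constraint requires this to be 1.0), and the mean interval width. The function names and the toy numbers are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error of the point predictions, in percent."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def interval_coverage(actual, lower, upper):
    """Fraction of actual prices inside [lower, upper]; the constraint needs 1.0."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return hits / len(actual)


def mean_interval_width(lower, upper):
    """Average width of the predicted price intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


if __name__ == "__main__":
    # Hypothetical prices and predictions for three services.
    actual = [100.0, 200.0, 400.0]
    predicted = [110.0, 180.0, 400.0]
    lower = [90.0, 150.0, 410.0]
    upper = [130.0, 220.0, 500.0]

    print("MAPE (%):", mape(actual, predicted))
    print("coverage:", interval_coverage(actual, lower, upper))
    print("mean width:", mean_interval_width(lower, upper))
```

Reporting coverage and width side by side makes the trade-off explicit: widening every interval always restores full coverage, but at the cost of the second objective.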

I have tried a total of 27 variations/combinations of algorithms, none of which helped. The cluster features are almost the same across clusters, with some exceptions. What more can I try? Where might I be going wrong?

asked May 13 '14 at 08:08

Ayça Altay

edited May 13 '14 at 08:11
