|
Hi, I have the following data:

1) Web pages ranked by traffic. I don't have the absolute traffic, but I know whether a given page ranks 1st, 2nd, etc. in terms of total page views.
2) A list of search strings and their relative popularity. Again, I don't know how often they are used, but I know, for example, that the search term 'Ford' is searched for more frequently than 'clutch'.
3) For each page, its ranking for any particular search string. So I know, for example, that Page A would be the first result for 'Ford' and Page B would be result number 174 for the same search string.
4) A few other signals, both numerical and categorical, e.g. page age, page classified by topic, etc.

I am trying to find the combination of search strings a page should cater to in order to rank as high as possible in terms of total traffic. The results have to be human readable so I can understand them and act upon them.

I can scatter plot search terms by frequency of use against number of results returned. Intuitively, the search terms which are used often and return fewer results are the best to target, as there is a higher probability that my page is seen among the limited results.

The problem is that all my ranking data is relative, and I know that the shape of the search term frequency curve is exponential, i.e. the most frequent search term is probably used 10x more often than the 5th one, which is used 10x more often than the 10th one. Also, some search terms will return 10,000 results and others will return 100. This makes it very difficult to estimate the shape of the optimum region of frequency vs. number of returned results.

I don't want to predict position or anything like that. I just want to know what to use to get the maximum boost. For example, I'd be very happy to know which combinations of terms are more frequent in the first quartile of pages than in the others.

What method would you use to analyse this? (I use RapidMiner as my tool of choice.) Thank you so much!
I read a ton of stuff and my brain is so overloaded I don't even know where to start... PS: No, this is not data from Google, although I am sure thousands of researchers have tried to estimate which signals Google uses with similar techniques. |
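One way to make the "used often, few results" intuition concrete with rank-only data is to score each term in log space. The sketch below is only an illustration under loud assumptions: usage is taken to drop roughly 10x every 5 popularity ranks (a reading of the guess in the question, not a measured value), frequency and result count are given equal weight, and the terms, ranks, and result counts are made up. `target_score` and `ranks_per_decade` are hypothetical names.

```python
import math

def log10_usage(freq_rank, ranks_per_decade=5.0):
    """Estimated log10 of relative usage for a 1-based popularity rank.

    ASSUMPTION: usage falls 10x every `ranks_per_decade` ranks.
    """
    return -(freq_rank - 1) / ranks_per_decade

def target_score(freq_rank, result_count, ranks_per_decade=5.0):
    """Higher is better: popular terms that return few results.

    ASSUMPTION: exposure is proportional to usage / result_count,
    so the score is log10(usage) - log10(result_count).
    """
    return log10_usage(freq_rank, ranks_per_decade) - math.log10(result_count)

# Made-up example data: (term, popularity rank, number of results returned).
terms = [
    ("ford", 1, 10_000),
    ("clutch", 3, 100),
    ("gasket", 12, 100),
]

ranked = sorted(terms, key=lambda t: target_score(t[1], t[2]), reverse=True)
for term, freq_rank, result_count in ranked:
    print(term, round(target_score(freq_rank, result_count), 2))
```

Working in log space keeps the score usable far down the long tail, where the estimated usage itself would underflow toward zero. Changing `ranks_per_decade` shows how sensitive the ordering is to the assumed shape of the curve.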
|
Is the set of all possible search strings fixed? In other words, you have, say, 5,000 possible things you can search for, and you are trying to predict which is the best to use for any given page. If this is the case, I might be able to give you some advice. If not (and you have an infinite number of search strings), you are going to need an obscenely massive amount of data.
Yes, the possible strings I wish to optimize for are limited. I have about 150,000, but the distribution is very long-tailed. The first few thousand must cover 90%+ of all searches, so those are the ones I care about. Even just 1,000 would be a good start.
(Apr 11 '14 at 23:38)
Louis Tremblay
Would a collaborative filtering approach work? If you assume that a page's rank is similar to a user rating (with each search string playing the role of a user), then it is not that different from any recommendation engine problem (where you try to determine what movie to recommend to a user, for instance). The only problem with this approach is that it sounds like you are actually trying to manipulate the outcome of the searches themselves. Collaborative filtering could help you predict which of the existing searches will yield the highest rank, but not how to increase that ranking specifically. At the very least, though, if weighted by traffic, it could provide a list of the most valuable search strings at the moment.
(Apr 12 '14 at 00:51)
Daniel E Margolis
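The last point, a traffic-weighted list of the most valuable search strings, can be sketched without a full recommender. This is not collaborative filtering itself, just a simple aggregation, and the reciprocal-rank weights are an assumption chosen for illustration. `string_values` and the sample data are hypothetical, apart from Page A being first and Page B being 174th for 'ford', which follow the example in the question.

```python
from collections import defaultdict

def string_values(traffic_rank, positions):
    """Score each search string by the pages it surfaces.

    traffic_rank: {page: 1-based traffic rank}
    positions:    {(page, search_string): 1-based result position}

    ASSUMPTION: a page contributes (1 / traffic rank) * (1 / position),
    so high-traffic pages near the top of the results count most.
    """
    values = defaultdict(float)
    for (page, string), position in positions.items():
        values[string] += (1.0 / traffic_rank[page]) * (1.0 / position)
    return dict(values)

# Made-up example data.
traffic_rank = {"A": 1, "B": 2, "C": 3}
positions = {
    ("A", "ford"): 1,
    ("B", "ford"): 174,
    ("B", "clutch"): 2,
    ("C", "clutch"): 1,
}

values = string_values(traffic_rank, positions)
top = sorted(values, key=values.get, reverse=True)
print(top)
```

Because only ranks are known, any weighting scheme here is a modelling choice. It is worth checking whether the ordering of search strings stays stable under a different decay, such as 1 / log2(rank + 1).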
Yes, I am specifically trying to increase the ranking of a new page by looking at the keywords contained in existing pages and their actual ranking vs. the most popular searches. So if pages P1, P2... P100 get a lot of traffic and U1, U2... U1000 get little traffic, I want to look at their keyword mix and try to figure out which keyword combinations are specific to P1 to P100 and seem to push them upwards in the search result ranks. I think knowing the relative popularity of search strings, the number of results each one returns, and which pages they return should provide very useful signals in this case. For example, I know which search strings return the popular pages and at what rank. Since I already know that, I don't see what more a recommender system would give me, unless there is something I don't understand in your suggestion. So to recap, what I know is: 1) pages sorted in order of traffic: P1, P2, P3... Thank you for your help! It is very appreciated.
(Apr 14 '14 at 11:43)
Louis Tremblay
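The comparison described in this comment, keywords specific to the high-traffic pages versus the rest, can be sketched as a simple lift ratio that stays human readable. This is a minimal illustration, not a recommendation of a specific method: `term_lift`, the additive smoothing, and the sample keyword sets are all assumptions.

```python
def term_lift(top_pages, other_pages, smoothing=1.0):
    """Compare how often each term appears in top pages vs. the rest.

    top_pages, other_pages: lists of keyword sets, one set per page.
    Returns {term: share of top pages containing it /
                   share of other pages containing it}.

    ASSUMPTION: additive smoothing keeps the ratio finite for terms
    that never appear in one of the two groups.
    """
    terms = set().union(*top_pages, *other_pages)
    lift = {}
    for term in terms:
        top_share = (sum(term in page for page in top_pages) + smoothing) / (
            len(top_pages) + 2 * smoothing
        )
        other_share = (sum(term in page for page in other_pages) + smoothing) / (
            len(other_pages) + 2 * smoothing
        )
        lift[term] = top_share / other_share
    return lift

# Made-up example: keyword sets for high-traffic pages and for the rest.
top_pages = [{"ford", "clutch"}, {"ford", "brakes"}]
other_pages = [{"clutch"}, {"brakes"}, {"clutch", "tires"}]

lift = term_lift(top_pages, other_pages)
for term in sorted(lift, key=lift.get, reverse=True):
    print(term, round(lift[term], 2))
```

A lift well above 1 marks a term that is over-represented in the high-traffic group. To look at combinations rather than single terms, each page's set could instead hold keyword pairs (for example, built with `itertools.combinations`), at the cost of needing more pages per pair for the ratios to be stable.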
|