
Hi, I am very new to this field. I am working to develop a recommendation model where I push the right kind of content (text) to users based on their stated (optional) interests and on how they have interacted with content previously. This is particularly important to me because I want to (mainly) engage the user in intensive tasks with the content (e.g. summarising, transcreating, etc.), hence it is absolutely essential to get the recommendations right.

1. I understand that I should serve the user content similar to what they have interacted with previously, and hence there are text features I should look at in particular. Any good pointers as to what I should be looking for?
2. I am not sure collaborative filtering methods would be able to deliver for this problem. Is this so? Please let me know what you think.

Any other comments, suggestions in this direction are welcome. Any pointers to papers or websites that already do something similar are welcome.

Edit: I should add that recommending content based on what other, similar users have interacted with might not be of much use: since those other users have already begun working on that content (e.g. begun a summarisation), it is a job they should ideally finish themselves, and it would not serve the primary purpose of getting the original user to work on new, relevant content.

asked Dec 31 '10 at 07:10

Green Edu

edited Dec 31 '10 at 08:12


5 Answers:

Although the question is vague on several points, it appears that the intended users of the application are subject matter experts of sorts who will be summarizing and otherwise converting or improving the text content, in ways which imply understanding and expertise in specific domains.
The users' areas of interest/expertise are provided explicitly when they (optionally) state their interests, and are inferred from the way the users interacted with content previously shown to them.

From the question, we can perhaps assume that the application has (or should have) a list of domains of interest (the list from which users may optionally pick their interests).
Another assumption is that the semantic content of the texts selected for the users is the main driver of their interest. Features such as text length, author, date, etc. may play a secondary role, but on the whole the texts should be selected for a given user on the basis of the topic(s) discussed therein.

With this in mind, a tentative list of features to focus on is:

  • "Salient" words in the text
    for example, the top 20 words ranked by their tf-idf score (a minimal sketch of this is given after this list).
  • Keywords:
    for example, non-trivial words from the title (if the title happens to be explicit) or, more generally, from the table of contents, from highlighted words (if the text is marked up), or from captions (of images, graphs, ...).
    The availability of such keywords is contingent on the specific format and structure of the texts, but they are mentioned here generically because such words are typically very representative, semantically speaking, and can be parsed easily with relatively simple patterns/heuristics.
  • Named Entities
    This would imply pre-processing the text to recognize named entities. Such a task may be doubly rewarding, however: first, it provides a set of typically relevant features; second, the NER process itself may be supported by a lexical repository mapped to the various areas of interest/expertise.
  • particular fields from the metadata:
    Author,
    Date (should be a relative date, possibly on a log scale, e.g. 1 = less than 2 days old, 2 = less than a week old, 3 = less than a month old, etc.),
    Number of words,
    Subject / keywords?
  • feedback from user.
    Could be a 1-5 rating, from "not interested at all" to "I positively loved it, send me more right now!". Remember to allow the user not to rate every single text proposed to him/her, or maybe allow for deferred rating, as in "a priori interested, but not right now".
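
For the "salient" words item, here is a minimal Python sketch, assuming scikit-learn is available, that ranks each document's words by tf-idf and keeps the top 20; the two documents below are only placeholders for the real corpus.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "First placeholder document about summarisation and text features ...",
        "Second placeholder document about collaborative filtering and interests ...",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)             # shape: (n_docs, n_vocab)
    vocab = np.array(vectorizer.get_feature_names_out())

    def top_k_words(doc_index, k=20):
        """Return the k highest tf-idf words of one document."""
        row = tfidf[doc_index].toarray().ravel()
        top = row.argsort()[::-1][:k]
        return [(vocab[i], row[i]) for i in top if row[i] > 0]

    for i in range(len(docs)):
        print(i, top_k_words(i))

The resulting word lists can be used directly as sparse features, or compared between a text and the texts a user has already worked on.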

As to the validity of collaborative filtering... That's a broad subject... I'll be back on that...

answered Jan 01 '11 at 15:26

ecotone

This is quite an interesting problem, because you have a lot of information to harvest:

  • (A) Documents can be characterized based on their content (keywords or a semantic signature can be used).
  • (B) Users can be characterized by their self-description (the categories they have picked in their profile).
  • (C) Users and documents can be characterized based on which documents/users they have watched/been watched by.

So from C you have a matrix from which you can extract another representation of documents and users, using SVD, RSVD, or many other methods. This is true unless no document has been watched by two users; in that case, you won't obtain anything usable.
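
As a rough sketch of the C part, assuming the interactions are available as sparse (user, document) pairs, a truncated SVD of the interaction matrix with SciPy could look like this; the triples below are made up:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    n_users, n_docs = 5, 8
    rows = [0, 0, 1, 2, 3, 3, 4]      # user indices
    cols = [1, 3, 3, 0, 2, 7, 5]      # document indices
    vals = [1.0] * len(rows)          # 1 = "interacted with this document"

    C = csr_matrix((vals, (rows, cols)), shape=(n_users, n_docs))

    k = 3                              # number of latent factors
    U, s, Vt = svds(C, k=k)            # U: users x k, Vt: k x documents

    user_factors = U * s               # fold singular values into the user side
    doc_factors = Vt.T                 # documents x k

    # Score every document for user 0 and rank them (already-seen documents included).
    scores = doc_factors @ user_factors[0]
    print(np.argsort(scores)[::-1])

With real data the entries of C would presumably be weighted by interaction type (opened, summarised, rated, ...), not just 0/1.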

So you have two incompatible measures (A and B), and a matrix from which you can extract no information. That's really bad unless you can inject the A and B knowledge into the matrix C.

  • My method (which is still undisclosed, sorry) can do it, but I don't know any other method that could do that.
  • However, you can cheat a little: instead of one user per row in your matrix, you could use a matrix with one row per interest and one column per vocabulary word, then apply an SVD to it after a TF-IDF or equivalent step. This way you'll be able to characterize a new document based on its words (see the sketch after this list).
  • Finally, you could use HOOI on the Words × Documents × Users × Interests tensor. That would actually be awesome.
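
A minimal sketch of that "cheat", assuming one pseudo-document per declared interest (for example built by concatenating texts already tagged with that interest): apply TF-IDF and a truncated SVD, then project a new document into the same latent space and rank interests by cosine similarity. The interest texts below are stand-ins.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # One pseudo-document per declared interest (stand-in text).
    interest_texts = {
        "machine learning": "regression classification clustering features model training evaluation",
        "linguistics": "syntax semantics morphology corpus annotation translation grammar",
        "summarisation": "summary abstract compression extraction sentence salience length",
    }

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(list(interest_texts.values()))  # interests x vocabulary

    svd = TruncatedSVD(n_components=2)        # tiny value for this toy example
    interest_latent = svd.fit_transform(X)    # interests x k

    # Project a new document into the same latent space and rank interests.
    new_doc = ["an article on corpus-based translation, grammar and semantics"]
    doc_latent = svd.transform(vectorizer.transform(new_doc))

    sims = cosine_similarity(doc_latent, interest_latent)[0]
    for name, score in zip(interest_texts, sims):
        print(name, round(float(score), 3))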

answered Jan 03 '11 at 15:47

Guillaume Pitel

Interesting point about Hunch. Yes, we are trying to build such taste profiles, so that users interact with the content more and more often. By the way, is there a technical name for knowledge transfer, or any paper/article you can point me to? (Googling "knowledge transfer natural language processing" is leading me elsewhere.)

answered Jan 02 '11 at 05:32

Green Edu

Check this one: http://portal.acm.org/citation.cfm?id=1273592. It is called transfer learning; sorry for the mistake.

(Jan 02 '11 at 05:52) Leon Palafox ♦

You might want to take a look at knowledge transfer, that is, how other kinds of data relate statistically to your current space, e.g. how the variety of movies you watch correlates with the variety of furniture you have in your house.

It is kind of a hot topic right now, and an engine built with a correct implementation of it might be something of an improvement over existing engines.

Try looking into hunch.com, which is a startup with a fairly similar idea.

Another avenue is the Netflix Prize contest, whose goal was to provide recommendations based on previously seen movies. Several papers were written about it, and the solutions are more on the practical side than on the theoretical one, so if you want to see fast, clean implementations, that is the way to go.
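
For a flavour of what those solutions build on, here is a toy, Netflix-style matrix factorization trained by SGD on made-up (user, item, rating) triples; the actual contest entries add biases, regularization schedules, temporal effects, and ensembling on top of this basic idea.

    import numpy as np

    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
    n_users, n_items, k = 3, 3, 2
    lr, reg, epochs = 0.05, 0.02, 200

    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()                       # use the pre-update user vector
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])

    print(P @ Q.T)   # predicted rating matrix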

answered Jan 02 '11 at 03:33

Leon Palafox ♦

Thanks for the answer. It has given me good pointers on the direction I should move in. To be a little more specific: the content transformers would be everyday users, and I want to serve them content according to their interests to ensure sustained involvement. Any comments on using OpenCalais for the majority of the things you described? Would you suggest anything else?

answered Jan 01 '11 at 23:20

Green Edu
