Recently I have been mulling over with my peers how to implement recommendations of articles to read based on a user's interests. The system already in place is a simple IR system that finds articles similar to the one you have read and ranks them by similarity in a sidebar for the user to read. The next step is to map user interests so that we extract only those articles the user is interested in, and present them. (Raw data for a user's interests may be assumed to be collected from their Twitter/Facebook data, or from their selection of a set of interests in a web interface.) The user data I collected from random Twitter profiles looks like this (ignore the numbers to the right of the bigrams).

My peers suggest that we map both the articles and the user interest data to an ontology to find a correlation; the ontology proposed will probably be one built from the YAGO and DBpedia datasets. The expectation is that, for example, an article like this one: http://www.examiner.com/conservative-in-atlanta/what-would-reagan-tell-obama-1 would pick up on topics like "Barack Obama" and "Ronald Reagan". It would also notice a couple of words that are mentioned often, "unemployment" and "economy", and would thus be able to generate something like the following "pastiche" of the article:

    http://api.myapi/parse?url=http://www.examiner.com/conservative-in-atlanta/what-would-reagan-tell-obama-1
    {
      known_topics:   ["Barack_Obama", "Ronald_Reagan"],
      related_topics: ["George_Bush", "Bill_Clinton", "President_of_the_United_States"],
      known_words:    ["unemployment", "economic", "presidency"],
      related_words:  ["economic process", "presidential", "market forces"]
    }

The user's interests could then easily be queried against the fields above by also mapping them to similar structures. I was looking into YAGO2, and I see a plethora of information in it that could help me, but how do I start designing such a system?

1) What is the right way to choose the right relations (out of so many) to query for a given named entity? I am thinking that I should first query to find what category the entity belongs to, and then follow a set of hard-coded rules to get the kind of information I want (a rough sketch of what I have in mind follows below).

2) "Barack Obama" would probably be in a different class from "US elections 2008", but they do have something in common. How do I design sound inference logic over such data to extract important facts, e.g. that Barack Obama ran in the 2008 US election and is thus related to words like "politics" and "president"?

Is an ontology the right tool for this? Where can I find good literature/tutorials on using YAGO and its design?
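To make question 1 concrete, here is a minimal sketch of the class-first idea, run against the public DBpedia SPARQL endpoint (which also exposes YAGO classes through rdf:type). The SPARQLWrapper package, the endpoint, and the RULES table are illustrative assumptions, not an existing system:

    # Sketch: fetch an entity's classes first, then apply hard-coded
    # per-class rules for which relations to follow. Assumes the public
    # DBpedia endpoint and the SPARQLWrapper package.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def entity_types(entity):
        """Return the rdf:type classes of a DBpedia resource."""
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery("""
            SELECT ?type WHERE {
                <http://dbpedia.org/resource/%s> rdf:type ?type .
            }
        """ % entity)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [r["type"]["value"] for r in results["results"]["bindings"]]

    # Hypothetical rule table: which relations to query per class.
    RULES = {
        "http://dbpedia.org/ontology/Politician": ["dbo:party", "dct:subject"],
        "http://dbpedia.org/ontology/Election":   ["dct:subject"],
    }

    for t in entity_types("Barack_Obama"):
        if t in RULES:
            print(t, "->", RULES[t])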
I suggest you have a look at the wikipedia-miner project. Especially its ML-based disambiguator and link-detector tools could be useful in your project.

Thanks for the link. The task of finding related items could definitely be improved by wikipedia-miner, but at this point I am already getting good related items, so what I want is to make recommendations. This is inherently different from finding related items, disambiguation, or alias detection. For example, for an entity like "Harry Potter (character)", a set of related items might be the books, the other characters in Harry Potter, etc., whereas a set of recommended items might be, say, characters from one of J. R. R. Tolkien's books. One lead I have is to take the wiki categories of an identified entity and recommend entities that appear on those category pages, but the technique needs some IR metrics to limit the high recall in some cases.
(Mar 23 '11 at 11:11)
kpx
I am not sure if this idea can be generalized, but if you look at the article "Harry Potter (character)", it belongs to a number of interesting categories such as "Child characters in film", "Child characters in written fiction", "Child superheroes", etc. So if you crawl the hierarchy up one level and look at the siblings of the Harry Potter article, you would find other child superheroes such as:
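A rough sketch of this category-sibling lookup in Python, using the public MediaWiki API (the requests package is assumed; the query parameters below are the stock categories/categorymembers ones, and whether one level up is the right granularity remains the open question):

    # Sketch: collect the articles that share a category with a given
    # article, i.e. its "siblings" one level up the category hierarchy.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def categories_of(title):
        """Non-hidden categories that the given article belongs to."""
        params = {"action": "query", "prop": "categories",
                  "clshow": "!hidden", "cllimit": "max",
                  "titles": title, "format": "json"}
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        return [c["title"] for c in page.get("categories", [])]

    def siblings(title):
        """Articles sharing at least one category with `title`."""
        result = set()
        for cat in categories_of(title):
            params = {"action": "query", "list": "categorymembers",
                      "cmtitle": cat, "cmlimit": "max", "format": "json"}
            data = requests.get(API, params=params).json()
            result.update(m["title"] for m in data["query"]["categorymembers"])
        result.discard(title)
        return result

    print(sorted(siblings("Harry Potter (character)"))[:20])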
OK, so I did finally get some good data for my recommendation system. But wait, there was a more important task first! What I realized was that some things had to be achieved before recommendation could be attempted, namely named entity normalization, alias detection, and disambiguation. This is mainly because a) it is necessary to reduce the redundancy between named entities, to cut noise and consolidate metrics, and b) it helps a lot to know which related entities are closest to a given entity. For example, if you are a news website (like the one I work for) tracking entities by their hit ratio on your site to craft your news better, you can get a list of entities that are closely related to your trending topics, study their hit ratios, and see which related items to use as well.

Here is how I achieved this. I ran the NLTK tagger on my dataset to extract noun bigrams via POS tagging, then used a simple Wikipedia database to find the related items in Wikipedia corresponding to those noun bigrams. I calculated the cosine similarity between the related items of two query entities and weighted them to keep the top K results; this in general gave me similar entities within my dataset. So, for example, take a query entity E1 and find its similarity to E2 and E3 via their corresponding related items Rn:

    E1: [R1, R2, R3, R4]
    E2: [R2, R4]
    E3: [R3]

    S(E1, E2) = E1·E2 / (|E1|·|E2|) = 2 / (2·√2) ≈ 0.707

and similarly for the other entities. Next I wanted to expand the query entities to find other related entities. I took the related items of the top K matched entities above and kept just those that were in my dataset; this is equivalent to query expansion and gave me more results. This is a mockup of the results: a page that displays 5-tuples (hard-coded to avoid clutter) of randomly chosen related named entities. My dataset was a news corpus for the client I work for. There is one important consideration: an entity like "United States" is very general and highly occurring, and is probably not useful, so I down-weighted such entities by a simple IDF metric.

Sounds simple, right? Well, it is! The only quirk is the normalization of the named entities to some ground form. You are essentially querying your dataset's entity set against Wikipedia to get related entities, and you are also "looking back" at Wikipedia entities that are in your dataset to do the query expansion. This normalization step helps remove a lot of the junk the NLTK POS tagger spits out. The similarity measures are calculated via a graph-like structure rather than pairwise, making the computation extremely efficient, and only 2 NoSQL storage tables are required to store the data, so the system is not complex infrastructure-wise. One pleasant side effect is that Wikipedia directly offers a disambiguation dataset to get those too. Abbreviation/alias detection was done via a Levenshtein distance measure.

My next challenge is to build a recommendation system on top of this, which is essentially different from a related-items system. Pointers much appreciated! Rough sketches of the main pieces are below.
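First, the noun-bigram extraction step, assuming NLTK with its standard tokenizer and tagger models downloaded (nothing here beyond stock NLTK calls):

    # Sketch: POS-tag the text and keep consecutive noun-noun pairs.
    import nltk  # assumes punkt and the default POS tagger are installed

    def noun_bigrams(text):
        """Return consecutive (noun, noun) token pairs."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return [(w1, w2)
                for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                if t1.startswith("NN") and t2.startswith("NN")]

    print(noun_bigrams("Barack Obama discussed unemployment with Ronald Reagan."))
    # e.g. [('Barack', 'Obama'), ('Ronald', 'Reagan')]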
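Next, the set-based cosine similarity from the worked example above; each entity is a binary vector over its related items, so E1·E2 is just the size of the intersection:

    # Sketch: cosine similarity between two sets of related items.
    import math

    def cosine(related_a, related_b):
        a, b = set(related_a), set(related_b)
        if not a or not b:
            return 0.0
        return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

    E1 = ["R1", "R2", "R3", "R4"]
    E2 = ["R2", "R4"]
    E3 = ["R3"]
    print(round(cosine(E1, E2), 3))  # 0.707, matching the example above
    print(round(cosine(E1, E3), 3))  # 0.5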
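The IDF down-weighting of overly general entities; the document frequencies here are made-up numbers for illustration, not values from my corpus:

    # Sketch: demote entities that occur in almost every document.
    import math

    def idf_weight(entity, doc_freq, n_docs):
        """Plain IDF: log(N / df), smoothed by 1 to avoid division by zero."""
        return math.log(n_docs / (1 + doc_freq.get(entity, 0)))

    doc_freq = {"United_States": 9500, "Ronald_Reagan": 120}  # invented counts
    print(round(idf_weight("United_States", doc_freq, 10000), 2))  # ~0.05
    print(round(idf_weight("Ronald_Reagan", doc_freq, 10000), 2))  # ~4.41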
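Finally, the abbreviation/alias detection via Levenshtein distance; nltk.edit_distance implements the measure, and the 0.25 threshold is a guess rather than the value I actually tuned:

    # Sketch: two surface forms are aliases if their normalized
    # edit distance is small enough.
    import nltk

    def is_alias(a, b, threshold=0.25):  # threshold is an assumption
        dist = nltk.edit_distance(a.lower(), b.lower())
        return dist / max(len(a), len(b)) <= threshold

    print(is_alias("Barack Obama", "Barak Obama"))    # True
    print(is_alias("Barack Obama", "Ronald Reagan"))  # False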
OK, I have found a technical report describing YAGO2 extensively: http://domino.mpi-inf.mpg.de/internet/reports.nsf/c125634c000710d0c12560400034f45a/97dff14cb0fd1562c12577d9002c0d46/$FILE/MPI-I-2010-5-007.pdf I will go over it; this should answer my questions on how exactly YAGO is built. Now all I need to understand is how to work with it.