
I am looking for machine learning techniques that are used for mapping the topics in one corpus to the topics in another corpus. They need not be from the same domain, and the actual mapping can represent a "related to" relation instead of a "same as" relation.

For example, the first corpus might contain articles describing various cuisines (e.g., Chinese, Italian, Indian) and the second corpus contains short descriptions of various restaurants. Are there common machine learning approaches used to extract associations between topics in the first corpus (cuisines) and topics in the second corpus (restaurants)?

Given a small amount of training data, are there any effective semi-supervised approaches to the same problem?

asked Feb 03 '11 at 01:37

kungpaochicken

You have an interesting problem. I think a little more detail might help. Do you have the topics already, or do you need to extract them as part of this process? If you have the topics, are the documents already labeled with the appropriate topics?

(Feb 03 '11 at 02:32) TR FitzGibbon

The topics don't exist, at least not directly; however, there is some structure in the documents, like title, summary, section titles, etc. The title of a document serves well as a proxy for the topic, but the topics still have to be extracted. Also, there is no tag information.

(Feb 04 '11 at 01:36) kungpaochicken

2 Answers:

You might try modeling it with a Hierarchical Dirichlet Process, with a meta-topic of food that includes both cuisines and restaurants; that meta-topic will in turn model the kind of mapping you are looking for.

I am also thinking PCA could help you relate these two variables.

answered Feb 03 '11 at 04:51

Leon Palafox ♦

Can you expand a little bit on the PCA? (For a document-term matrix, since there are 2 corpora, there are 2 matrices, one for each corpus.) Also, I have a small amount of (expensive) training data, and I would like to use it to improve model performance.

(Feb 04 '11 at 01:46) kungpaochicken

In essence, PCA allows you to create a topic vector that relates your variables. It usually depends on the variables being part of the same set, but if you have two sets of the same length you might be able to get something out of it.

Try checking Andrew Ng's lecture on PCA; it will explain it a lot better than I can.
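To make the idea concrete, here is a minimal numpy sketch of that PCA approach, using made-up toy matrices (the sizes, data, and choice of 5 components are all illustrative assumptions, not part of the original answer). Both corpora are represented as document-term matrices over a shared vocabulary, stacked so the components span a common "topic" space:

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are shared vocabulary terms.
rng = np.random.default_rng(0)
cuisine_docs = rng.random((20, 50))      # corpus 1: cuisine articles
restaurant_docs = rng.random((30, 50))   # corpus 2: restaurant descriptions

# Stack both corpora so the components are learned from the common term space.
X = np.vstack([cuisine_docs, restaurant_docs])
mean = X.mean(axis=0)

# PCA via SVD: the top right-singular vectors are the "topic" directions.
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 5
components = Vt[:k]                      # (k, n_terms)

# Project each corpus into the shared k-dimensional topic space.
cuisine_topics = (cuisine_docs - mean) @ components.T
restaurant_topics = (restaurant_docs - mean) @ components.T

# Cosine similarity between projections gives a soft "related to" score.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

score = cosine(cuisine_topics[0], restaurant_topics[0])
```

Documents from the two corpora that land near each other in this shared space are candidates for a "related to" link.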

(Feb 04 '11 at 02:03) Leon Palafox ♦

An older, often forgotten, method from statistics is canonical correlation analysis (CCA). Wikipedia has a reasonable description that makes me think CCA, or some fancier version adapted to text data, could be relevant.
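For reference, classical CCA can be sketched in a few lines of numpy via whitening plus an SVD. Note CCA needs paired observations (one row per aligned example in both views), which is where the small amount of labeled training data could come in; the data below is synthetic with an assumed shared latent structure, purely for illustration:

```python
import numpy as np

def cca(X, Y, k, reg=1e-3):
    """Classical CCA via whitening + SVD. X and Y are paired views
    (one row per aligned example). Returns projections and correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(M)
    Wx = inv_sqrt(Sxx) @ U[:, :k]    # projects the X view into canonical space
    Wy = inv_sqrt(Syy) @ Vt[:k].T    # projects the Y view into canonical space
    return Wx, Wy, s[:k]             # s holds the canonical correlations

# Hypothetical paired data: e.g. cuisine-article features and
# restaurant-description features for the same labeled pairs.
rng = np.random.default_rng(1)
Z = rng.random((100, 3))                                   # shared latent structure
X = Z @ rng.random((3, 40)) + 0.1 * rng.random((100, 40))
Y = Z @ rng.random((3, 60)) + 0.1 * rng.random((100, 60))
Wx, Wy, corrs = cca(X, Y, k=3)
```

New documents from either corpus can then be projected with `Wx` or `Wy`, and proximity in the canonical space read as relatedness.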

A second idea you might try is an auto-encoding neural network (sometimes called an autoassociative network). The hidden layer will be the shared topic space in which the documents will be clustered. Training is based on the squared error of the network outputs compared to the network inputs. It would be important to randomly mix the two corpora together when presenting training examples. After training, freeze the weights and extract the hidden unit activations for each example to get the document's projection into topic space.
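A minimal numpy sketch of that setup, assuming toy term-count rows and an arbitrary choice of 5 hidden units (all data and hyperparameters here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixed corpus: term rows from both corpora, shuffled together
# so the hidden layer learns one shared topic space.
docs = rng.random((50, 30))
n_hidden = 5

# One-hidden-layer autoencoder trained on squared reconstruction error.
W1 = 0.1 * rng.standard_normal((30, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, 30))
b2 = np.zeros(30)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

init_mse = np.mean((sigmoid(docs @ W1 + b1) @ W2 + b2 - docs) ** 2)

lr = 0.1
for epoch in range(200):
    for i in rng.permutation(len(docs)):   # re-mix examples every epoch
        x = docs[i]
        h = sigmoid(x @ W1 + b1)           # hidden "topic" activations
        y = h @ W2 + b2                    # linear reconstruction of the input
        err = y - x                        # gradient of squared error w.r.t. y
        grad_h = (err @ W2.T) * h * (1 - h)
        W2 -= lr * np.outer(h, err);   b2 -= lr * err
        W1 -= lr * np.outer(x, grad_h); b1 -= lr * grad_h

# After training, freeze the weights; the hidden activations are each
# document's projection into the shared topic space.
topic_space = sigmoid(docs @ W1 + b1)
```

Clustering or nearest-neighbor search over `topic_space` then gives the cross-corpus associations.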

answered Feb 17 '11 at 18:20

Art Munson


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.