I am looking for machine learning techniques for mapping the topics in one corpus to the topics in another corpus. The corpora need not be from the same domain, and the mapping can represent a "related to" relation rather than a "same as" relation. For example, the first corpus might contain articles describing various cuisines (e.g., Chinese, Italian, Indian), and the second corpus might contain short descriptions of various restaurants. Are there common machine learning approaches for extracting associations between topics in the first corpus (cuisines) and topics in the second corpus (restaurants)? Given a small amount of training data, are there any effective semi-supervised approaches to the same problem?
You might try modeling it with a Hierarchical Dirichlet Process, with a meta-topic of food under which both cuisines and restaurants fall; that meta-topic will in turn model the kind of mapping you are looking for. I am also thinking PCA could help you relate these two variables.

Can you expand a little bit on the PCA? (For a document-term matrix, since there are two corpora, there are two matrices, one for each corpus.) Also, I have a small amount of (expensive) training data, and I would like to use it to improve model performance.
(Feb 04 '11 at 01:46)
kungpaochicken
In essence, PCA allows you to create a topic vector that relates your variables. It usually assumes the variables come from the same set, but if you have two sets of the same length you might be able to get something out of it. Check the lecture by Andrew Ng on PCA; it will explain it a lot better than I can.
(Feb 04 '11 at 02:03)
Leon Palafox ♦
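To make the PCA suggestion concrete, here is a minimal sketch of one way it could work for two corpora, using scikit-learn. The corpora texts are invented toy examples, and I use truncated SVD (LSA) rather than mean-centered PCA, since that is the usual way to apply this idea to sparse tf-idf matrices. Fitting one vocabulary and one projection over the mixed corpora puts both sets of documents into the same low-dimensional "topic" space, where they can be compared:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the two corpora (invented examples).
cuisines = [
    "italian cuisine pasta tomato basil olive oil",
    "chinese cuisine noodles soy sauce stir fry wok",
    "indian cuisine curry spices lentils naan",
]
restaurants = [
    "cozy trattoria serving pasta and wood-fired pizza",
    "dim sum house with noodles and stir fry dishes",
    "family restaurant with curry, naan and tandoori oven",
]

# Fit one vocabulary over the mixed corpora so both share a feature space.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cuisines + restaurants)

# Project into a low-dimensional latent "topic" space (LSA).
svd = TruncatedSVD(n_components=3, random_state=0)
Z = svd.fit_transform(X)
Z_cuisines, Z_restaurants = Z[: len(cuisines)], Z[len(cuisines):]

# Associate each restaurant with its most similar cuisine topic vector;
# this is the "related to" mapping between the two corpora.
sims = cosine_similarity(Z_restaurants, Z_cuisines)
for i, row in enumerate(sims):
    print("restaurant", i, "-> cuisine", row.argmax())
```

With real corpora you would tune the number of components and likely need much more vocabulary overlap (or shared labeled pairs) for the associations to be meaningful.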
An older, often forgotten method from statistics is canonical correlation analysis (CCA). Wikipedia has a reasonable description that makes me think CCA, or some fancier version adapted to handle text data, could be relevant. A second idea you might try is an auto-encoding neural network (an autoencoder). The hidden layer becomes the shared topic space in which the documents are clustered. Training is based on the squared error of the network outputs compared to the network inputs. It would be important to randomly mix the two corpora together when presenting training examples. After training, freeze the weights and extract the hidden-unit activations for each example to get the document's projection into topic space.
You have an interesting problem. I think a little more detail might help. Do you have the topics already, or do you need to extract them as part of this process? If you have the topics, are the documents already labeled with the appropriate topics?
The topics don't exist, at least not directly. However, there is some structure in the documents, such as title, summary, and section titles. The title of a document serves well as a proxy for its topic, but the topics still have to be extracted. Also, there is no tag information.