I have a collection of documents containing word sequences. These sequences are just sets of words randomly dumped into the documents, i.e., they aren't well-formed sentences, which rules out standard natural language processing. Each word may be conditionally dependent (based on frequency of co-occurrence) on one or more other words. For example, the word 'iceskating' may appear with the word 'sport' more frequently than with 'news'. Also, 'sport' may occur with 'fox-news' given that 'skate' and 'football' also occur. I would like to capture such probabilistic relationships in a graph (preferably a directed acyclic graph). Moreover, I would also like to capture hierarchies between these words [1]. For example, sport -> football -> messi.

I was wondering whether an appropriate graphical model (hierarchical Bayes) is something I should look into.

Edit: In addition, anything that can help me "infer" probabilities for new nodes would be wonderful too. I must also add that "data sparseness" is very much an issue in this set of documents.
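To make this concrete, here is a toy sketch (the corpus below is made up) of the document-level conditional co-occurrence I have in mind:

```python
# A toy sketch of estimating P(w2 | w1) as the fraction of documents
# containing w1 that also contain w2; the corpus here is illustrative.
from collections import Counter
from itertools import permutations

docs = [
    {"iceskating", "sport", "skate"},
    {"sport", "football", "fox-news", "skate"},
    {"news", "politics"},
]

word_docs = Counter()  # number of documents containing each word
pair_docs = Counter()  # number of documents containing each ordered pair
for doc in docs:
    word_docs.update(doc)
    pair_docs.update(permutations(doc, 2))

def cond_prob(w2, w1):
    """P(w2 appears in a document | w1 appears in it), by counting."""
    return pair_docs[(w1, w2)] / word_docs[w1] if word_docs[w1] else 0.0

print(cond_prob("sport", "iceskating"))  # 1.0 in this toy corpus
print(cond_prob("news", "iceskating"))   # 0.0: most pairs are never observed
```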

[1] Mark Sanderson and Bruce Croft. 1999. Deriving concept hierarchies from text. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '99). ACM, New York, NY, USA, 206-213. DOI=10.1145/312624.312679 http://doi.acm.org/10.1145/312624.312679

Disclaimer: I am pretty much an amateur in probability theory.

asked Jan 11 '12 at 13:47

Dexter

edited Jan 11 '12 at 13:49


One Answer:

The simplest way to account for these things is to use a decomposition-based indexing method, such as latent semantic indexing or latent Dirichlet allocation. These methods give you sets of words that roughly correlate, and they allow a word to belong to more than one set, which accounts for polysemy. They are attractive because directly estimating word-word co-occurrences is far too expensive given the power-law distribution of words: there is never enough data to estimate V^2 parameters, where V is the vocabulary size. You can view these methods as low-rank representations of the word-word co-occurrence matrix.
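As a minimal sketch of the low-rank idea (using scikit-learn's LDA implementation, which is my choice here, not something prescribed above; the corpus and parameters are illustrative):

```python
# Fit a tiny LDA model: topics act as a low-rank representation of
# word-word co-occurrence, and a word may score highly in several topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "iceskating sport skate",
    "sport football messi fox-news",
    "news politics election",
]

# Bag-of-words counts; word order is irrelevant, which suits documents
# that are unordered word dumps.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic mixtures

# Top words per topic.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [vocab[i] for i in top])
```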

answered Jan 11 '12 at 14:26

Alexandre Passos ♦

Do these methods allow me to create an ontological structure, specifically a directed acyclic graph? Also, these word co-occurrences embody some kind of subsumption relationship, which needs to be captured.

(Jan 11 '12 at 15:10) Dexter

Some of these create tree structures and some create DAGs. See for example the Pachinko Allocation family of algorithms: http://en.wikipedia.org/wiki/Pachinko_allocation
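As a structural illustration only (this is my sketch, not Pachinko Allocation's inference; the networkx library and the toy hierarchy are assumptions), the kind of DAG such models induce looks like this:

```python
# Root -> super-topics -> sub-topics -> words; a node may have several
# parents, which is what makes this a DAG rather than a tree.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("root", "sport"), ("root", "news"),
    ("sport", "football"), ("sport", "skating"),
    ("news", "football"),            # shared child: subsumed by two parents
    ("football", "messi"),
    ("skating", "iceskating"),
])

assert nx.is_directed_acyclic_graph(g)
print(list(nx.topological_sort(g)))
print(list(g.predecessors("football")))  # ['sport', 'news']
```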

(Jan 11 '12 at 15:13) Alexandre Passos ♦

I found an example of what you mean [1]. Let me know if I'm correct. However, this is something that's already been done. :-(

[1] Jie Tang, Ho-fung Leung, Qiong Luo, Dewei Chen, and Jibin Gong. 2009. Towards ontology learning from folksonomies. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI '09), Hiroaki Kitano (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2089-2094.

(Jan 11 '12 at 15:19) Dexter

Learning ontologies is far from a solved problem. I'm not familiar with the work you referenced, but there are many directions to go in this field, including redefining the problem to something more reasonable.

(Jan 11 '12 at 15:22) Alexandre Passos ♦

Alexandre, you are a life saver. Both [1] and [2] look like a good read. I will delve into them.

[1] Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06).

[2] David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with Pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning (ICML '07).

(Jan 11 '12 at 15:25) Dexter

Alexandre, it was about learning ontologies from folksonomies. The approach looked similar to what you proposed. Even a decent solution in this space is considered good, isn't it? I am curious what redefinition you would suggest.

(Jan 11 '12 at 15:34) Dexter