This question is cross-posted on CrossValidated. However, after stumbling on this site, I speculated that the community here might be better equipped to answer my question.

I am working on a project where I want to extract some information about the content of a series of open-ended essays. In this particular project, 148 people wrote essays about a hypothetical student organization as part of a larger experiment. Although in my field (social psychology), the typical way to analyze these data would be to code the essays by hand, I'd like to do this quantitatively, since hand-coding is both labor-intensive and a bit too subjective for my taste.

During my investigation into ways to quantitatively analyze free-response data, I stumbled upon topic modeling. Since topic modeling uses a term-document matrix to extract groups of words supposedly generated by the latent topics in a corpus, it seemed to be just the tool I was looking for. Unfortunately, when I've applied topic modeling to my data, I've discovered two issues:

  1. The topics uncovered by topic modeling are sometimes hard to interpret
  2. When I re-fit my topic models with a different random seed, the topics seem to change dramatically

Issue 2 in particular concerns me. Therefore, I have two related questions:

  1. Is there anything I can do in the LDA fitting procedure to optimize it for interpretability and stability? Personally, I don't care as much about finding the model with the lowest perplexity and/or the best model fit -- I mainly want to use this procedure to help me understand and characterize what the participants in this study wrote in their essays. However, I certainly do not want my results to be an artifact of the random seed!
  2. Related to the above question, are there any standards for how much data you need to fit an LDA? Most of the papers I've seen that have used this method analyze large corpora (e.g., an archive of all Science papers from the past 20 years), but, since I'm using experimental data, my corpus of documents is much smaller.

I have posted the essay data here for anyone who wants to get his or her hands dirty, and I have pasted the R code I'm using below.

require(tm)
require(topicmodels)

# Create a corpus from the essays
corp <- Corpus(DataframeSource(essays))
inspect(corp)

# Remove punctuation and put the words in lower case
# (newer versions of tm require wrapping base functions like tolower
# in content_transformer)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(tolower))

# Create a DocumentTermMatrix.  The stopwords are the LIWC function word categories.
# I have a copy of the LIWC dictionary, but if you want to do a similar analysis,
# use the default stop words in tm
dtm <- DocumentTermMatrix(corp, control = list(stopwords = 
  c(dict$funct, dict$pronoun, dict$ppron, dict$i, dict$we, dict$you, dict$shehe, 
        dict$they, dict$inpers, dict$article, dict$aux)))

# Use mean term frequency-inverse document frequency to select the desired words
term_tfidf <- tapply(dtm$v / rowSums(as.matrix(dtm))[dtm$i], dtm$j, mean) *
  log2(nDocs(dtm) / colSums(as.matrix(dtm)))
summary(term_tfidf)

dtm <- dtm[, term_tfidf >= 0.04]

# Fit the model; note that the seed must be passed via the control list,
# not as a direct argument to LDA()
lda <- LDA(dtm, k = 5, control = list(seed = 532))
perplexity(lda)
(terms <- terms(lda, 10))
(topics <- topics(lda))
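One way to put a number on issue 2 (this snippet is my own addition, not part of the original script, and the extra seed values are arbitrary) is to refit the same model under several seeds and compute the overlap of the top-term lists across runs:

```r
require(topicmodels)

# Re-fit the same model under several seeds (assumes the dtm built above)
# and compare the top-10 term lists pairwise.  High overlap across seeds
# suggests the topics are not just an artifact of the initialization.
seeds <- c(532, 1001, 2014)
fits <- lapply(seeds, function(s)
  LDA(dtm, k = 5, control = list(seed = s)))

# terms() returns a matrix with one column per topic
top_terms <- lapply(fits, function(fit) terms(fit, 10))

# Jaccard index between two term lists
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# For a pair of runs, greedily match every topic in run A to its
# best-overlapping topic in run B and report the mean Jaccard index
pair_stability <- function(ta, tb) {
  mean(apply(ta, 2, function(topic_a)
    max(apply(tb, 2, function(topic_b) jaccard(topic_a, topic_b)))))
}

pair_stability(top_terms[[1]], top_terms[[2]])
```

If the mean Jaccard index stays low no matter which seeds you pair up, the instability is real rather than an artifact of one unlucky seed.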

asked Jul 01 '13 at 15:04 by Patrick S Forscher
edited Jul 01 '13 at 15:06


3 Answers:

I would say you already know the answer to the problem: your dataset of 148 essays is much too small to estimate anything reliably. Roughly 100 points is only enough to fit a single univariate regression (i.e., 2 variables).

Can you not train on similar data and then apply the trained model to your 148 essays?

answered Jul 01 '13 at 19:51 by SeanV

If I had a similar dataset in hand, I would not need to use topic modeling on this one, as I would already know the topical structure of these data. :) Also, collecting data from this experiment takes quite a bit of time, since each participant requires around 1.5 hours of experimenter labor.

Basically, I understand the limitations of having only 148 essays under more conventional statistical methods (e.g., a simple linear regression model using 3 parameters with a model R^2 of .08 has 80% power to detect one parameter with an R^2 of .05, given an alpha of .05 with 148 participants). What I want to know is, given my situation, what are my options for understanding the topical structure of these data?
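As a sanity check on that power figure (my own addition; the pwr package and the Cohen's f² conversion are assumptions, since the original calculation's software isn't stated):

```r
require(pwr)

# Cohen's f2 for one tested predictor: R2_increment / (1 - R2_full).
# Here the single parameter of interest contributes R^2 = .05 to a full
# model with R^2 = .08.
f2 <- 0.05 / (1 - 0.08)

# u = numerator df (1 tested parameter); v = denominator df = n - p - 1
pwr.f2.test(u = 1, v = 148 - 3 - 1, f2 = f2, sig.level = 0.05)
# power should come out near the quoted 80%
```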

(Jul 01 '13 at 20:18) Patrick S Forscher

I guess if the answer is "You can't do much with that level of data", my question would be, are there any freely available corpora of college-age students talking about race-related issues?

(Jul 01 '13 at 21:21) Patrick S Forscher


There was some recent work specifically aimed at addressing the stability issue. The authors use a network clustering approach to find the initial topics (and to automatically estimate the number of topics), then fine-tune them with LDA or PLSA: http://amaral-lab.org/media/publication_pdfs/PhysRevX.5.011007.pdf

This gives more consistent results because the initial topics are generated deterministically rather than by random initialization, and they tend to lie close to a local maximum in the likelihood space.
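A lighter-weight way to at least measure run-to-run stability (my own sketch, not from the paper) is to align topics from two fits by the similarity of their term distributions, which topicmodels exposes via posterior():

```r
require(topicmodels)

# Fit the same model twice with different seeds (assumes the dtm from the
# question; the seed values are arbitrary).
fit1 <- LDA(dtm, k = 5, control = list(seed = 532))
fit2 <- LDA(dtm, k = 5, control = list(seed = 1234))

# posterior()$terms is a k x V matrix of per-topic term probabilities.
beta1 <- posterior(fit1)$terms
beta2 <- posterior(fit2)$terms

# cor() correlates columns, so transpose to compare topics with topics:
# sim[i, j] = correlation between topic i of run 1 and topic j of run 2.
sim <- cor(t(beta1), t(beta2))

# Match each topic in run 1 to its most similar topic in run 2.  If the
# best-match correlations are all high, the two runs found similar topics.
best_match <- apply(sim, 1, which.max)
apply(sim, 1, max)
```

Topics come back in arbitrary order across runs, so this kind of alignment is needed before any "did the topics change?" comparison is meaningful.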

answered Feb 04 at 12:46 by Brian


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.