Hi,
I work in an archive that holds a large collection (2000-3000) of discourse transcripts with the following characteristics:
- The speaker is the same in every transcript
- Many of the sessions are Q&A sessions, where completely unrelated questions may be asked/answered one after the other.
- We do have a very in-depth knowledge of the corpus and the topics commonly spoken about
- We have compiled a set of about 400-500 concept words which we feel cover all topics in the corpus. However, new documents are being generated all the time, so we need the ability to add new concepts (this is likely to be quite rare, maybe one new concept every 6 months)
- We have already manually tagged about 800 documents, assigning one or more concepts from our list to selected passages in each document
- As well as keyword tags, we also assign a short title to each passage. The user can select from the existing titles (currently about 3500) or create a new one. The title is a much more specific description of the topic spoken about. E.g. keyword = "death", title = "What happens to a person after they die?"
- We also tag a variety of other passage types, e.g. jokes or stories, which the speaker may frequently tell to illustrate a point. From a topic modeling perspective, the words in these passages will not be relevant to the topic being discussed
- Each document has various other metadata associated with it, including event-type, country. These will have an impact on the probability of certain topics being discussed.
We would like to adapt our tagging interface so that, after the user selects a passage of text, they are automatically presented with a list of suitable concept keywords and titles that have been used to tag similar material in the past. This would help the tagger maintain consistency and avoid creating a new title when a suitable one already exists.
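To make the requirement concrete, here is a minimal sketch of the behaviour we have in mind. It is not something we have implemented, and it is not the semi-supervised LDA approach from the post linked below: it simply uses TF-IDF cosine similarity against already-tagged passages as a baseline stand-in. The class name `TagSuggester`, the `(text, keywords, title)` tuple layout and all sample passages other than the "death" example are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TagSuggester:
    """Suggest keywords and titles from the most similar tagged passages."""

    def __init__(self, tagged_passages):
        # tagged_passages: list of (text, keywords, title) tuples (assumed layout)
        self.passages = tagged_passages
        docs = [Counter(tokenize(text)) for text, _, _ in tagged_passages]
        n = len(docs)
        # Document frequency: in how many passages each term appears
        df = Counter(term for doc in docs for term in doc)
        # Smoothed inverse document frequency
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, counts):
        # Terms never seen in the tagged passages are ignored
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def suggest(self, selected_text, k=3):
        query = self._vectorize(Counter(tokenize(selected_text)))
        scored = []
        for (_, keywords, title), vec in zip(self.passages, self.vectors):
            # Cosine similarity (both vectors are already unit length)
            sim = sum(w * vec.get(t, 0.0) for t, w in query.items())
            if sim > 0:
                scored.append((sim, keywords, title))
        scored.sort(key=lambda item: item[0], reverse=True)

        # Pool the tags of the k nearest passages, weighted by similarity
        keyword_scores = defaultdict(float)
        title_scores = defaultdict(float)
        for sim, keywords, title in scored[:k]:
            for kw in keywords:
                keyword_scores[kw] += sim
            title_scores[title] += sim

        def rank(scores):
            return sorted(scores, key=scores.get, reverse=True)

        return rank(keyword_scores), rank(title_scores)


# Hypothetical tagged passages; only the "death" keyword/title pair is real
TAGGED = [
    ("What happens to a person after they die and the body is gone",
     ["death"], "What happens to a person after they die?"),
    ("How should I sit for meditation every morning",
     ["meditation"], "How to sit for meditation"),
    ("Is fear of dying natural for a person",
     ["death", "fear"], "Why do we fear death?"),
]

suggester = TagSuggester(TAGGED)
keywords, titles = suggester.suggest("after they die what happens")
print(keywords, titles)
```

A baseline like this ignores the document metadata (event-type, country) and the joke/story passages described above, which is part of why we are asking whether a topic-model approach would be a better fit.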
I saw this post by David Andrzejewski http://metaoptimize.com/qa/questions/523/semi-supervised-lda- which looks very helpful, but I still don't know exactly which is the right approach for my situation.
Any suggestions would be very much appreciated.
Please see this link for a more complete description of the data model used in our project:
http://code.google.com/p/transcriptstudio4isha/wiki/DataModel
Kind Regards
Swami
http://www.sadhguru.org
asked Oct 06 '12 at 07:04 by swami