I'm looking for medium-sized archives (10-100 users, hundreds of posts each) of IRC rooms, specifically relating to programming or a technical subject. Importantly, they have to be relatively recent (<5 years old). I've had a look around, but Google turns up little, and it's surprisingly difficult to find keywords that don't return irrelevant results. Does anyone know of any?
This is kind of pushing it in terms of "recent" (the data is from 2007), but how do chat logs from #linux on freenode sound? Data set source: http://www.cs.brown.edu/~melsner/ (navigate down to the line titled "IRC Chat Data and Disentanglement Model") Data set description: http://www.cs.brown.edu/~melsner/chat-manual.html
This is very useful, thanks.
(Oct 12 '11 at 18:49)
Robert Layton
I found a great resource: http://irclogs.ubuntu.com/ It has the Ubuntu IRC logs, right up to the present!
I'm unsure who to give credit to for this. That is a great resource as well!
(Oct 13 '11 at 18:08)
Robert Layton
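For anyone wanting to check whether a channel log meets the "10-100 users, hundreds of posts each" criterion from the question, here is a minimal Python sketch. It assumes each message line looks like `[HH:MM] <nick> message`; that layout is my assumption and is not confirmed anywhere in this thread, so adjust the regular expression to whatever the downloaded files actually contain. The names `parse_line` and `posts_per_user` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line format: "[HH:MM] <nick> message" (an assumption, not confirmed here).
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2})\] <([^>]+)> (.*)$")


def parse_line(line):
    """Return (hour, minute, nick, text), or None for lines that are not messages."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    hour, minute, nick, text = match.groups()
    return int(hour), int(minute), nick, text


def posts_per_user(lines):
    """Count how many message lines each nick wrote."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed[2]] += 1
    return counts


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "[09:15] <alice> has anyone tried the new kernel?",
    "[09:16] <bob> yes, works fine here",
    "=== carol has joined #example",
    "[09:17] <alice> good to know",
]
counts = posts_per_user(sample)
```

Join/part notices and other lines that do not match the pattern are simply skipped, so the counts only reflect actual posts.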
See if this works for you: http://metaoptimize.com/qa/questions/7568/one-on-one-chat-corpus
Thanks, but not quite what I'm looking for. Chat-80 is a bit old (it's important that the data is up to date), while NPS is not of a technical nature. I can do some preliminary training on NPS, but the final model has to be built on posts of a more technical nature.
Again, this may not qualify as a chat log, but you can definitely find conversations in it (16 million tweets): http://trec.nist.gov/data/tweets/
Thanks. I'm looking at tweets as well (it's a comparison of short uses of language), but I need IRC as well.
What's the format of the tweet dataset from NIST? Does it include time of day? Thanks for the info.
Don't know. I don't have the data; I did not go through the process of signing the agreement. But if it is in the default JSON format, it should have the time.
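If the data does turn out to be in that JSON format, pulling out the time of day could look roughly like the Python sketch below. It assumes each tweet carries a `created_at` string laid out like `Wed Aug 27 13:08:45 +0000 2008`, as in Twitter's classic API output; whether the NIST files match is unverified, and the sample tweet is invented. The function name `tweet_time_of_day` is hypothetical.

```python
import json
from datetime import datetime

# Assumed timestamp layout, e.g. "Wed Aug 27 13:08:45 +0000 2008".
# This is an assumption about the dataset, not something confirmed in this thread.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_time_of_day(raw_json):
    """Return (hour, minute) of a tweet, in the timezone given by its timestamp."""
    tweet = json.loads(raw_json)
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    return created.hour, created.minute


# Invented sample record, only to show the expected shape of the input.
sample = '{"created_at": "Wed Aug 27 13:08:45 +0000 2008", "text": "hello"}'
hour, minute = tweet_time_of_day(sample)
```

Note that a `+0000` offset means the hour is in UTC, so it would still need converting if the posters' local time of day is what matters.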