I'm trying to practice a logistic regression technique for categorizing text, and I want to build a dataset in the form of a p x n matrix, p rows for plays and n columns for unique words. I already have a text to work from, I just need to count the words in it.
It's important to keep track of which word occurs in which play, so for a given play I've been able to create a Python dictionary that tallies unique words. What I DON'T know how to do is to combine these dicts, so that, e.g.
can be merged to produce a matrix
For clarity I created an example where each play is composed only of unique words - naturally in reality this is not at all true.
How might someone go about building this matrix from these dictionaries? Would it be easier to start from somewhere else?
asked Mar 11 '12 at 19:56
In the end, what I ended up doing was implementing a defaultdict because I liked how it would create dictionaries or dictionary entries (depending on location) when a reference didn't previously exist.
I built a full defaultdict with what I wanted, and then ploddingly output to CSV.
I used the full text dump from opensourceshakespeare.com, here's what I wrote:
answered Mar 15 '12 at 13:52
You can start from your position, which is generally where I get to mid-way through counting. The next step is to obtain the list of all words in your dataset. You can either do this while you are counting to start with or use the following:
At this point, many people 'preprocess' this list of words. For example, document category clustering people will remove function words, or perform lemmaraisation.
The next step is to use the
You are better off storing that in a numpy array for working with (if you don't know numpy, learn it).
answered Mar 11 '12 at 22:03