Revision history
Revision n. 1
Feb 23 '11 at 08:20
Oscar Täckström

One way of countering the data sparseness is to use an n-gram kernel. The simplest approach is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented in, for example, the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour).

Another simple alternative is to induce word representations, so that, for example, "becoz" and "bcuz" are mapped to similar vectors as measured by some metric. A very simple way of achieving this is to collect statistics of co-occurrences with neighbouring words, using random projections to reduce the dimensionality. With enough text, you should be able to obtain decent representations. You could then either add the feature vectors for each word to the document representation (which in my experience does not work very well), or use clustering software to cluster the induced representations. The problem with this approach is that although you will probably get good representations for some words, you will also add a lot of noise.

Other variants that you could try, which I have less practical experience with, are neural-network-based and Brown-clustering-based word embeddings. Perhaps you could directly use the pre-trained models that Joseph makes available (http://metaoptimize.com/projects/wordreprs/)?

Revision n. 2
Feb 23 '11 at 08:22
Oscar Täckström

One way of countering the data sparseness is to use an n-gram kernel. The simplest approach is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented in, for example, the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour). You could also cluster the n-gram representations of the words to find clusters of orthographically similar words.

Another simple alternative is to induce word representations, so that, for example, "becoz" and "bcuz" are mapped to similar vectors as measured by some metric. A very simple way of achieving this is to collect statistics of co-occurrences with neighbouring words, using random projections to reduce the dimensionality. With enough text, you should be able to obtain decent representations. You could then either add the feature vectors for each word to the document representation (which in my experience does not work very well), or use clustering software to cluster the induced representations. The problem with this approach is that although you will probably get good representations for some words, you will also add a lot of noise.

Other variants that you could try, which I have less practical experience with, are neural-network-based and Brown-clustering-based word embeddings. Perhaps you could directly use the pre-trained models that Joseph makes available (http://metaoptimize.com/projects/wordreprs/)?

Revision n. 3
Feb 23 '11 at 08:28
Oscar Täckström

One way of countering the data sparseness is to use an n-gram kernel. The simplest approach is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented in, for example, the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour). You could also cluster the n-gram representations of the words to find clusters of orthographically similar words.

Another alternative, if the words you want to cluster are not necessarily orthographically similar, is to induce distributed word representations. A very simple way of achieving this is to collect statistics of co-occurrences with neighbouring words, using random projections to reduce the dimensionality. With enough text, you should be able to obtain decent representations. You could then either add the feature vectors for each word to the document representation (which in my experience does not work very well), or use clustering software to cluster the induced representations. The problem with this approach is that although you will probably get good representations for some words, you will also add a lot of noise.

Other variants that you could try, which I have less practical experience with, are neural-network-based and Brown-clustering-based word embeddings. Perhaps you could directly use the pre-trained models that Joseph makes available (http://metaoptimize.com/projects/wordreprs/)?

Revision n. 4
Feb 23 '11 at 10:01
Oscar Täckström

One way of countering the data sparseness is to use an n-gram kernel. The simplest approach is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented in, for example, the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour). You could also cluster the n-gram representations of the words to find clusters of orthographically similar words.

Another alternative, if the words you want to cluster are not necessarily orthographically similar, is to induce distributed word representations. A very simple way of achieving this is to collect statistics of co-occurrences with neighbouring words, using random projections to reduce the dimensionality. With enough text, you should be able to obtain decent representations. You could then either add the feature vectors for each word to the document representation (which in my experience does not work very well), or use clustering software to cluster the induced representations. The problem with this approach is that although you will probably get good representations for some words, you will also add a lot of noise.

Other variants that you could try, which I have less practical experience with, are neural-network-based and Brown-clustering-based word embeddings. Perhaps you could directly use the pre-trained models that Joseph makes available (http://metaoptimize.com/projects/wordreprs/)?

Edit: As discussed in another post, topic models such as LDA might also be worth considering. The problem, again, is that although topics capture spelling variations, they also capture a lot of other things that you would not want to collapse. Joseph links to implementations of the various embedding algorithms on his page, as well as to pre-trained models.

Revision n. 5
Feb 23 '11 at 12:15
Oscar Täckström

One way of countering the data sparseness is to use an n-gram kernel. The simplest approach is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented in, for example, the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour). You could also cluster the n-gram representations of the words to find clusters of orthographically similar words.
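To make this concrete, here is a minimal sketch in Python using scikit-learn (my choice of toolkit; the answer does not prescribe one): character n-grams as classifier features, plus a crude grouping of spelling variants by the cosine similarity of their character n-gram profiles. The toy documents, labels and similarity threshold are illustrative only; the gappy n-gram kernel itself would come from the OpenFST kernel extension linked above.

```python
# Minimal sketch (assumed toolkit: scikit-learn): character n-grams as
# classifier features, plus grouping of orthographically similar words by
# cosine similarity over character n-gram counts. Toy data throughout.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

docs = ["becoz i said so", "bcuz i said so", "because I said so"]  # toy corpus
labels = [1, 1, 0]                                                 # toy labels

# Character 3-5-grams (word-boundary aware) as features for a linear classifier.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(),
)
clf.fit(docs, labels)

# Represent each word by its character bigram/trigram counts and list pairs
# whose cosine similarity exceeds an (illustrative) threshold.
words = ["becoz", "bcuz", "because", "today", "2day"]
vecs = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(words)
sims = cosine_similarity(vecs)
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        if sims[i, j] > 0.15:
            print(words[i], "~", words[j], round(sims[i, j], 2))
```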

Another alternative, if the words you want to cluster are not necessarily orthographically similar, is to induce distributed word representations. A very simple way of achieving this is to collect statistics of co-occurrences with neighbouring words, using random projections to reduce the dimensionality. With enough text, you should be able to obtain decent representations. You could then either add the feature vectors for each word to the document representation (which in my experience does not work very well), or use clustering software to cluster the induced representations. The problem with this approach is that although you will probably get good representations for some words, you will also add a lot of noise.
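A sketch of the co-occurrence plus random projection idea, again in Python with an invented toy corpus; the window size, projection dimensionality and cluster count are arbitrary assumptions, and in practice you would use a large corpus and a few hundred dimensions.

```python
# Sketch: induce word vectors from co-occurrence counts with neighbouring
# words, reduce dimensionality with a random projection, then cluster.
# Corpus, window size, dimensionality and k are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    ["i", "love", "it", "becoz", "it", "works"],
    ["i", "love", "it", "bcuz", "it", "works"],
    ["we", "hate", "it", "because", "it", "fails"],
]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within the window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1.0

# Random projection to a low-dimensional dense space (10 dims here;
# use a few hundred with a real corpus).
rng = np.random.default_rng(0)
projection = rng.standard_normal((len(vocab), 10)) / np.sqrt(10)
vectors = counts @ projection

# Cluster the induced representations; spelling variants that occur in
# similar contexts should tend to land in the same cluster.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(vocab, labels)))
```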

Other variants that you could try, which I have less practical experience with, are neural-network-based and Brown-clustering-based word embeddings. Perhaps you could directly use the pre-trained models that Joseph makes available (http://metaoptimize.com/projects/wordreprs/)?

Edit 1: As discussed in another post, topic models such as LDA might also be worth considering. The problem, again, is that although topics capture spelling variations, they also capture a lot of other things that you would not want to collapse. Joseph links to implementations of the various embedding algorithms on his page, as well as to pre-trained models.
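For the topic-model route, here is a tiny LDA sketch using scikit-learn's LatentDirichletAllocation (one possible implementation; the post does not name one). The corpus and topic count are made up, and, as noted above, the resulting topics will mix spelling variants with plenty of other co-occurring words.

```python
# Sketch: LDA as a source of word groupings. Implementation choice
# (scikit-learn) and toy corpus are assumptions; the topics will group
# spelling variants together with many other related words.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "becoz the train was late",
    "bcuz the train was late",
    "the train was delayed again",
    "stock prices fell sharply",
    "prices fell becoz of the news",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top words of each topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```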

Edit 2: If you are mostly concerned with orthographic variation, perhaps you could use n-gram similarity to filter out distributionally similar words that should not be grouped into the same class.
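A minimal sketch of that filtering step: character n-gram (Jaccard) similarity used to decide which distributionally similar candidate pairs are plausibly spelling variants. The candidate pairs and the threshold are illustrative assumptions.

```python
# Sketch: filter distributionally similar word pairs by character n-gram
# overlap (Jaccard), keeping only pairs that are also orthographically close.
# Candidate pairs and the threshold are illustrative assumptions.
def char_ngrams(word, n=2):
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Pairs that a distributional model might propose as similar.
candidates = [("becoz", "bcuz"), ("becoz", "since"), ("2day", "today")]
threshold = 0.2

for w1, w2 in candidates:
    sim = jaccard(char_ngrams(w1), char_ngrams(w2))
    verdict = "keep" if sim >= threshold else "drop"
    print(f"{w1} / {w2}: n-gram similarity {sim:.2f} -> {verdict}")
```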
