One way of countering the data sparseness is to use an n-gram kernel. The simplest option is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented, for example, in the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour).
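For the character n-gram part, something along these lines would work as a starting point (this is just a minimal sketch with scikit-learn; the library choice and parameter values are my own illustration, not a recommendation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["see you later becoz i said so", "ok bcuz you asked nicely"]  # toy documents
labels = [1, 0]                                                       # toy labels

# 'char_wb' extracts character n-grams from text inside word boundaries, so
# spelling variants such as "becoz" and "bcuz" still overlap in features
# (e.g. the word-initial " b" and word-final "z " bigrams).
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
clf.fit(docs, labels)
```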
Another simple alternative is to induce word representations, so that for example "becoz" and "bcuz" are mapped to similar vectors under some distance metric. A very simple way of achieving this is to collect co-occurrence statistics with neighbouring words and use random projections to reduce the dimensionality. With enough text you should be able to get decent representations. You could then either add the feature vectors of each word to the document representation (which in my experience does not work very well), or cluster the induced representations and use the cluster identities as features. The drawback of this approach is that although you will probably get good representations for some words, you will also add a lot of noise.
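A rough sketch of that co-occurrence / random-projection / clustering pipeline is below; the window size, projected dimensionality and number of clusters are arbitrary choices for illustration, and in practice you would run this over a large corpus rather than two sentences:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import SparseRandomProjection

corpus = [
    "i cant come becoz im busy".split(),
    "cant make it bcuz im busy".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-2 word window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# Project the (potentially huge) co-occurrence vectors down to a small dense space.
proj = SparseRandomProjection(n_components=5, random_state=0)
vectors = proj.fit_transform(counts)

# Cluster the projected vectors; the cluster ids can then replace or augment
# the word features in your classifier.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
for w in vocab:
    print(w, clusters[idx[w]])
```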
Other variants that you could try, which I have less practical experience with, are neural-network-based and Brown-clustering-based word representations. Perhaps you could directly use the pre-trained models that Joseph makes available (http://metaoptimize.com/projects/wordreprs/)?
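If I remember correctly the files there are plain text with one word per line followed by its vector components, in which case loading them is straightforward; verify the format before relying on this, and note that the filename below is only a hypothetical example:

```python
import numpy as np

def load_embeddings(path):
    """Load word vectors from a whitespace-separated text file: word v1 v2 ..."""
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue
            word, values = parts[0], parts[1:]
            vectors[word] = np.array(values, dtype=float)
    return vectors

# embeddings = load_embeddings("pretrained-word-vectors.txt")  # hypothetical filename
```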