I am new to machine learning, and I am building a small spelling-corrector project where features would be extracted from incorrect and correct words. Two of the features I decided on are "most incorrectly used" and "most frequently used". So when a user types an incorrect word, the suggestion that pops up should be the correct word people use most often, for the misspellings people are most likely to make.

Since I am new, I want to know how to build lists for these two features. I need words and their counts.

For example, for "most incorrectly used":

freind 567

accomodate 560

etc. Similarly for "most frequently used".
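A minimal sketch of how such lists could be built with Python's `collections.Counter`. The typo log and text here are made-up placeholders; in practice you would load a real error corpus and a large body of ordinary text:

```python
from collections import Counter

# Hypothetical log of (typed, intended) pairs, e.g. collected from user edits.
typo_log = [
    ("freind", "friend"),
    ("freind", "friend"),
    ("accomodate", "accommodate"),
    ("teh", "the"),
]

# "Most incorrectly used": how often each misspelling was typed.
misspelling_counts = Counter(typed for typed, intended in typo_log)

# "Most frequently used": word frequency in ordinary running text.
text = "the friend of the friend"
word_counts = Counter(text.split())

print(misspelling_counts.most_common(2))  # → [('freind', 2), ('accomodate', 1)]
print(word_counts["the"])                 # → 2
```

`most_common(n)` gives exactly the "word count" list described above, sorted by count.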

Also, does any such list already exist? Please do give tips and suggestions.

asked Sep 24 '14 at 06:01 by pranav_agrawal

edited Sep 24 '14 at 14:48 by Joseph Turian ♦♦


2 Answers:

Using an SVM for this seems like overkill.

This is more of a classical information retrieval problem.

You could take a word, and tokenize it into character n-grams: 2-character ngrams, 3-character ngrams, 4-character ngrams.

Then, you could use tf*idf to weight each ngram. Finally, you can use cosine distance to find the word in the dictionary that is nearest. If you want, you could also use idf to weight the likelihood of a particular word, to add more weight to spelling corrections to common words.
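The approach described above can be sketched in plain Python. This is a minimal, hedged implementation under the stated assumptions (n-gram sizes 2-4, raw counts as tf, log(N/df) as idf); the four-word dictionary is a toy placeholder:

```python
import math
from collections import Counter

def char_ngrams(word, sizes=(2, 3, 4)):
    """Split a word into overlapping character n-grams of the given sizes."""
    grams = []
    for n in sizes:
        grams += [word[i:i + n] for i in range(len(word) - n + 1)]
    return grams

def build_index(dictionary):
    """Compute idf over the dictionary and a tf*idf vector per word."""
    df = Counter()
    for w in dictionary:
        df.update(set(char_ngrams(w)))        # document frequency of each n-gram
    N = len(dictionary)
    idf = {g: math.log(N / df[g]) for g in df}
    vectors = {}
    for w in dictionary:
        tf = Counter(char_ngrams(w))
        vectors[w] = {g: tf[g] * idf[g] for g in tf}
    return idf, vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest(word, idf, vectors):
    """Dictionary word whose tf*idf n-gram vector is closest to the query."""
    tf = Counter(char_ngrams(word))
    q = {g: tf[g] * idf.get(g, 0.0) for g in tf}
    return max(vectors, key=lambda w: cosine(q, vectors[w]))

dictionary = ["friend", "accommodate", "the", "fiend"]
idf, vectors = build_index(dictionary)
print(nearest("freind", idf, vectors))  # → friend
```

The misspelling "freind" still shares rare n-grams like "fr" with "friend", so cosine similarity over tf*idf-weighted n-grams recovers it without any training.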

answered Sep 24 '14 at 14:48 by Joseph Turian ♦♦

As Joseph said, using an SVM is overkill for this task.

Peter Norvig has a great solution for a spelling corrector using a simple Bayesian model. Check out the Spelling Correction section on his site.

All data sets and code can be downloaded at ngrams
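The core of Norvig's corrector fits in a few lines. This is a condensed sketch: the tiny inline corpus stands in for the large text file he uses, and only single-edit candidates are shown (his version also considers two edits):

```python
from collections import Counter

# Toy corpus; Norvig builds WORDS from a large text file instead.
corpus = "the friend of the friend spelled the word friend"
WORDS = Counter(corpus.split())

def P(word):
    """Unigram probability of a word under the corpus counts."""
    return WORDS[word] / sum(WORDS.values())

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of candidates that actually occur in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Most probable correction: prefer the word itself, then 1-edit candidates."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=P)

print(correction("freind"))  # → friend (one transposition away)
```

The Bayesian reading: `correction` picks argmax P(c) over candidate corrections c, with the edit model acting as a crude prior on which errors are likely.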

answered Sep 26 '14 at 09:45 by Vinh Khuc, edited Sep 27 '14 at 17:05


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.