|
What are the subtleties/differences in the pedantic definitions of lexemes, orthographical variants and syntactic equivalence classes? I am dealing with social media text. |
|
A lexeme is the set of forms a word can take depending on its context, within a single part of speech. So 'think', 'thinking', and 'thought' belong to one lexeme, but 'thinker' (being a noun) does not. 'Thinker' would go with 'thinkers'; in a language that differentiates between male and female thinkers, the different gender forms would be part of that noun lexeme as well.

Orthographical variants are alternative spellings. Whereas the members of a lexeme differ according to syntactic rules (number agreement, tense agreement), there is nothing inherently different between two orthographical variants. In social media text, 'what' and 'wut' could be considered orthographical variants. They are not separate members of a lexeme, as they aren't different forms, just different spellings.

Someone with deeper linguistic training may have more to say about syntactic equivalence classes; it's a new term to me. Parsing it out, though, it seems like it should refer to words with the same part of speech (they're syntactically equivalent). If you wanted to break it up into forms of parts of speech, such that you could literally replace any member of a class with any other, then you've got a structure orthogonal to the lexeme: it would contain zero or one members from each lexeme, plus all of that member's orthographical variants. If you take the weaker interpretation of syntactic classes to mean POS, then it'd be the set of all lexemes for the same POS, e.g., all verb lexemes.

It might help to envision a matrix per part of speech. Each column represents a form of that part of speech, e.g. first person singular. Each row represents a lemma, e.g. "dance". Then you fill in the corresponding forms. Each row is a lexeme. Any words that share a cell are orthographical variants. And either each column or each matrix is a syntactic equivalence class.
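To make the matrix picture concrete, here is a minimal Python sketch of it as nested dictionaries: POS, then lemma (row), then form (column), then the set of spellings in that cell. The toy word list, the form labels such as "past" and "invariant", and all function names are assumptions made up for illustration; real social media text would need a tagger and lemmatizer to fill the table.

```python
# Hypothetical toy data: (surface form, lemma, POS, form label).
# The form labels are illustrative, not a standard tag set.
OBSERVATIONS = [
    ("think", "think", "VERB", "base"),
    ("thinking", "think", "VERB", "present participle"),
    ("thought", "think", "VERB", "past"),
    ("dance", "dance", "VERB", "base"),
    ("danced", "dance", "VERB", "past"),
    ("thinker", "thinker", "NOUN", "singular"),
    ("thinkers", "thinker", "NOUN", "plural"),
    ("what", "what", "PRON", "invariant"),
    ("wut", "what", "PRON", "invariant"),  # respelling of "what"
]


def build_matrices(observations):
    """Build one matrix per POS: matrices[pos][lemma][form] -> set of spellings."""
    matrices = {}
    for surface, lemma, pos, form in observations:
        cell = matrices.setdefault(pos, {}).setdefault(lemma, {}).setdefault(form, set())
        cell.add(surface)
    return matrices


def lexeme(matrices, pos, lemma):
    """A row: every spelling of every form of one lemma within one POS."""
    row = matrices.get(pos, {}).get(lemma, {})
    return set().union(*row.values())


def variants(matrices, pos, lemma, form):
    """A cell: spellings that share lemma, POS and form (orthographical variants)."""
    return set(matrices.get(pos, {}).get(lemma, {}).get(form, set()))


def column(matrices, pos, form):
    """A column: one form across all lemmas (the strict equivalence class)."""
    members = set()
    for row in matrices.get(pos, {}).values():
        members |= row.get(form, set())
    return members


matrices = build_matrices(OBSERVATIONS)
print(lexeme(matrices, "VERB", "think"))
print(variants(matrices, "PRON", "what", "invariant"))
print(column(matrices, "VERB", "past"))
```

Under the weaker reading, the whole `matrices["VERB"]` table would be the equivalence class rather than a single column.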
Paul, by syntactic equivalence classes I meant variants of the same word: the class should incorporate transliterated variants (example: Spanish/Hindi written in English), lexemes and orthographical variants.
(Feb 10 '11 at 03:59)
Dexter
Paul, the definition at http://en.wiktionary.org/wiki/lexeme confuses me. Could you give me an example where the same word occurs as different parts of speech and so belongs to different lexemes? Also, does it mean that one can't determine a lexeme by looking at the word alone, without its part of speech?
(Feb 10 '11 at 05:11)
Dexter
No lexeme contains words of different POS. It's just the set of inflections/forms/genders/whatever. So "jump", "jumps", "jumped", but not "jumpy" or "jumper". In some cases, identical words are in different lexemes ("run" the verb vs. "run" the noun). You do need to know the POS to fill in a lexeme. Note that in this case "verb" is the POS: past-tense verb is a form, not a separate part of speech. This matters if you're using, say, the Penn Treebank tags, which are more fine-grained than parts of speech. Of course, more important than the technical definitions is what you want to do with the lexemes. If you have some NLP task where you don't care about POS, then lumping words together irrespective of POS would be fine. But from a technical standpoint, that's not a lexeme.
(Feb 10 '11 at 08:18)
Paul Barba
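As a rough illustration of the point about POS, the sketch below groups tagged tokens two ways: keyed by (lemma, coarse POS), which gives lexemes, and keyed by lemma alone, which is the "lumped" grouping that may be fine for some tasks. The tags are real Penn Treebank tags, but the coarse mapping covers only the few used here, and the token list and function name are made up for the example.

```python
from collections import defaultdict

# Partial mapping from Penn Treebank tags to a coarse part of speech.
# Only the tags used in the toy data below are listed.
COARSE_POS = {
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "NN": "NOUN", "NNS": "NOUN",
}

# Hypothetical pre-tagged, pre-lemmatized tokens: (surface, lemma, Penn tag).
TAGGED = [
    ("run", "run", "VB"),
    ("ran", "run", "VBD"),
    ("runs", "run", "VBZ"),
    ("run", "run", "NN"),
    ("runs", "run", "NNS"),
    ("jumped", "jump", "VBD"),
    ("jumper", "jumper", "NN"),
]


def group_forms(tagged, use_pos=True):
    """Group surface forms by (lemma, coarse POS), or by lemma alone."""
    groups = defaultdict(set)
    for surface, lemma, tag in tagged:
        key = (lemma, COARSE_POS[tag]) if use_pos else lemma
        groups[key].add(surface)
    return dict(groups)


lexemes = group_forms(TAGGED)                 # POS-aware: true lexemes
lumped = group_forms(TAGGED, use_pos=False)   # POS-blind: not lexemes

print(lexemes[("run", "VERB")])
print(lexemes[("run", "NOUN")])
print(lumped["run"])
```

Note that "ran" (VBD) and "runs" (VBZ) land in the same lexeme because both tags collapse to the same coarse POS; the fine-grained tag only distinguishes the form.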
Thanks Paul, that clears a lot of stuff for me.
(Feb 11 '11 at 14:28)
Dexter
|