Summary
- Languages generally have a small set of closed-class words that are highly frequent, often ambiguous, and act as function words, and a larger set of open-class words such as nouns, verbs, and adjectives. Various part-of-speech tagsets exist, with between 40 and 200 tags.
- Part-of-speech tagging is the process of assigning a part-of-speech label to each word in a sequence.
- Two common approaches to sequence modeling:
  - a generative approach, HMM tagging
    - The probabilities in HMM taggers are estimated by maximum likelihood estimation on tag-labeled training corpora
    - The Viterbi algorithm is used for decoding, finding the most likely tag sequence
    - Beam search is a variant of Viterbi decoding that keeps only a fraction of the high-scoring states at each step, rather than all states
  - a discriminative approach, MEMM (Maximum Entropy Markov Model) tagging
    - MEMM taggers train logistic regression models to pick the best tag given an observed word, its context, and the previous tags, and then use Viterbi to choose the best sequence of tags
- Modern taggers are generally run bidirectionally
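The HMM bullets above (maximum likelihood estimation, then Viterbi decoding) can be illustrated with a minimal sketch. The toy corpus, the `<s>` start symbol, and the function names `train_hmm` and `viterbi` are assumptions made for illustration; this is a bigram HMM with no smoothing and no unknown-word handling, not a reference implementation.

```python
from collections import Counter, defaultdict
import math

# Toy tag-labeled corpus, made up for illustration.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
START = "<s>"  # assumed start-of-sentence symbol


def train_hmm(sentences):
    """MLE by relative frequency: P(tag | previous tag) and P(word | tag)."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
             for p, cs in trans_counts.items()}
    emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
            for t, cs in emit_counts.items()}
    return trans, emit


def log_prob(table, given, outcome):
    """Log of a table entry; unseen events get probability 0 (no smoothing)."""
    p = table.get(given, {}).get(outcome, 0.0)
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words, trans, emit):
    """Return the most likely tag sequence for words under the bigram HMM."""
    if not words:
        return []
    tags = list(emit)
    # best[i][t]: log probability of the best tag path ending in t at position i
    best = [{t: log_prob(trans, START, t) + log_prob(emit, t, words[0])
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_prob(trans, p, t))
            best[i][t] = (best[i - 1][prev] + log_prob(trans, prev, t)
                          + log_prob(emit, t, words[i]))
            back[i][t] = prev
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


trans, emit = train_hmm(TRAIN)
print(viterbi(["a", "cat", "barks"], trans, emit))  # ['DT', 'NN', 'VBZ']
```

Working in log space avoids numerical underflow on longer sentences. A practical tagger would also need smoothing and an unknown-word model, both of which this sketch leaves out.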
(Mostly) English Word Classes
Parts of speech can be divided into two broad supercategories:
Closed class types
Those with relatively fixed membership, such as prepositions - new prepositions are rarely coined
Closed-class words are generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar
Closed classes differ more from language to language than open classes do
Some of the important closed classes in English include:
prepositions - on, under, over, near, by, at, from, to, with
prepositions occur before noun phrases
particles - up, down, on, off, in, out, at, by
a particle resembles a preposition or an adverb and is used in combination with a verb
determiners - a, an, the
a closed class that occurs with nouns, often marking the beginning of a noun phrase
conjunctions - and, but, or, as, if, when
conjunctions join two phrases, clauses, or sentences
pronouns - she, who, I, others
pronouns are forms that often act as a kind of shorthand for referring to some noun phrase or entity or event
- personal pronouns - refer to persons or entities (you, she, I, it, me, etc.)
- possessive pronouns - forms of personal pronouns that indicate either actual possession or more often just an abstract relation between the person and some object (my, your, his, its, one’s, our, their)
- wh-pronouns - used in certain question forms, or may also act as complementizers
auxiliary verbs - can, may, should, are
- English auxiliaries include the copula verb be, the two verbs do and have, along with their inflected forms, as well as a class of modal verbs
numerals - one, two, three, first, second, third
interjections - oh, hey, alas, uh, um
negatives - no, not
politeness markers - please, thank you
greetings - hello, goodbye
existential - there
Open class types
Nouns and verbs are open classes - new nouns and verbs like iPhone or to fax are continually being created or borrowed
Four major open classes occur in the languages of the world:
nouns
Open class nouns fall into two classes:
- Proper nouns - in written English, proper nouns are usually capitalized
- Common nouns - divided in many languages into count nouns and mass nouns
verbs - refer to actions and processes
adjectives - include many terms for properties or qualities
adverbs - a rather heterogeneous class (a hodge-podge) in both form and meaning
The Penn Treebank Part-of-Speech Tagset
Part-of-Speech Tagging
HMM Part-of-Speech Tagging
Markov Chains
The Hidden Markov Model
The components of an HMM tagger
HMM tagging as decoding
The Viterbi Algorithm
Working through an example
Extending the HMM Algorithm to Trigrams
Beam Search
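The summary describes beam search as a variant of Viterbi decoding that keeps only the high-scoring states at each step. A minimal sketch of that idea follows. The probability tables `TRANS` and `EMIT`, the tag list, and the function name `beam_viterbi` are hypothetical and use made-up numbers, and a fixed beam width is assumed as one way of choosing the fraction of states to keep.

```python
import math

# Hypothetical toy HMM parameters (made-up numbers, for illustration only).
TAGS = ["DT", "NN", "VB"]
TRANS = {  # P(tag | previous tag); "<s>" is an assumed start state
    "<s>": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {  # P(word | tag)
    "DT": {"the": 0.9, "a": 0.1},
    "NN": {"dog": 0.5, "walk": 0.3, "park": 0.2},
    "VB": {"walk": 0.6, "dog": 0.1, "park": 0.3},
}


def logp(p):
    """Log probability, mapping zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def beam_viterbi(words, beam_width=2):
    """Viterbi-style decoding that keeps only beam_width states per step."""
    # Each beam entry maps a tag to (log score, best tag path ending in it).
    beam = {"<s>": (0.0, [])}
    for word in words:
        candidates = {}
        for tag in TAGS:
            emission = logp(EMIT[tag].get(word, 0.0))
            # Extend only the states that survived pruning at the last step.
            score, path = max(
                ((s + logp(TRANS[prev][tag]) + emission, p)
                 for prev, (s, p) in beam.items()),
                key=lambda item: item[0],
            )
            candidates[tag] = (score, path + [tag])
        # Prune: keep the beam_width highest-scoring states.
        kept = sorted(candidates.items(), key=lambda kv: kv[1][0],
                      reverse=True)[:beam_width]
        beam = dict(kept)
    return max(beam.values(), key=lambda item: item[0])[1]


print(beam_viterbi(["the", "dog", "walk"], beam_width=2))  # ['DT', 'NN', 'VB']
```

When `beam_width` equals the number of tags, nothing is pruned and the search reduces to ordinary Viterbi. A smaller width does less work per step, but the best sequence can be lost if its prefix is pruned early.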
Unknown Words
Maximum Entropy Markov Models
Features in a MEMM
Decoding and Training MEMMs
Bidirectionality
Part-of-Speech Tagging for Other Languages
Reference
https://web.stanford.edu/~jurafsky/slp3/