Regular Expressions, Text Normalization, Edit Distance
Summary
How to perform basic text normalization tasks:
- word segmentation
- normalization
- sentence segmentation
- stemming
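
As a rough illustration of the four tasks listed above, here is a minimal Python sketch. The function names (segment_words, normalize, segment_sentences, naive_stem) and the specific patterns are assumptions made for this example, not methods taken from these notes. The stemmer in particular is a toy suffix stripper, not a real algorithm such as Porter's.

```python
import re

def segment_words(text):
    # Assumption: a word is a run of letters/digits, optionally followed
    # by an apostrophe clitic such as "'s" or "'t".
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

def normalize(token):
    # Simplest possible normalization: case folding.
    return token.lower()

def segment_sentences(text):
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    # This will wrongly split on abbreviations such as "Dr. Smith".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_stem(token):
    # Toy stemmer: strip one common suffix if at least three
    # characters precede it. Real stemmers use many ordered rules.
    return re.sub(r"(?<=\w{3})(ing|ed|s)$", "", token)

text = "The cats were running. Dogs barked!"
for sentence in segment_sentences(text):
    tokens = [normalize(t) for t in segment_words(sentence)]
    print(tokens, [naive_stem(t) for t in tokens])
```

Each step is deliberately simplistic. The sketch only shows how the tasks chain together: sentences first, then words, then normalization, then stemming.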
Regular Expression
The regular expression language is a powerful tool for pattern-matching.
Basic operations in regular expressions include
- concatenation of symbols
- disjunction of symbols ([], |, and .)
- counters (*, +, and {n,m})
- anchors (^, $)
- precedence operators ((, ))
Word tokenization and normalization are generally done by cascades of simple regular expression substitutions or finite automata.
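
The operators in this section can be tried out with Python's re module. The sketch below is illustrative only: the example strings and the tokenize helper are assumptions chosen for this demonstration, and the two-step cascade is far simpler than a real tokenizer.

```python
import re

# Concatenation: symbols written in sequence must match in sequence.
concat = re.findall(r"cat", "concatenate")

# Disjunction: [] picks one character from a set, | picks one
# alternative, and . matches any single character.
brackets = re.findall(r"[Ww]oodchuck", "Woodchuck or woodchuck")
pipe = re.findall(r"cat|dog", "a cat and a dog")
wildcard = re.findall(r"b.g", "bag big bug beg")

# Counters: * is zero or more, + is one or more, {n,m} is n to m times.
star = re.findall(r"ba*", "b ba baa")
plus = re.findall(r"ba+", "b ba baa")
bounded = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")

# Anchors: ^ matches at the start of the string, $ at the end.
start = re.findall(r"^The", "The end. The")
end = re.findall(r"end\.$", "The end.")

# Precedence: parentheses group, so | applies only to the suffix.
grouped = re.findall(r"gupp(?:y|ies)", "guppy guppies")

def tokenize(text):
    # Hypothetical two-step cascade of substitutions.
    text = re.sub(r"([.,!?;:])", r" \1 ", text)  # split off punctuation
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower().split(" ")               # case-fold, then split

print(tokenize("Hello, world!  How are you?"))
```

Each substitution in the cascade does one small job, and the output of one step is the input to the next. That is the sense in which tokenization is built from simple regular expression substitutions.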