Speech and Language Processing Notes

Regular Expressions, Text Normalization, Edit Distance

Summary

How do we perform basic text normalization tasks? The main ones are:

  • word segmentation
  • word normalization
  • sentence segmentation
  • stemming

Regular Expressions

  • The regular expression language is a powerful tool for pattern-matching.

  • Basic operations in regular expressions include

    • concatenation of symbols
    • disjunction of symbols ([], |, and .)
    • counters (*, +, and {n,m})
    • anchors (^, $)
    • precedence operators ((, ))

  • Word tokenization and normalization are generally done by cascades of simple regular expression substitutions or finite automata.
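
The operators listed above can be tried out with a short sketch. This one assumes Python's standard `re` module; the example strings and variable names are made up for illustration and do not come from the notes. Note that the counter is written `{2,3}` with no space, since a space inside the braces stops it from being read as a counter in Python.

```python
import re

# Concatenation: symbols written in sequence match that exact sequence.
has_cat = re.search(r"cat", "concatenate") is not None

# Disjunction: [] picks one character from a set, | picks one alternative,
# and . matches any single character (except a newline by default).
woodchuck = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck")
pets = re.findall(r"cat|dog", "a cat and a dog")
wildcard = re.findall(r"beg.n", "begin began begun")

# Counters: * is zero or more, + is one or more, {n,m} is n to m repetitions.
digits = re.findall(r"[0-9]+", "room 42, floor 7")
runs = re.findall(r"a{2,3}", "a aa aaa aaaa")

# Anchors: ^ matches at the start of the string, $ at the end.
starts_with_the = bool(re.search(r"^The", "The end"))
ends_with_end = bool(re.search(r"end$", "The end"))

# Precedence: parentheses group a sub-pattern so that | applies only inside it.
grouped = re.findall(r"gupp(?:y|ies)", "guppy guppies")
```

Without the parentheses in the last pattern, `guppy|ies` would match either the whole word "guppy" or the bare suffix "ies", because concatenation binds more tightly than disjunction.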

Words

Corpora

Text Normalization
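
As a rough illustration of the "cascade of simple regular expression substitutions" mentioned in the summary, here is a minimal sketch in Python. The function name `normalize` and the particular rules (case folding, splitting off a handful of punctuation marks, collapsing whitespace) are assumptions chosen for the example, not a prescribed pipeline.

```python
import re

def normalize(text):
    """Apply a small cascade of substitutions, then split into tokens."""
    text = text.lower()                          # case folding
    text = re.sub(r"([.,!?;:])", r" \1 ", text)  # separate punctuation from words
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.split(" ") if text else []

tokens = normalize("Hello, World!  How are you?")
```

A cascade this simple also breaks up strings such as "3.14" or "U.S.", which is why practical tokenizers add further rules for numbers, abbreviations, and clitics.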

Minimum Edit Distance
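
Minimum edit distance is usually computed with dynamic programming over a table of prefix-to-prefix distances. The sketch below is one common formulation, assuming insertions and deletions cost 1 and substitutions cost `sub_cost`; the function name and the `sub_cost` parameter are illustrative choices rather than anything fixed by these notes.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Return the minimum cost of editing source into target."""
    n, m = len(source), len(target)
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]

    # Turning a prefix into the empty string takes one deletion per symbol,
    # and building a prefix from the empty string takes one insertion per symbol.
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                             # deletion
                dist[i][j - 1] + 1,                             # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost), # substitution or match
            )
    return dist[n][m]
```

With unit costs, `min_edit_distance("intention", "execution")` evaluates to 5. Setting `sub_cost=2`, which treats a substitution as a deletion plus an insertion, gives 8 for the same pair.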