Basic Text Processing

Regular Expressions

Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively.

From Wikipedia[#regexwiki]_:

  Regular expressions are a formal language for specifying text strings. In general, regular expressions provide a flexible means to match strings of text. They are commonly abbreviated as regex or regexp.

=============  =====================================================
Metacharacter  Description
=============  =====================================================
.              Matches any character.
+              Matches the preceding element one or more times.
?              Matches the preceding element zero or one times.
*              Matches the preceding element zero or more times.
{M,N}          Denotes the minimum (M) and maximum (N) match count.
[...]          Denotes a set of possible character matches.
|              Separates alternative possibilities.
^              Matches the beginning of a line.
$              Matches the end of a line.
=============  =====================================================

============  ==========================
Regex         Matches
============  ==========================
[Aa]mor       amor, Amor
[1234567890]  Any digit, or simply [0-9]
============  ==========================
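A few of these metacharacters can be tried directly with ``grep -E`` (POSIX extended regular expressions); a quick sketch on made-up sample lines:

```shell
# Character class plus anchors: match only the whole lines "amor" or "Amor".
printf 'amor\nAmor\nclamor\n' | grep -E '^[Aa]mor$'
# ? makes the preceding element optional: matches both color and colour.
printf 'color\ncolour\n' | grep -E 'colou?r'
# + repeats the preceding element: keep only lines made entirely of digits.
printf '42\nabc\n' | grep -E '^[0-9]+$'
```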


Regular Expression Pocket Reference, 2nd Edition

Word Tokenization

Every task in NLP needs to do text normalization:

  1. Segmenting/tokenizing words in running text.
  2. Normalizing word formats.
  3. Segmenting sentences in running text.


Type
  An element of the vocabulary. The vocabulary is the set of types; its size is represented by \(|V|\).
Token
  An instance of a type in running text. The number of tokens in a text is represented by \(N\).
Corpus
  A data set of text.

  • Generally, in a text, #tokens >= #types, that is, \(N \ge |V|\)
  • Church and Gale (1990): \(|V| > O(N^{1/2})\)
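The token/type distinction can be illustrated with standard Unix tools on a tiny, arbitrary sample sentence:

```shell
# "to be or not to be": 6 tokens, but only 4 types (to, be, or, not).
s='to be or not to be'
printf '%s\n' $s | wc -l            # N   = number of tokens ($s unquoted on purpose, to word-split)
printf '%s\n' $s | sort -u | wc -l  # |V| = number of types
```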

Tokenizing, first steps

Using Unix tools :-). We use big.txt, which is not exactly the same file used in the class[#file]_:

$ curl -O

Replacing all non-alphabetic characters with a newline (\n) using tr, and displaying only the first 10 lines (head):

$ tr -sc 'A-Za-z' '\n' < big.txt | head

Sort the output:

$ tr -sc 'A-Za-z' '\n' < big.txt | sort | head

Merging upper and lower case:

$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

Sorting the counts:

$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
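If big.txt is not at hand, the same pipeline can be checked on a small inline sample (the sentence is made up; the relative order of words with equal counts may vary between sort implementations):

```shell
# Build a tiny sample file, then run the full count-and-rank pipeline on it.
printf 'The quick fox and the lazy fox\n' > sample.txt
tr 'A-Z' 'a-z' < sample.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
# "the" and "fox" appear twice each and rise to the top of the list.
```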

Issues in tokenization

Finland’s capital -> Finland, Finlands, Finland’s ?
I’m -> I am

Language issues

L’ensemble -> one token or two?
L?, L’?, Le?
Want l’ensemble to match with un ensemble.
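One common workaround is to split such clitics off before matching; a naive sed sketch (the clitic list in the character class is only illustrative, not a full treatment of French):

```shell
# Split apostrophe clitics such as l', d', j' into their own token,
# so that l'ensemble yields the token "ensemble", comparable with "un ensemble".
echo "l'ensemble" | sed "s/\([ldjtnms]'\)/\1 /g"
```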

Word Normalization and Stemming


Sentence Segmentation

You can get a shakes.txt from Project Gutenberg:
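A first, naive segmentation rule is to break after '.', '!' or '?'. It can be sketched with GNU sed (the \n in the replacement is a GNU extension), though it fails on abbreviations such as "Dr." and on numbers such as "4.3":

```shell
# Break after sentence-final punctuation followed by a space.
# Naive: "Dr. Smith" or "4.3" would be split incorrectly.
echo 'It was a dark night. Who goes there? Speak up!' |
  sed 's/\([.!?]\) /\1\n/g'
```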