Basic Text Processing

Regular Expressions

Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively.

—From Wikipedia[#regexwiki]_:

Regular expressions are a formal language for specifying text strings. In general, regular expressions provide a flexible means of matching strings of text. The term is commonly abbreviated as regex or regexp.

Metacharacter  Description
.              Match any single character.
+              Match the preceding pattern element one or more times.
?              Match the preceding pattern element zero or one time.
*              Match the preceding pattern element zero or more times.
{M,N}          Match the preceding pattern element at least M and at most N times.
[...]          Denotes a set of possible character matches.
|              Separates alternative possibilities.
^              Start of line.
$              End of line.

Regex          Matches
[Aa]           amor, Amor
[1234567890]   Any digit, or simply [0-9]
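
The metacharacters above can be tried out with grep -E, which uses extended regular expressions. This is only a small sketch: the sample lines are made up for illustration and are not part of the class material.

```shell
# Sample lines to match against (invented for this illustration).
sample=$(printf '%s\n' 'amor' 'Amor' 'color' 'colour' 'room 101' 'aaa')

# [Aa] matches either an upper or a lower case "a": prints amor and Amor.
printf '%s\n' "$sample" | grep -E '^[Aa]mor$'

# ? makes the preceding element optional: prints color and colour.
printf '%s\n' "$sample" | grep -E '^colou?r$'

# [0-9]+ matches one or more digits anywhere in the line: prints room 101.
printf '%s\n' "$sample" | grep -E '[0-9]+'

# {2,3} bounds the repetition count: a line of two or three a's, prints aaa.
printf '%s\n' "$sample" | grep -E '^a{2,3}$'

# | separates alternatives: prints amor and aaa.
printf '%s\n' "$sample" | grep -E '^(amor|aaa)$'
```

The ^ and $ anchors force the pattern to cover the whole line; without them grep reports any line that contains a match.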

Books

Regular Expression Pocket Reference, 2nd Edition
http://shop.oreilly.com/product/9780596514273.do

Word Tokenization

Every NLP task needs to do text normalization:

  1. Segmenting/tokenizing words in running text.
  2. Normalizing word formats.
  3. Segmenting sentences in running text.

Concepts

Type
An element of the vocabulary. The vocabulary (the set of types) is represented by \(V\), and its size by \(|V|\).
Token
An instance of a type in running text. The number of tokens is represented by \(N\).
Corpora
Data sets of text.
  • Generally, in a sentence, #tokens >= #types.
  • Church and Gale (1990): \(|V| \geq O(N^{1/2})\)
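
As a rough illustration of the difference between tokens and types, the sketch below counts both for one sentence, taking N as the number of tokens and |V| as the number of distinct types. The sentence and the variable names are made up for this example.

```shell
# A made-up sample sentence for illustration.
text='the cat sat on the mat and the dog sat too'

# One lowercase word per line: these are the tokens.
tokens=$(printf '%s\n' "$text" | tr 'A-Z' 'a-z' | tr -sc 'a-z' '\n')

# N: number of tokens (non-empty lines).
N=$(printf '%s\n' "$tokens" | grep -c .)

# |V|: number of types (distinct words).
V=$(printf '%s\n' "$tokens" | sort -u | grep -c .)

echo "tokens N = $N, types |V| = $V"   # prints: tokens N = 11, types |V| = 8
```

Here "the" accounts for three tokens but only one type, which is why N is larger than |V|.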

Tokenizing, first steps

Using Unix tools :-). We use big.txt, which is not exactly the same file used in the class[#file]_:

$ curl -O http://norvig.com/big.txt

Replace every run of non-alphabetic characters with a single newline (\n), and display only the first 10 lines (head):

$ tr -sc 'A-Za-z' '\n' < big.txt | head

Sort the output:

$ tr -sc 'A-Za-z' '\n' < big.txt | sort | head

Merging upper and lower case:

$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

Sorting the counts:

$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
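
To see what the whole pipeline produces without downloading anything, it can be wrapped in a small function and fed a short sample. This is only a sketch: word_freq is a hypothetical helper name, the sample sentence is invented, and the grep . step (which drops empty lines) is an addition not present in the commands above.

```shell
# word_freq (hypothetical helper): reads text on stdin and prints
# "count word" lines, most frequent first.
word_freq() {
  tr 'A-Z' 'a-z' | tr -sc 'a-z' '\n' | grep . | sort | uniq -c | sort -n -r
}

# On the real corpus, assuming big.txt is in the current directory:
# word_freq < big.txt | head -n 20

# On a tiny sample; awk normalizes the spacing of the uniq -c output.
top=$(printf 'The cat sat on the mat. The dog sat too.\n' \
  | word_freq | head -n 2 | awk '{print $2 ":" $1}')
printf '%s\n' "$top"
# prints:
# the:3
# sat:2
```

Words with the same count may come out in a different order depending on the sort implementation and locale, so only the two unambiguous top entries are shown.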

Issues in tokenization

Apostrophe
Finland’s capital -> Finland, Finlands, Finland’s?
I’m -> I am
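
The letters-only tr tokenizer makes this choice implicitly: it treats the apostrophe as a separator. The sketch below compares that with one possible alternative, keeping the apostrophe inside tokens. It assumes a plain ASCII apostrophe (') in the input; a typographic ’ is a multi-byte character and tr may handle it differently.

```shell
phrase="Finland's capital"

# Letters only: the apostrophe acts as a separator, so "s" becomes its own token.
split=$(printf '%s\n' "$phrase" | tr -sc 'A-Za-z' '\n')

# Letters plus apostrophe: "Finland's" stays a single token.
kept=$(printf '%s\n' "$phrase" | tr -sc "A-Za-z'" '\n')

printf '%s\n' "$split"   # Finland, s, capital (one per line)
printf '%s\n' "$kept"    # Finland's, capital (one per line)
```

Neither variant is correct in general; which one is preferable depends on what the tokens are used for afterwards.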

Language issues

French
L’ensemble -> one token or two?
L?, L’?, Le?
We want l’ensemble to match un ensemble.
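
One crude way to get that match is to split the elided article off as its own token before counting. The sketch below is an assumption made for illustration, not a method from the class: it only handles l' written with an ASCII apostrophe, and it would also split any other word that happens to contain l'.

```shell
# Two made-up input lines: one with the elided article, one without.
french=$(printf "L'ensemble\nun ensemble\n")

# Lowercase, put a space after l' so the article becomes a separate token,
# then break the text into one token per line.
french_tokens=$(printf '%s\n' "$french" \
  | tr 'A-Z' 'a-z' | sed "s/l'/l' /g" | tr -s ' ' '\n')

printf '%s\n' "$french_tokens"

# Both occurrences of the noun are now the same token.
ensemble_count=$(printf '%s\n' "$french_tokens" | grep -c -x 'ensemble')
echo "ensemble: $ensemble_count"   # prints: ensemble: 2
```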

Word Normalization and Stemming

Normalization

Sentence Segmentation

[1] http://en.wikipedia.org/wiki/Regular_expression#Formal_definition
[2] You can get shakes.txt from Project Gutenberg: http://www.gutenberg.org/ebooks/100