Basic Text Processing¶

Regular Expressions¶

"Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively." (From Wikipedia [1].)

Regular expressions are a formal language for specifying text strings. In general, a regular expression provides a flexible means of matching strings of text. The term is commonly abbreviated as regex or regexp.

Metacharacter	Description
`.`	Matches any single character.
`+`	Matches the preceding pattern element one or more times.
`?`	Matches the preceding pattern element zero or one time.
`*`	Matches the preceding pattern element zero or more times.
`{M,N}`	Matches the preceding pattern element at least M and at most N times.
`[...]`	Denotes a set of possible character matches.
`|`	Separates alternative possibilities.
`^`	Start of line.
`$`	End of line.
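As a rough illustration of the metacharacters in the table, here is a small sketch using Python's standard `re` module. The helper name `find_all` and the sample strings are made up for this example; other regex engines may differ in details.

```python
import re

def find_all(pattern, text, flags=0):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text, flags)

# .  any single character
print(find_all(r"c.t", "cat cot ct"))            # ['cat', 'cot']
# +  one or more of the preceding element
print(find_all(r"ab+", "a ab abb"))              # ['ab', 'abb']
# ?  zero or one of the preceding element
print(find_all(r"colou?r", "color colour"))      # ['color', 'colour']
# *  zero or more of the preceding element
print(find_all(r"ab*", "a ab abb"))              # ['a', 'ab', 'abb']
# {M,N}  between M and N repetitions
print(find_all(r"[0-9]{2,3}", "1 22 333 4444"))  # ['22', '333', '444']
# |  alternation
print(find_all(r"cat|dog", "cat dog bird"))      # ['cat', 'dog']
# ^ and $  start and end of line (per line with re.MULTILINE)
lines = "one two\nthree four"
print(find_all(r"^[a-z]+", lines, re.MULTILINE))  # ['one', 'three']
print(find_all(r"[a-z]+$", lines, re.MULTILINE))  # ['two', 'four']
```

Note that `{2,3}` is greedy: on `4444` it consumes three digits and leaves a lone `4`, which is too short to match again.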

regex	matches
[Aa]mor	amor, Amor
[1234567890]	Any digit, or simply [0-9]
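The character sets above can be tried out with a short sketch in Python's `re` module. The function names `is_amor` and `digits_in` are hypothetical helpers introduced only for this illustration.

```python
import re

AMOR = re.compile(r"[Aa]mor")   # first letter may be 'A' or 'a'
DIGIT = re.compile(r"[0-9]")    # range shorthand for [1234567890]

def is_amor(word):
    """True if the whole word is 'amor' or 'Amor'."""
    return AMOR.fullmatch(word) is not None

def digits_in(text):
    """Return every single digit found in text."""
    return DIGIT.findall(text)

print(is_amor("amor"), is_amor("Amor"), is_amor("AMOR"))  # True True False
print(digits_in("Room 101, floor 8"))                      # ['1', '0', '1', '8']
```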

Books¶

Regular Expression Pocket Reference, 2nd Edition: http://shop.oreilly.com/product/9780596514273.do

Word Tokenization¶

Every NLP task needs to do text normalization:

Segmenting/tokenizing words in running text.
Normalizing word formats.
Segmenting sentences in running text.

Concepts¶

Type: An element of the vocabulary. The vocabulary (the set of types) is represented by $V$, and its size by $|V|$.
Token: An instance of a type in running text. The number of tokens is represented by $N$.
Corpus (plural corpora): A data set of text.

In general, #tokens >= #types, that is, $N \geq |V|$.
Church and Gale (1990): $|V| \geq O(N^{1/2})$
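To make the type/token distinction concrete, here is a minimal sketch that counts both for a toy sentence. The function name `tokens_and_types` and the letters-only tokenizer are assumptions of this example, not part of the class material.

```python
import re

def tokens_and_types(text):
    """Split text into alphabetic tokens and collect the distinct types."""
    tokens = re.findall(r"[A-Za-z]+", text)  # every word occurrence
    types = set(tokens)                      # distinct words: the vocabulary V
    return tokens, types

tokens, vocab = tokens_and_types(
    "the cat sat on the mat because the mat was warm"
)
N = len(tokens)      # number of tokens
V_size = len(vocab)  # |V|, the vocabulary size
print(N, V_size)     # 11 8
```

Here "the" occurs three times and "mat" twice, so the 11 tokens collapse to 8 types.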

Tokenizing, first steps¶

Using Unix tools :-). We use big.txt, which is not exactly the same file as the one used in the class [2]:

$ curl -O http://norvig.com/big.txt

Replace every run of non-alphabetic characters with a single newline (\n), and display only the first 10 lines (head):

$ tr -sc 'A-Za-z' '\n' < big.txt | head

Sort the output:

$ tr -sc 'A-Za-z' '\n' < big.txt | sort | head

Merging upper and lower case, and counting the occurrences of each word:

$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

Sorting the counts:

$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
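For comparison, the same lowercase / split / count / sort steps can be approximated in Python. This is only a sketch: `word_counts` and `top_words` are names invented here, and the result may differ slightly from the `tr` pipeline on non-ASCII input.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split on non-letters, and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def top_words(text, n=10):
    """Return the n most frequent words with their counts, highest first."""
    return word_counts(text).most_common(n)

sample = "The cat and the hat. The END"
print(top_words(sample, 1))  # [('the', 3)]

# To run on the downloaded file (assumes big.txt is in the current directory):
#     with open("big.txt", encoding="utf-8", errors="replace") as f:
#         print(top_words(f.read()))
```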

Issues in tokenization¶

Apostrophe: Finland’s capital -> Finland, Finlands, or Finland’s?

I’m -> I am
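The apostrophe question can be seen by comparing two tokenizers. This is a hedged sketch: both regular expressions and the one-entry `CONTRACTIONS` table are illustrative choices made for this example, not a recommended tokenizer.

```python
import re

def split_on_non_letters(text):
    """Treat every non-letter, including apostrophes, as a separator."""
    return re.findall(r"[A-Za-z]+", text)

def keep_internal_apostrophes(text):
    """Keep apostrophes that sit between letters inside a token."""
    return re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)

# Hypothetical, deliberately tiny expansion table for illustration only.
CONTRACTIONS = {"I'm": "I am"}

def expand(tokens):
    """Replace known contractions by their expanded form."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok, tok).split())
    return out

print(split_on_non_letters("Finland's capital"))       # ['Finland', 's', 'capital']
print(keep_internal_apostrophes("Finland's capital"))  # ["Finland's", 'capital']
print(expand(keep_internal_apostrophes("I'm here")))   # ['I', 'am', 'here']
```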

Language issues¶

French: L’ensemble -> one token or two?

L?, L’?, Le?

We want l’ensemble to match un ensemble.
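One way to get that match is to split the elided article from the noun. The sketch below assumes a naive rule (a single letter followed by an apostrophe is a separate token); the name `split_elision` is hypothetical, and the rule ignores longer elisions and would also split English forms such as I'm.

```python
import re

# Naive assumption: one leading letter plus an apostrophe is an elided article.
ELISION = re.compile(r"([A-Za-z]['’])(.+)")

def split_elision(token):
    """Split "l'ensemble" into ["l'", "ensemble"]; leave other tokens alone."""
    m = ELISION.fullmatch(token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

print(split_elision("L'ensemble"))  # ["L'", 'ensemble']
print(split_elision("ensemble"))    # ['ensemble']

# After splitting, both phrases share the token "ensemble".
shared = set(split_elision("l'ensemble")) & set("un ensemble".split())
print(shared)                       # {'ensemble'}
```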

Word Normalization and Stemming¶

Normalization¶

Sentence Segmentation¶

[1]	http://en.wikipedia.org/wiki/Regular_expression#Formal_definition

[2]	You can get shakes.txt from Project Gutenberg: http://www.gutenberg.org/ebooks/100
