Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively.
From Wikipedia [#regexwiki]_.
Regular expressions are a formal language for specifying text strings. In general, regular expressions provide a flexible means of matching strings of text. The term is commonly abbreviated as regex or regexp.
Metacharacter | Description |
---|---|
. | Match any character. |
+ | Match the preceding pattern element one or more times. |
? | Match the preceding pattern element zero or one time. |
* | Match the preceding pattern element zero or more times. |
{M,N} | Match the preceding pattern element at least M and at most N times. |
[...] | Denotes a set of possible character matches. |
| | Separates alternate possibilities. |
^ | Start of line. |
$ | End of line. |
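As a rough illustration of the metacharacters in the table, the sketch below runs a few extended regular expressions through `grep -E`. The word list and the `show` helper are made up for this example and do not come from big.txt.

```shell
# Hypothetical sample words, one per line (not taken from big.txt).
words='color
colour
colouur
cat
cut
ct'

# show PATTERN: print the sample words matched by an extended regex.
# (Illustrative helper, not a standard command.)
show() {
  printf '%s\n' "$words" | grep -E "$1"
}

show '^colou?r$'       # ? : zero or one "u"   -> color, colour
show '^colou+r$'       # + : one or more "u"   -> colour, colouur
show '^colou*r$'       # * : zero or more "u"  -> color, colour, colouur
show '^colou{2,3}r$'   # {M,N} : 2 to 3 "u"    -> colouur
show '^c.t$'           # . : any one character -> cat, cut
show '^(cat|ct)$'      # | : alternatives      -> cat, ct
```

The `^` and `$` anchors keep each pattern from matching inside a longer word; without them `grep` reports any line that contains a match.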
regex | matches |
---|---|
[Aa]mor | amor, Amor |
[0123456789] | Any digit, or simply [0-9] |
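A character set can be tried out the same way. This is a minimal sketch with `grep -E`; the sample lines are invented for the example.

```shell
# Hypothetical sample lines.
lines='amor
Amor
humor
room 101'

# [Aa] accepts either case for the first letter.
printf '%s\n' "$lines" | grep -E '[Aa]mor'   # -> amor, Amor

# [0-9] is a range: any line containing at least one digit.
printf '%s\n' "$lines" | grep -E '[0-9]'     # -> room 101
```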
Every NLP task needs to do text normalization.
Using Unix tools :-) we work with big.txt, which is not exactly the same file used in the class [#file]_:
$ curl -O http://norvig.com/big.txt
Replace every run of non-alphabetic characters with a newline (\n), and display only the first 10 lines (head):
$ tr -sc 'A-Za-z' '\n' < big.txt | head
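To see what the two `tr` flags do without downloading the file, here is a small sketch on a one-line sample. The sample sentence is an assumption standing in for big.txt.

```shell
# Hypothetical one-line sample standing in for big.txt.
sample='The cat sat; the Cat ran. THE END'

# -c complements the set, so tr acts on everything that is NOT a letter;
# -s squeezes each run of replaced characters into a single newline.
printf '%s\n' "$sample" | tr -sc 'A-Za-z' '\n'
```

Each word ends up on its own line, and the punctuation plus the space that follows it collapse into one newline rather than leaving empty lines.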
Sort the output:
$ tr -sc 'A-Za-z' '\n' < big.txt | sort | head
Merging upper and lower case:
$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
Sorting the counts:
$ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
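The whole pipeline can be wrapped in a function and checked on a small input. This is only a sketch: `word_counts` is a name chosen for this example, and the sample sentence is made up.

```shell
# word_counts: read text on stdin, print "count word" pairs,
# most frequent first. (Illustrative name, not a standard tool.)
word_counts() {
  tr 'A-Z' 'a-z' |           # fold upper case into lower case
    tr -sc 'A-Za-z' '\n' |   # one word per line
    sort |                   # group identical words together
    uniq -c |                # collapse each group into a count
    sort -n -r               # order numerically, largest count first
}

# Hypothetical sample; with big.txt you would run: word_counts < big.txt | head
printf '%s\n' 'The cat sat; the Cat ran. THE END' | word_counts | head -n 2
```

On this sample the two most frequent words are "the" (3 occurrences) and "cat" (2). `uniq -c` only merges adjacent duplicates, which is why the first `sort` has to come before it.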
[1] | http://en.wikipedia.org/wiki/Regular_expression#Formal_definition |
[2] | You can get shakes.txt from Project Gutenberg: http://www.gutenberg.org/ebooks/100 |