Tuesday, September 21, 2010

MySQL and REGEX

"A regular expression is a powerful way of specifying a pattern for a complex search." (Excerpt from MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/regexp.html)

The financial institutions use letters to identify the months of the year in the following way:

F - January
G - February
H - March
J - April
K - May
M - June
N - July
Q - August
U - September
V - October
X - November
Z - December

So you can have a date in the form F10 which identifies January 2010 or J11 which identifies April 2011.

Let say you have a prices table in a MySQL database and you want to check that the values in the field contract in that table are correct according to the following rules:

- The first character in the field should be a letter valid for a month (F - January, G - February, etc).
- The second and third chars in the field should be a decimal digit (0..9).

You need to specify a query in SQL for MySQL that identifies all the records that have a wrong contract according to the above rules.

There are several ways of doing this in MySQL, one could be a query like the following:

SELECT * FROM prices
WHERE substr(contract, 1, 1) NOT IN ('F', 'G', 'H', 'J', 'K', 'M', 'N', 'Q', 'U', 'V', 'X', 'Z')
OR substr(contract, 2, 1) NOT IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
OR substr(contract, 3, 1) NOT IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9);

Other query could be:

SELECT * FROM prices
WHERE substr(contract, 1, 1) NOT IN ('F', 'G', 'H', 'J', 'K', 'M', 'N', 'Q', 'U', 'V', 'X', 'Z')
OR substr(contract, 2, 1) NOT BETWEEN 0 AND 9
OR substr(contract, 3, 1) NOT BETWEEN 0 AND 9;

As you can see the queries are long so here is where you can use the power of REGEX (Regular Expressions) that MySQL implements as an extension to the SQL language.

Using REGEX the query will be:

SELECT * FROM prices
WHERE contract NOT REGEXP '^[FGHJKMNQUVXZ][0-9]{2}';

The REGEX expression is negated (NOT REGEXP) so the query is going to return values in the field contract of the prices table that don't have at the beginning (the symbol ^) a letter F or G or H, etc. (the part which is specified between square brackets, which is called a class) and 2 digits (the class [0-9] with the number of occurrences, part {2}).

As you can see REGEX provides a concise and flexible way for matching strings of text against a certain pattern.

Have a good day.

--

J. E. Aneiros
GNU/Linux User #190716 en http://counter.li.org
perl -e '$_=pack(c5,0105,0107,0123,0132,(1<<3)+2);y[A-Z][N-ZA-M];print;'
PK fingerprint: 5179 917E 5B34 F073 E11A  AFB3 4CB3 5301 4A80 F674

No comments: