## Using Regexps to Identify Integer Ranges

From: andrew cooke <andrew@...>

Date: Thu, 25 Jul 2013 18:08:00 -0400

This was a cute question on StackOverflow - can you use a regexp to identify
integers in some range?  For example, all values between 1234 and 4321?

Obviously you can, because you could use (1234|1235|...|4320|4321).  But I
(wrongly - see below) assumed that was impractical.  So I sat down to generate
something more compact by hand.

It turns out that the algorithm is pretty easy.  If you have a range like
123-599 split it into an edge case, 123-199, and a "block", 200-599.  The
block can be matched trivially, while the edge case can be handled as a common
prefix, 1, plus recursion on 23-99 (which has an edge of 23-29 and a block of
30-99).

The code is at http://stackoverflow.com/a/17840228/181772 and the generated
expressions are pretty good (if you read them, they "make sense"):

0 - 0     0
0 - 1     [0-1]
0 - 2     [0-2]
0 - 9     \d
00 - 10    (0\d|10)
00 - 11    (0\d|1[0-1])
000 - 101   (0\d\d|10[0-1])
1 - 1     1
1 - 2     [1-2]
1 - 9     [1-9]
01 - 10    (0[1-9]|10)
01 - 11    (0[1-9]|1[0-1])
001 - 101   (0(0[1-9]|[1-9]\d)|10[0-1])
000 - 123   (0\d\d|1([0-1]\d|2[0-3]))
111 - 123   1(1[1-9]|2[0-3])
123 - 222   (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
123 - 333   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3]))
123 - 444   (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4]))
000 - 321   ([0-2]\d{2}|3([0-1]\d|2[0-1]))
111 - 321   (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
222 - 321   (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1]))
321 - 333   3(2[1-9]|3[0-3])
321 - 444   (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4]))
123 - 321   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
111 - 121   1(1[1-9]|2[0-1])
121 - 222   (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
1234 - 4321  (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1])))
000 - 999   \d\d{2}

Then someone commented that (1234|1235|...|4320|4321) actually works just fine
- the string is smaller than memory available, which is all that matters.  And
that seems to be true for quite large numbers (it should compile down to a
compact DFA anyway, so it's only the input string size that is a worry).

So the following is a reasonable solution for many cases:

import re

def regexp(lo, hi):
fmt = '%%0%dd' % len(str(hi))
return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

When does the above fail (run out of memory)?  This is a table of the
expression size for various number ranges (approximately):

0-1  1
0-10  20
0-100  300
0-1000  4,000
0-10000  50,000
0-100000  600,000
0-1000000  7,000,000
0-10000000  80,000,000
0-100000000  900,000,000
0-1000000000  10,000,000,000
0-10000000000  110,000,000,000
0-100000000000  1,200,000,000,000

so 9 digit numbers require about 1GB of memory.

Andrew