How to implement a verbose REGEX in Python
I am trying to use a verbose regular expression in Python (2.7). If it matters, I am trying to make it easier to go back and more clearly understand the expression sometime in the future. Because I am new to this, I first created a compact expression to make sure I was getting what I wanted.
Here is the compact expression:
    test_verbose_item_pattern = re.compile('\n{1}\b?I[tT][eE][mM]\s+\d{1,2}\.?\(?[a-e]?\)?.*[^0-9]\n{1}')
It works as expected.
Here is the verbose expression:
    verbose_item_pattern = re.compile("""
    \n{1}         # begin with a new line; allow only 1 newline character
    \b?           # allow a word boundary; the ? allows 0 or 1 word boundaries: \nITEM or \n ITEM
    I             # the first word on the line must begin with a capital I
    [tT][eE][mM]  # then we need 1 character from each of the 3 sets; this allows for unknown case
    \s+           # 1 or more white spaces; this does allow \n, not sure if I should change it
    \d{1,2}       # require 1 or 2 digits
    \.?           # there could be 0 or 1 periods after the digits: 1. or 1
    \(?           # there might be 0 or 1 instance of an open paren
    [a-e]?        # there could be 0 or 1 instance of a letter in the range a-e
    \)?           # there could be 0 or 1 instance of a closing paren
    .*            # any number of unknown characters, so we can have words and punctuation
    [^0-9]        # by its placement I am hoping I am stating that I do not want to allow strings that end with a number and then \n
    \n{1}         # I want to cut it off at the next newline character
    """,re.VERBOSE)
The problem is that when I run the verbose pattern I get an exception:
    Traceback (most recent call last):
      File "c:/users/dropbox/directedgar-code-examples/newitemidentifier.py", line 17, in <module>
        """,re.VERBOSE)
      File "c:\python27\lib\re.py", line 190, in compile
        return _compile(pattern, flags)
      File "c:\python27\lib\re.py", line 242, in _compile
        raise error, v # invalid expression
    error: nothing to repeat
I am afraid this is going to be something silly, but I can't figure it out. I did take my verbose expression and compact it line by line to make sure the compact version was the same as the verbose one.
The error message states there is nothing to repeat?
It is a good habit to use raw string literals when defining regex patterns. A lot of regex patterns use backslashes, and a raw string literal lets you write single backslashes instead of having to worry about whether Python will interpret a backslash as having a different meaning (and having to use 2 backslashes in those cases).
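As a small sketch of why the raw string matters here (the sample strings are made up for illustration): in a normal string, \b is a backspace character rather than a word boundary, and a non-raw \n inside a re.VERBOSE pattern becomes a real newline that is skipped as whitespace. The second half suggests the leading \n{1} in the question's non-raw pattern can by itself produce the "nothing to repeat" error, though that is an observation added here, not something the original traceback confirms.

```python
import re

# In a normal string '\b' is one character (backspace, ASCII 8);
# in a raw string r'\b' is a backslash plus 'b', which re reads as a word boundary.
plain = '\bItem'
raw = r'\bItem'
print(len(plain), len(raw))

plain_match = re.search(plain, 'an Item 1')  # looks for a literal backspace
raw_match = re.search(raw, 'an Item 1')      # looks for a word boundary
print(plain_match is None, raw_match is not None)

# With re.VERBOSE, a non-raw '\n' is a real newline, which is ignored as
# whitespace, so the {1} that follows has nothing left to repeat.
try:
    re.compile("\n{1}", re.VERBOSE)
    non_raw_error = None
except re.error as exc:
    non_raw_error = str(exc)
print(non_raw_error)

# The raw version keeps backslash + 'n', which re interprets as a newline.
raw_newline = re.compile(r"\n{1}", re.VERBOSE)
print(raw_newline.search("a\nb") is not None)
```

Writing every pattern as a raw string avoids both surprises at once.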
\b? is not valid regex. It says "0-or-1 word boundaries", but either you have a word boundary or you don't. If you have a word boundary, then you have 1 word boundary. If you don't have a word boundary, then you have 0 word boundaries. So \b? (if it were valid regex) would always be true.

Regex makes a distinction between the end of a string and the end of a line. (A string may consist of multiple lines.)

\A matches only the start of a string.
\Z matches only the end of a string.
$ matches the end of a string, and also the end of a line in re.MULTILINE mode.
^ matches the start of a string, and also the start of a line in re.MULTILINE mode.
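Both points above can be checked with a short sketch. The helper compile_error and the two-line sample text are hypothetical names introduced only for this illustration; the re calls themselves are standard.

```python
import re

def compile_error(pattern):
    """Return the re.error message for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A quantifier on the zero-width \b is rejected; a bare \b compiles.
optional_boundary = compile_error(r'\b?Item')
plain_boundary = compile_error(r'\bItem')
print(optional_boundary, plain_boundary)

text = "first line\nsecond line"  # made-up sample with two lines

# ^ and $ gain extra match positions under re.MULTILINE; \A and \Z do not.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)
ends_multiline = re.findall(r'\w+$', text, re.MULTILINE)
starts_string = re.findall(r'\A\w+', text, re.MULTILINE)
ends_string = re.findall(r'\w+\Z', text, re.MULTILINE)

print(starts_default, starts_multiline, ends_multiline)
print(starts_string, ends_string)
```

One detail the list does not mention: even without re.MULTILINE, $ also matches just before a newline that ends the string, which is why \Z is the stricter choice when only the true end of the string should match.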
    import re

    verbose_item_pattern = re.compile(r"""
        $             # end of line boundary
        \s{1,2}       # 1-or-2 whitespace characters, including the newline
        I             # a capital I
        [tT][eE][mM]  # 1 character from each of the 3 sets; this allows for unknown case
        \s+           # 1-or-more whitespaces, including newline
        \d{1,2}       # 1-or-2 digits
        [.]?          # 0-or-1 literal .
        \(?           # 0-or-1 literal open paren
        [a-e]?        # 0-or-1 letter in the range a-e
        \)?           # 0-or-1 closing paren
        .*            # any number of unknown characters, so we can have words and punctuation
        [^0-9]        # anything but [0-9]
        $             # end of line boundary
        """, re.VERBOSE|re.MULTILINE)

    x = verbose_item_pattern.search("""
    Item 1.0(a) foo bar
    """)

    print(x)
yields

    <_sre.SRE_Match object at 0xb76dd020>

(indicating that there is a match)
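Printing the match object only confirms that something matched. As a rough sketch of how the matched text might be pulled out, the block below restates the pattern in condensed form and runs finditer over a made-up sample; the names item_pattern, sample, and headings are hypothetical and not part of the original answer.

```python
import re

# Condensed restatement of the verbose pattern so the block runs on its own.
item_pattern = re.compile(r"""
    $               # end of the previous line
    \s{1,2}         # 1 or 2 whitespace characters (covers the newline)
    I[tT][eE][mM]   # "Item" with a capital I, remaining letters in any case
    \s+             # whitespace before the item number
    \d{1,2}         # 1 or 2 digits
    [.]?            # optional period
    \(?[a-e]?\)?    # optional sub-item such as (a)
    .*              # rest of the heading
    [^0-9]          # last character must not be a digit
    $               # end of line
    """, re.VERBOSE | re.MULTILINE)

# Hypothetical sample text, invented for illustration only.
sample = "\nItem 1. Business\nSome text\nItem 2(a) Properties\nitem 3 is lower case"

# group() includes the leading newline consumed by \s{1,2}, so strip it.
headings = [m.group().strip() for m in item_pattern.finditer(sample)]
print(headings)
```

The last sample line is skipped because the pattern requires a capital I at the start of the heading.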