How to implement a verbose REGEX in Python


I am trying to use a verbose regular expression in Python (2.7). If it matters, I am just trying to make it easier to go back and more clearly understand the expression sometime in the future. Because I am new to this, I first created a compact expression to make sure I was getting what I wanted.

Here is the compact expression:

    test_verbose_item_pattern = re.compile('\n{1}\b?I[tT][eE][mM]\s+\d{1,2}\.?\(?[a-e]?\)?.*[^0-9]\n{1}')

It works as expected.

Here is the verbose expression:

    verbose_item_pattern = re.compile("""
    \n{1}         # begin with a new line; allow only 1 newline character
    \b?           # allow for a word boundary; the ? allows 0 or 1 word boundaries: \nITEM or \n  ITEM
    I             # the first word on the line must begin with a capital I
    [tT][eE][mM]  # then we need 1 character from each of the 3 sets; this allows for unknown case
    \s+           # 1 or more white spaces; this does allow for another \n, not sure if I should change it
    \d{1,2}       # require 1 or 2 digits
    \.?           # there could be 0 or 1 periods after the digits: 1. or 1
    \(?           # there might be 0 or 1 instance of an open paren
    [a-e]?        # there could be 0 or 1 instance of a letter in the range a-e
    \)?           # there could be 0 or 1 instance of a closing paren
    .*            # any number of unknown characters, so we can have words and punctuation
    [^0-9]        # by its placement I am hoping that I am stating that I do not want to allow strings that end with a number and then \n
    \n{1}         # I want to cut it off at the next newline character
    """,re.VERBOSE)

The problem is that when I run the verbose pattern I get an exception:

    Traceback (most recent call last):
      File "c:/users/dropbox/directedgar-code-examples/newitemidentifier.py", line 17, in <module>
        """,re.VERBOSE)
      File "c:\python27\lib\re.py", line 190, in compile
        return _compile(pattern, flags)
      File "c:\python27\lib\re.py", line 242, in _compile
        raise error, v # invalid expression
    error: nothing to repeat

I am afraid this is going to be something silly, but I can't figure it out. I did take the verbose expression and compact it line by line to make sure the compact version was the same as the verbose one.

The error message states that there is nothing to repeat?

  • It is a good habit to use raw string literals when defining regex patterns. A lot of regex patterns use backslashes, and a raw string literal lets you write single backslashes instead of having to worry about whether Python will interpret a backslash as something else (and having to use 2 backslashes in those cases).

  • \b? is not valid regex. It says 0-or-1 word boundaries, but either you have a word boundary or you don't. If you have a word boundary, you have 1 word boundary. If you don't, you have 0 word boundaries. So \b? (if it were valid regex) would always be true.

  • Regex makes a distinction between the end of a string and the end of a line. (A string may consist of multiple lines.)

    • \A matches only the start of a string.
    • \Z matches only the end of a string.
    • $ matches the end of a string, and also the end of a line in re.MULTILINE mode.
    • ^ matches the start of a string, and also the start of a line in re.MULTILINE mode.
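The anchor behaviour listed above can be checked with a short sketch. The sample string and the variable names are made up for illustration; only standard re calls are used.

```python
import re

# Hypothetical two-line sample string, used only for illustration.
text = "Item 1\nItem 2"

# Without re.MULTILINE, ^ matches only at the start of the string.
start_default = re.findall(r"^Item \d", text)

# With re.MULTILINE, ^ also matches at the start of every line.
start_multiline = re.findall(r"^Item \d", text, re.MULTILINE)

# \A matches only at the start of the string, even in MULTILINE mode.
start_string_only = re.findall(r"\AItem \d", text, re.MULTILINE)

# $ matches at the end of every line in MULTILINE mode,
# while \Z matches only at the very end of the string.
end_multiline = re.findall(r"\d$", text, re.MULTILINE)
end_string_only = re.findall(r"\d\Z", text, re.MULTILINE)

print(start_default)      # ['Item 1']
print(start_multiline)    # ['Item 1', 'Item 2']
print(start_string_only)  # ['Item 1']
print(end_multiline)      # ['1', '2']
print(end_string_only)    # ['2']
```

This is why a pattern meant to find a heading on any line of a longer document would usually pair ^ or $ with re.MULTILINE rather than matching \n explicitly.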

    import re

    verbose_item_pattern = re.compile(r"""
        $            # end of line boundary
        \s{1,2}      # 1-or-2 whitespace characters, including the newline
        I            # a capital I
        [tT][eE][mM] # 1 character from each of the 3 sets; this allows for unknown case
        \s+          # 1-or-more whitespaces INCLUDING newline
        \d{1,2}      # 1-or-2 digits
        [.]?         # 0-or-1 literal .
        \(?          # 0-or-1 literal open paren
        [a-e]?       # 0-or-1 letter in the range a-e
        \)?          # 0-or-1 closing paren
        .*           # any number of unknown characters, so we can have words and punctuation
        [^0-9]       # anything but [0-9]
        $            # end of line boundary
        """, re.VERBOSE|re.MULTILINE)

    x = verbose_item_pattern.search("""
     Item 1.0(a) foo bar
    """)

    print(x)

yields

    <_sre.SRE_Match object at 0xb76dd020>

(indicating there is a match)
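The "nothing to repeat" error can also be reproduced in isolation. The sketch below is one reading of where the failure comes from, not something stated in the original post: in a non-raw triple-quoted string Python turns \n into a real newline, re.VERBOSE then ignores that whitespace, and {1} is left with nothing in front of it. With a raw string, \b? is rejected for the reason given in the answer. The helper name compile_error is hypothetical.

```python
import re

def compile_error(pattern, flags=0):
    """Hypothetical helper: return the re.error message, or None if the pattern compiles."""
    try:
        re.compile(pattern, flags)
    except re.error as exc:
        return str(exc)
    return None

# Non-raw string: Python converts \n to a real newline before re sees it.
# re.VERBOSE skips that whitespace, so {1} has nothing to repeat.
non_raw = compile_error("\n{1}", re.VERBOSE)

# Raw string: re receives the two characters \ and n and treats them
# as a newline escape, so the quantifier has something to apply to.
raw = compile_error(r"\n{1}", re.VERBOSE)

# A word boundary is a zero-width assertion, so it cannot take a quantifier.
boundary = compile_error(r"\b?")

print(non_raw)
print(raw)
print(boundary)
```

The exact wording of the message can differ between Python versions (newer ones append a position), so the sketch prints the messages rather than comparing them against a fixed string.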

