
Python | Regular expressions in one article

Editor: Python

Contents

The role of regular expressions
re module basic usage
    1. match and search: finding the first match
    re module basic usage: raw strings
    re module basic usage: match objects
    re module basic usage: findall and finditer
    Regular expression substitution
    re module basic usage: compile
Basic regular expressions
    1. Character class []: ranges follow character-code order
    2. Negated character class [^...]
    3. Alternation: |
    4. The "." wildcard: matches any character except \n
    5. Matching the start and end: ^ and $
Shorthand character classes
Repetition
    1. ? matches the previous item 0 or 1 times
    2. * matches the previous item any number of times (0 to n)
    3. + matches the previous item at least once
    4. {n}: n is a non-negative integer; matches exactly n times
    5. {n,}: n is a non-negative integer; matches at least n times
    6. {n,m} matches the previous item at least n and at most m times
Greedy and non-greedy modes
Grouping
    1. Capturing groups
    2. Backreferences to groups
    3. Non-capturing groups (?:regex)
    Example
    4. Named groups
Common flags
Inline flags
Assertions
    1. Zero-width positive lookahead assertion
    2. Zero-width negative lookahead assertion
    3. Zero-width positive lookbehind assertion
    4. Zero-width negative lookbehind assertion


The role of regular expressions

1. Filtering text (data mining)
      Specify a matching rule, then check whether a larger text contains strings that fit the rule.
2. Validation
      Use a regular expression to confirm that the data obtained has the expected form.

Advantages and disadvantages of regular expressions
• Advantages: higher productivity and less code
• Disadvantages: complex and hard to read

re module basic usage

1. match and search: finding the first match

re.search
• Finds a match anywhere in the string
• Takes a regular expression and a string, and returns the first match found.
• If no match is found at all, re.search returns None

>>> import re
>>> rest=re.search(r'sanle','hello sanle')
>>> print(rest)
<_sre.SRE_Match object; span=(6, 11), match='sanle'>
>>> type(rest)
<class '_sre.SRE_Match'>

re.match
• Finds a match at the start of the string
• Takes a regular expression and a string, tries to match from the first character of the string, and returns the match found there.
• If the string does not begin with text that fits the pattern, the match fails and re.match returns None
>>> rest=re.match(r'sanle','hello sanle')
>>> print(rest)
None
>>> type(rest)
<class 'NoneType'>
>>> rest=re.match(r'sanle','sanle sanle hello sanle')
>>> print(rest)
<_sre.SRE_Match object; span=(0, 5), match='sanle'>
>>> type(rest)
<class '_sre.SRE_Match'>
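
The difference between the two functions can be put side by side in a short sketch. re.fullmatch is not covered in the text above; it is included here only for comparison. Because a failed match returns None, the result should be tested before calling methods on it.

```python
import re

text = "hello sanle"

# re.search scans the whole string and returns the first match.
found = re.search(r"sanle", text)

# re.match only tries at position 0, so it fails for this text.
at_start = re.match(r"sanle", text)

# re.fullmatch succeeds only if the pattern covers the entire string.
whole = re.fullmatch(r"hello sanle", text)

# A failed match is None, so check before using the result.
for name, m in [("search", found), ("match", at_start), ("fullmatch", whole)]:
    if m is not None:
        print(name, "->", m.group(), m.span())
    else:
        print(name, "-> no match")
```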

re module basic usage: raw strings

The r in r'sanle' stands for raw (raw string).

• Unlike a normal string, a raw string does not treat \ as the start of an escape sequence

• Writing regular expressions as raw strings is common and useful

>>> rest=re.search('\\tsanle','hello\\tsanle')
>>> print(rest)
None
>>> rest=re.search(r'\\tsanle','hello\\tsanle')
>>> print(rest)
<_sre.SRE_Match object; span=(5, 12), match='\\tsanle'>
>>> re.search('\\\\tsanle','hello\\\\tsanle')
<_sre.SRE_Match object; span=(6, 13), match='\\tsanle'>
>>> re.search(r'\\\\tsanle','hello\\\\tsanle')
<_sre.SRE_Match object; span=(5, 13), match='\\\\tsanle'>
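
A common place where the r prefix matters is \b. The following sketch (the sample text is made up for illustration) shows that in a normal string "\b" is the backspace character, while in a raw string it reaches the regex engine as the word-boundary token.

```python
import re

# These two spellings produce the same four-character pattern: \d\d
same = ("\\d\\d" == r"\d\d")

# Without the r prefix, "\b" is a backspace character (ASCII 8),
# not the regex word-boundary token.
plain = "\bcat\b"    # 5 characters: backspace, c, a, t, backspace
raw = r"\bcat\b"     # 7 characters: \, b, c, a, t, \, b

text = "a cat sat"
plain_result = re.findall(plain, text)  # looks for literal backspaces
raw_result = re.findall(raw, text)      # looks for the whole word "cat"

print(same, len(plain), len(raw))
print(plain_result)
print(raw_result)
```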

re module basic usage: match objects

match.group(n): returns the matched string; n defaults to 0.

• It is called group because a regular expression can be divided into subgroups, each of which captures only part of the match.

• 0 is the default argument and stands for the entire match; n stands for the n-th group

match.start()

• start returns the index in the original string where the match begins

match.end()

• end returns the index in the original string where the match ends (one past the last matched character)

match.groups()

• groups returns a tuple containing the strings of all groups, from group 1 up to the last group

>>> msg="It's rainning cats and dogs"
>>> match=re.search(r'cats',msg)
>>> print(match)
<_sre.SRE_Match object; span=(14, 18), match='cats'>
>>> print(match.group())
cats
>>> print(match.start())
14
>>> print(match.end())
18
>>> print(match.groups())
()
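
In the example above groups() is empty because the pattern has no parentheses. As a small sketch with the same message, adding two groups shows how group(n), groups(), span(), start(n) and end(n) relate to each other:

```python
import re

msg = "It's rainning cats and dogs"
m = re.search(r"(cats) and (dogs)", msg)

whole = m.group()        # the entire match
first = m.group(1)       # text of group 1
both = m.groups()        # all groups as a tuple
whole_span = m.span()    # (start, end) of the entire match
second_span = (m.start(2), m.end(2))  # position of group 2

print(whole, first, both)
print(whole_span, second_span)

# Slicing the original string with start/end gives the match back.
print(msg[m.start():m.end()] == whole)
```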

re module basic usage: findall and finditer

findall and finditer: finding multiple matches

re.findall

• Finds all matching strings and returns them as a list

re.finditer

• Finds all matches and returns an iterator of match objects

>>> rest=re.findall(r'sanle','hello sanle sanlee sanlee')
>>> print(rest)
['sanle', 'sanle', 'sanle']
>>> msg="It's rainning cats and dogs"
>>> re.findall('a',msg)
['a', 'a', 'a']
>>> re.finditer('a',msg)
<callable_iterator object at 0x7f06f13bc5f8>
# msg="aaaaaa"
# result=re.finditer("a",msg)
# for i in result:
#     print(i)
#     print(i.group())
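
Since finditer yields match objects rather than strings, it is the natural choice when positions are needed as well as text. A minimal sketch using the same message:

```python
import re

msg = "It's rainning cats and dogs"

# findall gives only the matched strings.
letters = re.findall(r"a", msg)

# finditer gives match objects, so the position of each hit is available.
positions = []
for m in re.finditer(r"a", msg):
    positions.append((m.start(), m.group()))

print(letters)
print(positions)
```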

Regular expression substitution

re.sub(pattern, replacement, string)
• Replaces every part of string that matches pattern with replacement

print(re.sub("python","Python","I am learning python3"))
print(re.sub("python","Python","I am learning python3 python"))
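
Beyond the basic form, re.sub accepts a count argument and a function as the replacement, and re.subn also reports how many replacements were made. The sketch below reuses the sentence from the example; the helper name shout is made up for illustration.

```python
import re

text = "I am learning python3 python"

# Replace every occurrence.
all_replaced = re.sub("python", "Python", text)

# Replace only the first occurrence.
first_only = re.sub("python", "Python", text, count=1)

# The replacement may be a function that receives the match object.
def shout(match):
    return match.group().upper()

upper = re.sub(r"python\d*", shout, text)

# re.subn returns the new string together with the number of replacements.
result, n = re.subn("python", "Python", text)

print(all_replaced)
print(first_only)
print(upper)
print(result, n)
```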

re module basic usage: compile

Features of compiled regular expressions:

• A complex regular expression can be reused.

• A compiled pattern is more convenient to use, because the pattern argument no longer has to be passed on each call.

• The re module caches the patterns it compiles on the fly, so in most cases compile brings little performance advantage

msg1="hello world"
msg2="i am learning python"
msg3="sanle"
print(re.findall("python",msg1))
print(re.findall("python",msg2))
print(re.findall("python",msg3))
reg = re.compile("python") # Compile regular expressions into objects
print(reg.findall(msg1))
print(reg.findall(msg2))
print(reg.findall(msg3))
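
A compiled pattern object carries its flags with it and exposes the same methods as the module-level functions. The sketch below (message list made up for illustration) also uses the optional start position that the pattern's search method accepts.

```python
import re

# Flags are given once, at compile time.
reg = re.compile(r"python", re.IGNORECASE)

messages = ["hello world", "I am learning Python", "sanle"]
hits = [msg for msg in messages if reg.search(msg)]

# The pattern text and flags can be inspected on the object.
pattern_text = reg.pattern
ignores_case = bool(reg.flags & re.IGNORECASE)

# Pattern.search takes an optional start position as its second argument.
second = reg.search("python python", 1).span()

print(hits)
print(pattern_text, ignores_case)
print(second)
```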

Basic regular expressions

1. Character class []: ranges follow character-code order

ret1=re.findall("python","Python on python")
print(ret1)
ret2=re.findall("[Pp]ython","Python on python")
print(ret2)
ret3=re.findall("[A-Za-z0-9-]","abc123ABCD--")
print(ret3)
ret4=re.findall("[a-zA-Z0-9-]","abc123ABCD--")
print(ret4)
ret5=re.findall(r"[A-z0-9\-]","abc123ABCD--\\")  # raw string keeps \- intact; A-z also spans the punctuation between Z and a
print(ret5)

The output is as follows

['python']
['Python', 'python']
['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C', 'D', '-', '-']
['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C', 'D', '-', '-']
['a', 'b', 'c', '1', '2', '3', 'A', 'B', 'C', 'D', '-', '-', '\\']
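
The last result includes the backslash because a range follows character-code order, and A-z covers the six punctuation characters that sit between Z and a in ASCII. A small sketch with a made-up sample string shows the difference from [A-Za-z]:

```python
import re

sample = "aZ_^[]9"

# A-z spans codes 65..122, which includes [ \ ] ^ _ and the backquote.
loose = re.findall(r"[A-z]", sample)

# Two separate ranges match letters only.
strict = re.findall(r"[A-Za-z]", sample)

print(loose)
print(strict)
```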

2. Negated character class [^...]

ret6=re.findall("[^A-Z]c","Ac111crc#c")
print(ret6)
ret7=re.findall("[^A-Z][0-9]","Ac121crc#c")
print(ret7)

The output is as follows

['1c', 'rc', '#c']
['c1', '21']

3. Alternation: |

msg="welcome to changsha,welcome to hunan"
rest=re.findall("changsha|hunan",msg)
print(rest)

  The output is as follows

['changsha', 'hunan']

4. The "." wildcard: matches any character except \n

rest2=re.findall("p.thon","Pythonpthon p thon p-thon p\nthon")
print(rest2)

  The output is as follows  

['p thon', 'p-thon']

5. Matching the start and end: ^ and $

rest3=re.findall("^python","python hello pyth3on1")
print(rest3)
rest4=re.findall("python$","pyth3on hello python")
print(rest4)

  The output is as follows  

['python']
['python']

Shorthand character classes

\d  matches a digit, i.e. 0-9
\D  matches a non-digit
\s  matches whitespace, i.e. space, tab, and similar characters
\S  matches a non-whitespace character
\w  matches a word character, i.e. a-z, A-Z, 0-9, _
\W  matches a non-word character
\A  matches the start of the string
\b  word boundary: matches the empty string, but only at the beginning or end of a word
\B  non-word boundary: matches the empty string, but not at the beginning or end of a word
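
A few of these shorthands in action, as a minimal sketch; the sample sentences are made up for illustration.

```python
import re

text = "Order 66: ship 3 cats_and_dogs to Room-12\tnow"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of word characters (note the _)
tokens = re.findall(r"\S+", text)   # runs of non-whitespace

# \b keeps "cat" from matching inside "concat" or "cats".
whole_word = re.findall(r"\bcat\b", "cat concat cats cat.")

print(digits)
print(words)
print(tokens)
print(whole_word)
```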

Repetition

1. ? matches the previous item 0 or 1 times

ret=re.findall("py?","python p pyy ps")
print(ret)

   The output is as follows

['py', 'p', 'py', 'p']

2. * matches the previous item any number of times (0 to n)

ret=re.findall("py*","python p pyy ps")
print(ret)

   The output is as follows

['py', 'p', 'pyy', 'p']

3. + matches the previous item at least once

ret=re.findall("py+","python p pyy ps")
print(ret)

   The output is as follows

['py', 'pyy']

4. {n}: n is a non-negative integer; matches exactly n times

ret=re.findall("py{2}","python p pyy ps pyyyy")
print(ret)

   The output is as follows

['pyy', 'pyy']

5. {n,}: n is a non-negative integer; matches at least n times

ret=re.findall("py{2,}","python p pyy ps pyyyy")
print(ret)

   The output is as follows

['pyy', 'pyyyy']

6. {n,m} matches the previous item at least n and at most m times

ret=re.findall("py{2,4}","python p pyy ps pyyyy")
print(ret)

   The output is as follows

['pyy', 'pyyyy']

Greedy and non-greedy modes

Greedy mode: *, +, ? and {n,m} are greedy by default; they match as long a string as possible
Non-greedy mode: adding ? after a quantifier makes it match as short a string as possible (+? *? ?? {2,4}?)

msg="helloooooo,I am sanchuang,123"
print(re.findall("lo{3,}",msg))
print(re.findall("lo{3,}?",msg))
print(re.findall("lo*?",msg))
print(re.findall("lo?",msg))
print(re.findall("lo??",msg))
msg="cats and dogs , cats1 and dog1"
print(re.findall("cats.*s",msg))
print(re.findall("cats.*?s",msg))

    The output is as follows

['loooooo']
['looo']
['l', 'l']
['l', 'lo']
['l', 'l']
['cats and dogs , cats']
['cats and dogs']
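
A typical case where the two modes differ is text with repeated delimiters. In this sketch (the HTML fragment is made up for illustration) the greedy pattern runs from the first < to the last >, while the non-greedy one stops at the nearest >.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so one long match results.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first > that lets the match succeed.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```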

Grouping

With groups you can retrieve not only the whole match but also each individual group. Groups are written with ().

1. Capturing groups

The group method of a match object takes 0 as its default argument, which returns the entire matched string.
An argument n (n>0) returns the content matched by the n-th group.
msg="tel:173-7572-2991"
ret=re.search(r"(\d{3})-(\d{4})-(\d{4})",msg)
# ret1=re.search(r"\d{3}-\d{4}-\d{4}",msg)
print(ret.groups())
print(ret.group())
print(ret.group(1))
print(ret.group(2))
print(ret.group(3))

    The output is as follows

('173', '7572', '2991')
173-7572-2991
173
7572
2991

2. Backreferences to groups

Capturing groups: the text matched by each group is kept in memory and numbered starting from 1.
A capturing group can therefore be referenced later in the pattern with \1, \2, and so on.
ret = re.search(r"(\d{3})-(\d{4})-\2","173-7572-7572")
print(ret.group())
ret = re.search(r"(\d{3})-(\d{4})-\1","173-7572-173")
print(ret.group())

     The output is as follows

173-7572-7572
173-7572-173
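
Backreferences are also handy for finding repeated text, and the same \1 syntax works in the replacement string of re.sub. A minimal sketch with a made-up sentence:

```python
import re

text = "this is is a test test of the the parser"

# (\w+) captures a word; \1 requires the same word again after a space.
doubled = re.findall(r"\b(\w+) \1\b", text)

# In the replacement, \1 inserts the captured word once.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(doubled)
print(fixed)
```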

3. Non-capturing groups (?:regex)

A non-capturing group only groups; its matched text is not stored, it gets no number, and it cannot be backreferenced. In the pattern below, \1 therefore refers to (\d{4}).
ret = re.search(r"(?:\d{3})-(\d{4})-\1","173-7572-7572")
print(ret.group(1))

     The output is as follows

7572

If the pattern contains capturing groups, findall returns only the content of those groups.
ret = re.findall(r"(?:\d{3})-(\d{4})-\1","173-7572-7572")
print(ret)

   The output is as follows

['7572']

Example

msg="[email protected]@[email protected]@163.com"
Find the email addresses at 126.com, qq.com, and 163.com.

Code implementation

msg="[email protected]@[email protected]@163.com"
print(re.findall(r"(?:\.com)?(\[email protected](?:126|qq|163)\.com)",msg))

The output is as follows

['[email protected]', '[email protected]', '[email protected]']
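
The addresses in the listing above are masked, so the following sketch substitutes made-up ones (tom, jerry, amy and bob are hypothetical) and assumes that the masked part of the pattern is \w+@. It shows why the optional non-capturing (?:\.com)? is useful when addresses are run together: it swallows the ".com" of a preceding unwanted address so that it does not become part of the next user name.

```python
import re

# Hypothetical addresses, concatenated without separators.
msg = "tom@sina.comjerry@126.comamy@qq.combob@163.com"

# Assumed reconstruction: \w+@ followed by one of three allowed domains.
with_prefix = re.findall(r"(?:\.com)?(\w+@(?:126|qq|163)\.com)", msg)

# Without the optional prefix, "com" from sina.com leaks into the name.
without_prefix = re.findall(r"(\w+@(?:126|qq|163)\.com)", msg)

print(with_prefix)
print(without_prefix)
```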

4. Named groups

import re
ret=re.search(r'(?P<first>\d{3})-\d{3}-(?P<last>\d{3})',"321-123-231")
print(ret.group())
print(ret.groups())
print(ret.groupdict())
ret=re.findall(r'(?P<first>\d{3})-\d{3}-(?P<last>\d{3})',"321-123-231")
print(ret)

  The output is as follows

321-123-231
('321', '231')
{'first': '321', 'last': '231'}
[('321', '231')]
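
Named groups can also be referenced by name: (?P=name) inside the pattern and \g<name> in a replacement string. The sketch below uses made-up sample strings to show both, along with lookup by name on the match object.

```python
import re

date_re = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.search(date_re, "released 2024-03-15")

year = m.group("year")   # lookup by group name
month = m["month"]       # match objects also support indexing

# \g<name> inserts a named group into the replacement.
reordered = re.sub(date_re, r"\g<day>/\g<month>/\g<year>", "released 2024-03-15")

# (?P=tag) requires the closing tag to repeat the captured name.
tag_re = r"<(?P<tag>\w+)>.*?</(?P=tag)>"
paired = [t.group() for t in re.finditer(tag_re, "<b>bold</b> <i>x</b>")]

print(year, month)
print(reordered)
print(paired)
```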

Common flags

re.I  re.IGNORECASE: makes matching case-insensitive
re.M  re.MULTILINE: multi-line matching; ^ and $ also match at the start and end of each line
re.S  re.DOTALL: makes . match every character, including line breaks
import re
ret=re.findall("^python$","Python",re.I)
print(ret)
ret=re.findall("^python$","Python\npython",re.I)
print(ret)
ret=re.findall("^python$","Python\npython",re.I|re.M)
print(ret)

The output is as follows

['Python']
[]
['Python', 'python']
# Case-insensitive and multi-line matching
msg="""
python
python
Python
"""
print(re.findall("^python$",msg,re.M|re.I))
print(re.findall(".+",msg,re.S))

The output is as follows

['python', 'python', 'Python']
['\npython\npython\nPython\n']

Inline flags

(?imx)      Turns on the optional flags i, m, or x for the whole regular expression; it should be placed at the start of the pattern.
(?imx: re)  Applies the optional flags i, m, or x only to the expression inside the parentheses.
import re
ret=re.findall("(?i)^python$","Python")
print(ret)
ret=re.findall("(?i)^python$","Python\npython")
print(ret)
ret=re.findall("(?im)^python$","Python\npython")
print(ret)

The output is as follows

['Python']
[]
['Python', 'python']

An inline flag can also be limited to one part of the pattern: (?i:...) affects only the expression inside its parentheses. In the examples below, the space after the group is an ordinary literal that matches the space in the text.

ret=re.findall("(?i:hello) Python","Hello python")
print(ret)
ret=re.findall("(?i:hello) python","Hello python")
print(ret)

The output is as follows

[]
['Hello python']
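
The x flag (re.X, re.VERBOSE) is listed above but not demonstrated. As a sketch, it lets a pattern be spread over several lines with comments, because whitespace and # comments inside the pattern are ignored. The phone-number pattern is the one from the grouping section.

```python
import re

# With re.VERBOSE, whitespace and comments in the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # first block: three digits
    -         # literal hyphen
    (\d{4})   # second block: four digits
    -
    (\d{4})   # third block: four digits
""", re.VERBOSE)

parts = phone.search("tel:173-7572-2991").groups()

# The same flag written inline, at the very start of the pattern.
inline_parts = re.search(r"(?x) (\d{3}) - (\d{4}) - (\d{4})",
                         "tel:173-7572-2991").groups()

print(parts)
print(inline_parts)
```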

Assertions

Regular expression assertions come in two kinds: lookahead and lookbehind.
Together they take four forms:
• (?=pattern) zero-width positive lookahead assertion
• (?!pattern) zero-width negative lookahead assertion
• (?<=pattern) zero-width positive lookbehind assertion
• (?<!pattern) zero-width negative lookbehind assertion

1. Zero-width positive lookahead assertion

import re
s='a reguler expression'
print(re.findall(r're(?=guler)',s))
s='a reguller expression'
print(re.findall(r're(?=guler)',s))

  The output is as follows

['re']
[]

2. Zero-width negative lookahead assertion

import re
s='a reguler expression'
print(re.findall(r're(?!guler)',s))
s='a reguller expression'
print(re.findall(r're(?!guler)',s))

  The output is as follows

['re']
['re', 're']

3. Zero-width positive lookbehind assertion

import re
s='a reguler expression'
print(re.findall(r'(?<=re)guler',s))
s='a reguller expression'
print(re.findall(r'(?<=re)guler',s))

The output is as follows

['guler']
[]

4. Zero-width negative lookbehind assertion

import re
s='a reguler expression'
print(re.findall(r'(?<!re)guler',s))
s='a reguller expression'
print(re.findall(r'(?<!re)expression',s))

The output is as follows

[]
['expression']
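
Because assertions consume no characters, they can pick out positions or filter matches by their surroundings. The sketch below gives three small uses; the helper name add_commas and the sample strings are made up for illustration.

```python
import re

def add_commas(digits: str) -> str:
    # Insert a comma at every position that has a digit before it
    # and a multiple of three digits after it up to the end.
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", digits)

grouped = add_commas("1234567")
short = add_commas("123")

# Positive lookbehind: numbers that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $15, 20 items, $7")

# Negative lookahead: file names whose extension is not "py".
not_python = re.findall(r"\b\w+\.(?!py\b)\w+", "a.py b.txt c.md")

print(grouped, short)
print(prices)
print(not_python)
```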

