
The most understandable Python regular expression tutorial in history


This article is a quick tutorial that covers only the most basic knowledge. To learn more, please refer to the official documentation or other articles.

Contents

Preface
One. re.match
- 1.1 Common special characters
- 1.2 Common special sequences
- 1.3 Some examples
  - 1.3.1 Matching hexadecimal RGB color values
  - 1.3.2 Matching all integers from 1 to 100
  - 1.3.3 Matching email addresses
  - 1.3.4 Matching IPv4 addresses
- 1.4 Modifiers
Two. Other functions
- 2.1 re.fullmatch
- 2.2 re.search
- 2.3 re.findall
- 2.4 re.sub
- 2.5 re.split
Three. Greedy and non-greedy matching
Four. Recommended sites

Preface

What is a regular expression? A regular expression is a special kind of string that lets you easily check whether another string matches a given pattern. Python's re module provides full regular expression support. In the next few chapters, we introduce the most commonly used functions in re in detail.

import re

One. re.match

re.match tries to match a pattern at the start of a string. If the match succeeds, it returns a re.Match object; if it fails, it returns None. The signature is:

re.match(pattern, string, flags=0)

pattern: the regular expression;
string: the string to match against;
flags: modifiers that control how matching is performed.

A regular expression can contain both ordinary characters and special characters. If it contains only ordinary characters, it matches that exact string:

""" Try to match the string abc at the start of the string abcde """
s = re.match('abc', 'abcde')
print(s)
# <re.Match object; span=(0, 3), match='abc'>

The string abc starts at index 0 and ends at index 2 within abcde. Because spans are half-open (left-closed, right-open), span is (0, 3).

To get the matched text, use the group method:

print(s.group())
# abc

Note that the following match fails, because match only matches at the start of the string:

s = re.match('bcd', 'abcde')
print(s)
# None
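
Because a failed match returns None, calling group on the result directly raises an AttributeError. A minimal sketch of a guard follows; the helper name first_match is invented here for illustration and is not part of the re module:

```python
import re

def first_match(pattern, string):
    """Return the matched text, or None when re.match finds nothing."""
    m = re.match(pattern, string)
    # re.match returns None on failure, so check before calling .group()
    return m.group() if m is not None else None

print(first_match('abc', 'abcde'))  # abc
print(first_match('bcd', 'abcde'))  # None
```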

1.1 Common special characters

The common special characters are listed in the table below:

Special character    Effect
.        Matches any single character except the newline \n
^        Matches the start position
$        Matches the end position (or just before a trailing newline)
*        The preceding character may appear 0 or more times
+        The preceding character may appear 1 or more times
?        The preceding character may appear 0 times or 1 time
{m}      The preceding character appears exactly m times
{m,}     The preceding character appears m or more times
{,n}     The preceding character appears at most n times
{m,n}    The preceding character appears between m and n times
[]       Matches any one of the characters listed inside []
()       Matches the expression inside (), which forms a group
|        Or

For simplicity, let's define a match function that displays the matching results more intuitively:

def match(pattern, list_of_strings):
    for string in list_of_strings:
        # Keep the match object so that its result can be printed
        res = re.match(pattern, string)
        if res:
            print(' The match is successful ！ The result is :', res.group())
        else:
            print(' Matching failure ！')

.：

match('.', ['a', 'ab', 'abc'])
# The match is successful ！ The result is : a
# The match is successful ！ The result is : a
# The match is successful ！ The result is : a

Because match looks for a single character at the start of each string, every result is a.

^、$：

match('^ab', ['ab', 'abc', 'adc', 'bac'])
# The match is successful ！ The result is : ab
# The match is successful ！ The result is : ab
# Matching failure ！
# Matching failure ！
match('cd$', ['cd', 'acd', 'adc', 'cdcd'])
# The match is successful ！ The result is : cd
# Matching failure ！
# Matching failure ！
# Matching failure ！

The pattern cd$ looks as if it matches any string ending with cd, but remember that match starts at the beginning of the string, so the statement above effectively matches only the exact string cd.

*、+、?：

match('a*', ['aa', 'aba', 'baa', 'aaaa'])
# The match is successful ！ The result is : aa
# The match is successful ！ The result is : a
# The match is successful ！ The result is : 
# The match is successful ！ The result is : aaaa
match('a+', ['aa', 'aba', 'baa'])
# The match is successful ！ The result is : aa
# The match is successful ！ The result is : a
# Matching failure ！
match('a?', ['aa', 'ab', 'ba'])
# The match is successful ！ The result is : a
# The match is successful ！ The result is : a
# The match is successful ！ The result is :

Note that a* means a may appear 0 or more times, so matching against baa succeeds with an empty string.

{m}, {m,}, {,n}, {m,n} (note: there must be no spaces inside the braces):

match('a{3}', ['abaa', 'aaab', 'baaa'])
# Matching failure ！
# The match is successful ！ The result is : aaa
# Matching failure ！
match('a{3,}', ['aaab', 'aaaab', 'baaa'])
# The match is successful ！ The result is : aaa
# The match is successful ！ The result is : aaaa
# Matching failure ！
match('a{,3}', ['aaab', 'aaaab', 'baaa'])
# The match is successful ！ The result is : aaa
# The match is successful ！ The result is : aaa
# The match is successful ！ The result is : 
match('a{3,5}', ['a' * i for i in range(2, 7)])
# Matching failure ！
# The match is successful ！ The result is : aaa
# The match is successful ！ The result is : aaaa
# The match is successful ！ The result is : aaaaa
# The match is successful ！ The result is : aaaaa

[]：

match('[123]', [str(i) for i in range(1, 5)])
# The match is successful ！ The result is : 1
# The match is successful ！ The result is : 2
# The match is successful ！ The result is : 3
# Matching failure ！

Note that [123] can be abbreviated as [1-3]. In the same way, to match a single digit we can use a regular expression such as [0-9]:

match('[0-9]', ['a', 'A', '1', '3', '_'])
# Matching failure ！
# Matching failure ！
# The match is successful ！ The result is : 1
# The match is successful ！ The result is : 3
# Matching failure ！

Going further, how do we match all integers in the interval [1, 35]? A natural idea is to use [1-35], but look closely at the following example:

match('[1-35]', ['1', '2', '3', '4'])
# The match is successful ！ The result is : 1
# The match is successful ！ The result is : 2
# The match is successful ！ The result is : 3
# Matching failure ！

The number 4 fails to match. This is because - only joins the two characters immediately adjacent to it, so [1-35] actually means the digits 1, 2, 3 and 5. In other words, every character other than these four fails to match.

We can split the problem into three cases: the tens digit is 3; the tens digit is 1 or 2; or there is only a units digit (this requires the or operator |):

pattern = '3[0-5]|[12][0-9]|[1-9]'

This regular expression does match every integer in [1, 35] correctly, however:

match('3[0-5]|[12][0-9]|[1-9]', ['36', '350'])
# The match is successful ！ The result is : 3
# The match is successful ！ The result is : 35

Numbers outside the interval also match successfully, because only their leading digits are consumed. So we need $ to prevent such false positives. The correct pattern is:

pattern = '(3[0-5]|[12][0-9]|[1-9])$'

The role of () is explained later (for now, read it as grouping the alternatives into a single unit).
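
One way to gain confidence in this pattern is a brute-force check over a small range. The following is only a sketch: the range 0 to 99 and the variable name accepted are arbitrary choices made for this illustration.

```python
import re

pattern = '(3[0-5]|[12][0-9]|[1-9])$'

# Collect every integer from 0 to 99 whose decimal string matches the pattern
accepted = [n for n in range(100) if re.match(pattern, str(n))]

print(accepted == list(range(1, 36)))  # True
```

Dropping the parentheses would change the meaning: $ would then apply only to the last alternative, and 36 would match again through 3[0-5]|... falling back to a partial match.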

We can also check whether a given character is a letter; the corresponding regular expression is [a-zA-Z]:

match('[a-zA-Z]', ['-', 'a', '9', 'G', '.'])
# Matching failure ！
# The match is successful ！ The result is : a
# Matching failure ！
# The match is successful ！ The result is : G
# Matching failure ！

To match non-digit characters, put ^ at the start of the brackets; there it means taking the complement of the set:

match('[^0-9]', ['-', 'a', '3', 'M', '9', '_'])
# The match is successful ！ The result is : -
# The match is successful ！ The result is : a
# Matching failure ！
# The match is successful ！ The result is : M
# Matching failure ！
# The match is successful ！ The result is : _

()：

""" Match one or more repetitions of ab """
match('(ab)+', ['ac', 'abc', 'abbc', 'abababac', 'adc'])
# Matching failure ！
# The match is successful ！ The result is : ab
# The match is successful ！ The result is : ab
# The match is successful ！ The result is : ababab
# Matching failure ！

Note that a regular expression such as ab+ does not work here: it means a single a followed by one or more b. So we must use () to treat ab as a unit, which is also called a group.
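
The difference between the two patterns can be seen side by side in a short sketch (the variable names are chosen only for this illustration):

```python
import re

# Without a group, + applies only to the b immediately before it
without_group = re.match('ab+', 'ababab').group()
with_group = re.match('(ab)+', 'ababab').group()
repeated_b = re.match('ab+', 'abbb').group()

print(without_group)  # ab
print(with_group)     # ababab
print(repeated_b)     # abbb
```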

1.2 Common special sequences

A \ followed by a single character is called a special sequence. Common special sequences are listed in the table below:

Special sequence    Effect
\d    Equivalent to [0-9], i.e. any digit (mnemonic: digit)
\D    Equivalent to [^\d], i.e. any non-digit
\s    Any whitespace character (mnemonic: space)
\S    Equivalent to [^\s], i.e. any non-whitespace character
\w    Equivalent to [a-zA-Z0-9_], i.e. any word character: letters, digits and the underscore (mnemonic: word)
\W    Equivalent to [^\w], i.e. any non-word character

Note: for ordinary str patterns in Python 3, these sequences also match non-ASCII digits, whitespace and letters; the bracket equivalents above are exact for ASCII text, or when the re.ASCII flag is used.

""" Example 1 """
match(r'\d', ['1', 'a', '_', '-'])
# The match is successful ！ The result is : 1
# Matching failure ！
# Matching failure ！
# Matching failure ！
match(r'\D', ['1', 'a', '_', '-'])
# Matching failure ！
# The match is successful ！ The result is : a
# The match is successful ！ The result is : _
# The match is successful ！ The result is : -
""" Example 2 """
match(r'\s', ['1', 'a', '_', ' '])
# Matching failure ！
# Matching failure ！
# Matching failure ！
# The match is successful ！ The result is : 
match(r'\S', ['1', 'a', '_', ' '])
# The match is successful ！ The result is : 1
# The match is successful ！ The result is : a
# The match is successful ！ The result is : _
# Matching failure ！
""" Example 3 """
match(r'\w', ['1', 'a', '_', ' ', ']'])
# The match is successful ！ The result is : 1
# The match is successful ！ The result is : a
# The match is successful ！ The result is : _
# Matching failure ！
# Matching failure ！
match(r'\W', ['1', 'a', '_', ' ', ']'])
# Matching failure ！
# Matching failure ！
# Matching failure ！
# The match is successful ！ The result is : 
# The match is successful ！ The result is : ]
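
To see at a glance which special sequences accept a given character, a small helper can be sketched as follows. The name classify and the SEQUENCES list are invented for this illustration. The patterns are written as raw strings (r'...') so that Python does not try to interpret the backslash as a string escape, which recent Python versions warn about.

```python
import re

# Raw strings keep the backslash intact for the re module
SEQUENCES = [r'\d', r'\D', r'\s', r'\S', r'\w', r'\W']

def classify(ch):
    """Return the special sequences that match the single character ch."""
    return [seq for seq in SEQUENCES if re.match(seq, ch)]

print(classify('7'))   # ['\\d', '\\S', '\\w']
print(classify('_'))   # ['\\D', '\\S', '\\w']
print(classify('\t'))  # ['\\D', '\\s', '\\W']
```

Each character matches exactly one sequence from each lowercase/uppercase pair, since the uppercase form is the complement of the lowercase one.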

1.3 Some examples

Next, we use some examples to consolidate the concepts covered so far.

1.3.1 Matching hexadecimal RGB color values

A hexadecimal color value usually has the form #XXXXXX, where each X is either a digit or one of the letters A-F (lowercase letters are not considered here).

regex = '#[A-F0-9]{6}$'
colors = ['#00', '#FFFFFF', '#FFAAFF', '#00HH00', '#AABBCC', '#000000', '#FFFFFFFF']
match(regex, colors)
# Matching failure ！
# The match is successful ！ The result is : #FFFFFF
# The match is successful ！ The result is : #FFAAFF
# Matching failure ！
# The match is successful ！ The result is : #AABBCC
# The match is successful ！ The result is : #000000
# Matching failure ！

1.3.2 Matching all integers from 1 to 100

We do not allow leading zeros: for numbers such as 8 and 35, forms like 08, 008 and 035 are excluded.

We simply handle the three-digit, two-digit and one-digit cases separately:

regex = r'(100|[1-9]\d|[1-9])$'
numbers = ['0', '5', '05', '005', '12', '012', '89', '100', '101']
match(regex, numbers)
# Matching failure ！
# The match is successful ！ The result is : 5
# Matching failure ！
# Matching failure ！
# The match is successful ！ The result is : 12
# Matching failure ！
# The match is successful ！ The result is : 89
# The match is successful ！ The result is : 100
# Matching failure ！

1.3.3 Matching email addresses

Suppose we run an email service whose domain is sky.com. When a new user signs up, the username may contain only digits, letters and underscores, must not start with an underscore, and must be 6 to 18 characters long. How can we use a regular expression to check whether the address entered by a user meets these rules?

regex = r'[a-zA-Z0-9][\w]{5,17}@sky\.com$'
emails = [
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
]
match(regex, emails)
# Matching failure ！
# Matching failure ！
# Matching failure ！
# Matching failure ！
# The match is successful ！ The result is : [email protected]
# Matching failure ！
# Matching failure ！
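
The addresses in the listing above did not survive extraction, so the following sketch exercises the same rules with hypothetical addresses that were made up for this illustration; the expected results are derived from the rules stated above (first character not an underscore, 6 to 18 characters in total).

```python
import re

regex = r'[a-zA-Z0-9]\w{5,17}@sky\.com$'

# Hypothetical addresses, invented for illustration only
emails = [
    'abcde@sky.com',         # name has 5 characters: too short
    'abcdef@sky.com',        # 6 characters: shortest valid name
    '_abcdef@sky.com',       # starts with an underscore
    'a_b_c_1_2_3@sky.com',   # letters, digits and underscores
    'a' * 19 + '@sky.com',   # 19 characters: too long
    'abcdef@sky.org',        # wrong domain
]
results = [re.match(regex, e) is not None for e in emails]
print(results)
# [False, True, False, True, False, False]
```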

Suppose sky mail lets users upgrade to VIP, after which their email domain becomes vip.sky.com, and both regular and VIP users count as sky mail users. In this case the regular expression needs to be rewritten as:

regex = r'[a-zA-Z0-9][\w]{5,17}@(vip\.)?sky\.com$'
emails = [
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
]
match(regex, emails)
# The match is successful ！ The result is : [email protected]
# The match is successful ！ The result is : [email protected]
# Matching failure ！
# Matching failure ！

1.3.4 Matching IPv4 addresses

An IPv4 address usually has the form X.X.X.X, where each X ranges from 0 to 255. Leading zeros are again not allowed.

regex = r'((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$'
ipv4s = [
'0.0.0.0',
'0.0.255.0',
'255.0.0',
'127.0.0.1.0',
'256.0.0.123',
'255.255.255.255',
'012.08.0.0',
]
match(regex, ipv4s)
# The match is successful ！ The result is : 0.0.0.0
# The match is successful ！ The result is : 0.0.255.0
# Matching failure ！
# Matching failure ！
# Matching failure ！
# The match is successful ！ The result is : 255.255.255.255
# Matching failure ！
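
The long pattern is easier to read when the repeated part is named once. The sketch below rebuilds the same expression from a single octet sub-pattern and brute-force checks the first field; the names OCTET, IPV4 and is_ipv4 are hypothetical and chosen only for this example.

```python
import re

# One field of the address: 250-255, 200-249, 100-199, or 0-99 without leading zeros
OCTET = r'(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
# Three "octet plus dot" groups followed by a final octet
IPV4 = r'(' + OCTET + r'\.){3}' + OCTET + r'$'

def is_ipv4(text):
    """Return True when text matches the IPv4 pattern built above."""
    return re.match(IPV4, text) is not None

# Try every value from 0 to 299 in the first field
accepted = [n for n in range(300) if is_ipv4(f'{n}.0.0.1')]
print(accepted == list(range(256)))  # True
```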

1.4 Modifiers

Modifiers are passed through the flags parameter of re.match. Commonly used modifiers are listed in the table below:

Modifier    Effect
re.I    Ignore case when matching
re.S    Make . match any single character, including the newline \n

s = re.match('a', 'A', flags=re.I)
print(s.group())
# A
s = re.match('.', '\n', flags=re.S)
print(s)
# <re.Match object; span=(0, 1), match='\n'>
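
As a further illustration, re.I offers one possible way to accept lowercase hexadecimal color values without changing the pattern from section 1.3.1, and several modifiers can be combined with |. This is a sketch; the variable names are chosen only for the example.

```python
import re

regex = '#[A-F0-9]{6}$'

# Without re.I the lowercase letters are rejected
strict = re.match(regex, '#ffaa00')
# With re.I the same pattern accepts them
relaxed = re.match(regex, '#ffaa00', flags=re.I)
# Modifiers are combined with the | operator
both = re.match('a.b', 'A\nB', flags=re.I | re.S)

print(strict)             # None
print(relaxed.group())    # #ffaa00
print(both is not None)   # True
```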

Two. Other functions

2.1 re.fullmatch

re.match matches from the start of the string: once the pattern has matched, the rest of the string is ignored. re.fullmatch, in contrast, requires the whole string to match.

The signature is:

re.fullmatch(pattern, string, flags=0)

For example:

print(re.match(r'\d', '3a'))
# <re.Match object; span=(0, 1), match='3'>
print(re.fullmatch(r'\d', '3a'))
# None
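
re.fullmatch is close to, but not quite the same as, adding $ to a pattern used with re.match: $ also matches just before a newline at the very end of the string, while re.fullmatch insists that every character is consumed. A small sketch (variable names are for illustration only):

```python
import re

s = '3\n'

with_dollar = re.match(r'\d$', s)   # $ accepts the position before the final newline
full = re.fullmatch(r'\d', s)       # the newline is left over, so this fails

print(with_dollar)  # <re.Match object; span=(0, 1), match='3'>
print(full)         # None
```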

2.2 re.search

re.search scans the string from left to right and returns the first successful match. If nothing matches, it returns None.

The signature is:

re.search(pattern, string, flags=0)

For example:

print(re.search('ab', 'abcd'))
# <re.Match object; span=(0, 2), match='ab'>
print(re.search('cd', 'abcd'))
# <re.Match object; span=(2, 4), match='cd'>
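
Recall the pattern cd$ from section 1.1, which with re.match only accepted the exact string cd. Since re.search may begin a match at any position, the same pattern now finds every string that ends with cd. A sketch, with a helper name found_cd_at_end made up for this example:

```python
import re

def found_cd_at_end(s):
    """True when the pattern cd$ is found somewhere in s."""
    return re.search('cd$', s) is not None

for s in ['cd', 'acd', 'adc', 'cdcd']:
    # Compare re.search with re.match on the same pattern
    print(s, found_cd_at_end(s), re.match('cd$', s) is not None)
# cd True True
# acd True False
# adc False False
# cdcd True False
```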

2.3 re.findall

re.findall finds all substrings of a given string that match the regular expression and returns them as a list. If the pattern contains several groups, each item is a tuple of the groups. The signature is:

re.findall(pattern, string, flags=0)

For example:

res = re.findall(r'\d+', 'ab 13 cd- 274 .]')
print(res)
# ['13', '274']
res = re.findall(r'(\w+):(\d+)', 'Xiaoming:16, Xiaohong:14')
print(res)
# [('Xiaoming', '16'), ('Xiaohong', '14')]
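
Since each tuple has one item per group, the result converts naturally into other structures. The sketch below builds a dictionary from the same string and also shows that a pattern with a single group yields plain strings rather than tuples; the variable names are illustrative.

```python
import re

text = 'Xiaoming:16, Xiaohong:14'

# Two groups: each item is a (name, age) tuple
ages = {name: int(age) for name, age in re.findall(r'(\w+):(\d+)', text)}
# One group: each item is just the text of that group
names = re.findall(r'(\w+):\d+', text)

print(ages)   # {'Xiaoming': 16, 'Xiaohong': 14}
print(names)  # ['Xiaoming', 'Xiaohong']
```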

2.4 re.sub

re.sub replaces each matched substring with another string, working from left to right. The signature is:

re.sub(pattern, repl, string, count=0, flags=0)

repl is the replacement string, and count is the maximum number of replacements; 0 means replace all matches.

For example:

res = re.sub(' ', '', '1 2 3 4 5')
print(res)
# 12345
res = re.sub(' ', '', '1 2 3 4 5', count=2)
print(res)
# 123 4 5
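
The pattern does not have to be a fixed string. As a sketch, the special sequence \s together with + can collapse every run of whitespace into a single space (the sample text and variable names are invented for this example):

```python
import re

messy = 'a   b \t c\n\nd'

# Every run of one or more whitespace characters becomes one space
tidy = re.sub(r'\s+', ' ', messy)
# count=1 replaces only the first run
once = re.sub(r'\s+', ' ', messy, count=1)

print(tidy)  # a b c d
```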

2.5 re.split

re.split cuts the string at each match (from left to right) and returns a list. The signature is:

re.split(pattern, string, maxsplit=0, flags=0)

maxsplit is the maximum number of splits; 0 means split at every match.

For example:

res = re.split(' ', '1 2 3 4 5')
print(res)
# ['1', '2', '3', '4', '5']
res = re.split(' ', '1 2 3 4 5', maxsplit=2)
print(res)
# ['1', '2', '3 4 5']
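
Unlike str.split, re.split accepts a pattern as the separator, so several different delimiters can be handled in one call. A minimal sketch, with sample data made up for the illustration:

```python
import re

data = 'a, b;c  d'

# Split on any run of commas, semicolons or whitespace
parts = re.split(r'[,;\s]+', data)
first_only = re.split(r'[,;\s]+', data, maxsplit=1)

print(parts)       # ['a', 'b', 'c', 'd']
print(first_only)  # ['a', 'b;c  d']
```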

Three. Greedy and non-greedy matching

All quantifiers (*, +, ?, {m}, {m,}, {,n}, {m,n}) are greedy by default: when a match is possible, they match as much as possible.

Take * as an example. Clearly \d* matches a run of digits of any length:

match(r'\d*', ['1234abc'])
# The match is successful ！ The result is : 1234

By the greedy principle, the final result must be 1234.

Sometimes we want to match as little as possible rather than as much as possible. In that case we append ? to the quantifier, which makes it non-greedy:

match(r'\d*?', ['1234abc'])
# The match is successful ！ The result is :

In non-greedy mode, \d is matched 0 times (as few as possible), so an empty string is returned.

Let's take a look at the other quantifiers:

match(r'\d+?', ['1234abc'])
# The match is successful ！ The result is : 1
match(r'\d??', ['1234abc'])
# The match is successful ！ The result is : 
match(r'\d{2,}?', ['1234abc'])
# The match is successful ！ The result is : 12
match(r'\d{2,5}?', ['1234abc'])
# The match is successful ！ The result is : 12
match(r'\d{,5}?', ['1234abc'])
# The match is successful ！ The result is :
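
The difference matters most when something must follow the quantified part. In the sketch below, the sample string and variable names are invented for illustration: the greedy .* runs to the last > in the string, while the non-greedy .*? stops at the first one.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.match('<.*>', html).group()   # as much as possible
lazy = re.match('<.*?>', html).group()    # as little as possible

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```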