This article is a quick tutorial that introduces only the most basic knowledge. To learn more, please refer to the official documentation or other articles.
What is a regular expression? A regular expression is a special kind of string that lets you easily check whether a string matches a given pattern. Python's re module provides full regular expression support, and the following sections introduce the most commonly used functions in re in detail.
import re
re.match
re.match tries to match a pattern at the start of a string. If the match succeeds, it returns a re.Match object; if it fails, it returns None. The signature is:
re.match(pattern, string, flags=0)
pattern: the regular expression;
string: the string to match;
flags: modifiers that control how matching is performed.
Regular expressions can contain both ordinary characters and special characters. A pattern made up only of ordinary characters matches that exact string:
""" Try to match the string abc at the start of the string abcde """
s = re.match('abc', 'abcde')
print(s)
# <re.Match object; span=(0, 3), match='abc'>
In the string abcde, the substring abc starts at index 0 and ends at index 2. Because spans are left-closed and right-open, span is (0, 3).
To get the matched text, use the group method:
print(s.group())
# abc
Note that the following match fails, because match only matches at the start of the string:
s = re.match('bcd', 'abcde')
print(s)
# None
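Because re.match returns None on failure, calling group() directly on a failed match raises an AttributeError. The following is a minimal sketch of guarding against that; the helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, string):
    """Return the text matched at the start of string, or None if there is no match."""
    m = re.match(pattern, string)
    # m is None when the match fails, so check it before calling group()
    return m.group() if m is not None else None

print(first_match('abc', 'abcde'))  # abc
print(first_match('bcd', 'abcde'))  # None
```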
Commonly used special characters are listed in the table below:
Character   Description
.           Matches any single character except the line break \n
^           Matches the start position
$           Matches the end position (or just before a trailing line break)
*           The preceding character may appear 0 or more times
+           The preceding character may appear 1 or more times
?           The preceding character may appear 0 times or 1 time
{m}         The preceding character appears exactly m times
{m,}        The preceding character appears m or more times
{,n}        The preceding character appears at most n times
{m,n}       The preceding character appears between m and n times
[]          Matches any one of the characters listed inside []
()          Matches the expression inside (), which forms a group
|           Or
For simplicity, let's define a match function that displays the matching results more intuitively:
def match(pattern, list_of_strings):
    for string in list_of_strings:
        # Keep the Match object so that group() can be called on it below
        res = re.match(pattern, string)
        if res:
            print(' The match is successful ! The result is :', res.group())
        else:
            print(' Matching failure !')
The . character:
match('.', ['a', 'ab', 'abc'])
# The match is successful ! The result is : a
# The match is successful ! The result is : a
# The match is successful ! The result is : a
Because we match a single character starting from the beginning of each string, every result is a.
^ and $:
match('^ab', ['ab', 'abc', 'adc', 'bac'])
# The match is successful ! The result is : ab
# The match is successful ! The result is : ab
# Matching failure !
# Matching failure !
match('cd$', ['cd', 'acd', 'adc', 'cdcd'])
# The match is successful ! The result is : cd
# Matching failure !
# Matching failure !
# Matching failure !
The pattern cd$ looks as if it matches any string ending with cd. But remember that match starts at the beginning of the string, so the call above actually matches only the exact string cd.
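To actually find strings that end with cd, the pattern has to be allowed to start anywhere in the string. One possible approach, sketched here, is to use re.search (covered later in this article), which scans the whole string instead of anchoring at the start.

```python
import re

words = ['cd', 'acd', 'adc', 'cdcd']
# re.search scans the whole string, so 'cd$' finds a trailing cd at any offset
ending_with_cd = [w for w in words if re.search('cd$', w)]
print(ending_with_cd)  # ['cd', 'acd', 'cdcd']
```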
*, + and ?:
match('a*', ['aa', 'aba', 'baa', 'aaaa'])
# The match is successful ! The result is : aa
# The match is successful ! The result is : a
# The match is successful ! The result is :
# The match is successful ! The result is : aaaa
match('a+', ['aa', 'aba', 'baa'])
# The match is successful ! The result is : aa
# The match is successful ! The result is : a
# Matching failure !
match('a?', ['aa', 'ab', 'ba'])
# The match is successful ! The result is : a
# The match is successful ! The result is : a
# The match is successful ! The result is :
Note that a* allows a to appear 0 or more times, so matching baa succeeds with an empty string.
{m}, {m,}, {,n} and {m,n} (note that there are no spaces inside the braces):
match('a{3}', ['abaa', 'aaab', 'baaa'])
# Matching failure !
# The match is successful ! The result is : aaa
# Matching failure !
match('a{3,}', ['aaab', 'aaaab', 'baaa'])
# The match is successful ! The result is : aaa
# The match is successful ! The result is : aaaa
# Matching failure !
match('a{,3}', ['aaab', 'aaaab', 'baaa'])
# The match is successful ! The result is : aaa
# The match is successful ! The result is : aaa
# The match is successful ! The result is :
match('a{3,5}', ['a' * i for i in range(2, 7)])
# Matching failure !
# The match is successful ! The result is : aaa
# The match is successful ! The result is : aaaa
# The match is successful ! The result is : aaaaa
# The match is successful ! The result is : aaaaa
[]:
match('[123]', [str(i) for i in range(1, 5)])
# The match is successful ! The result is : 1
# The match is successful ! The result is : 2
# The match is successful ! The result is : 3
# Matching failure !
Note that [123] can be abbreviated as [1-3]. Likewise, to match a single digit, you can use the regular expression [0-9]:
match('[0-9]', ['a', 'A', '1', '3', '_'])
# Matching failure !
# Matching failure !
# The match is successful ! The result is : 1
# The match is successful ! The result is : 3
# Matching failure !
Going further, how do we match all integers in the interval [1, 35]? A natural idea is to use [1-35], but look closely at the following example:
match('[1-35]', ['1', '2', '3', '4'])
# The match is successful ! The result is : 1
# The match is successful ! The result is : 2
# The match is successful ! The result is : 3
# Matching failure !
The number 4 fails to match. This is because - only joins the two characters immediately adjacent to it, so [1-35] actually stands for the digits 1, 2, 3 and 5. In other words, any character other than these four digits fails to match.
We can split the problem into three cases: the tens digit is 3, the tens digit is 1 or 2, or there is only one digit (the cases are combined with the or operator |):
pattern = '3[0-5]|[12][0-9]|[1-9]'
This regular expression does match every integer in [1, 35] correctly, however:
match('3[0-5]|[12][0-9]|[1-9]', ['36', '350'])
# The match is successful ! The result is : 3
# The match is successful ! The result is : 35
Numbers outside the interval also match successfully. So we need $ to prevent such false positives. The correct pattern is:
pattern = '(3[0-5]|[12][0-9]|[1-9])$'
The role of () is covered later (for now, think of it as treating its contents as a single unit).
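As a quick sanity check of the anchored pattern, the sketch below tests every integer from 0 to 99 and collects the ones that match. The helper name in_range_1_35 is made up for this example.

```python
import re

PATTERN = '(3[0-5]|[12][0-9]|[1-9])$'

def in_range_1_35(text):
    # True only if the text matches from the start all the way to the end
    return re.match(PATTERN, text) is not None

accepted = [n for n in range(0, 100) if in_range_1_35(str(n))]
print(accepted == list(range(1, 36)))  # True
print(in_range_1_35('36'), in_range_1_35('350'))  # False False
```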
Besides that, we can check whether a given character is a letter. The corresponding regular expression is [a-zA-Z]:
match('[a-zA-Z]', ['-', 'a', '9', 'G', '.'])
# Matching failure !
# The match is successful ! The result is : a
# Matching failure !
# The match is successful ! The result is : G
# Matching failure !
To match non-digit characters, use ^ inside the brackets, which takes the complement:
match('[^0-9]', ['-', 'a', '3', 'M', '9', '_'])
# The match is successful ! The result is : -
# The match is successful ! The result is : a
# Matching failure !
# The match is successful ! The result is : M
# Matching failure !
# The match is successful ! The result is : _
():
""" Match one or more repetitions of ab """
match('(ab)+', ['ac', 'abc', 'abbc', 'abababac', 'adc'])
# Matching failure !
# The match is successful ! The result is : ab
# The match is successful ! The result is : ab
# The match is successful ! The result is : ababab
# Matching failure !
Note that the regular expression ab+ does not do this: it matches a single a followed by one or more b characters. So we must use () to treat ab as a single unit, which is also called a group.
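The difference between the two patterns is easiest to see by running both on the same input. This is a small illustrative sketch; the sample string is chosen arbitrarily.

```python
import re

text = 'ababbb'
without_group = re.match('ab+', text).group()   # one a, then one or more b
with_group = re.match('(ab)+', text).group()    # one or more repetitions of ab
print(without_group)  # ab
print(with_group)     # abab
```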
A \ followed by a single character is called a special sequence. Common special sequences are listed in the table below (the equivalences hold for ASCII text):
Sequence    Description
\d          Equivalent to [0-9], i.e. all digits (mnemonic: digit)
\D          Equivalent to [^\d], i.e. all non-digits
\s          Whitespace characters such as space, tab and line break (mnemonic: space)
\S          Equivalent to [^\s], i.e. all non-whitespace characters
\w          Equivalent to [a-zA-Z0-9_], i.e. all word characters: letters, digits and the underscore (mnemonic: word)
\W          Equivalent to [^\w], i.e. all non-word characters
""" Example 1 """
match(r'\d', ['1', 'a', '_', '-'])
# The match is successful ! The result is : 1
# Matching failure !
# Matching failure !
# Matching failure !
match(r'\D', ['1', 'a', '_', '-'])
# Matching failure !
# The match is successful ! The result is : a
# The match is successful ! The result is : _
# The match is successful ! The result is : -
""" Example 2 """
match(r'\s', ['1', 'a', '_', ' '])
# Matching failure !
# Matching failure !
# Matching failure !
# The match is successful ! The result is :
match(r'\S', ['1', 'a', '_', ' '])
# The match is successful ! The result is : 1
# The match is successful ! The result is : a
# The match is successful ! The result is : _
# Matching failure !
""" Example 3 """
match(r'\w', ['1', 'a', '_', ' ', ']'])
# The match is successful ! The result is : 1
# The match is successful ! The result is : a
# The match is successful ! The result is : _
# Matching failure !
# Matching failure !
match(r'\W', ['1', 'a', '_', ' ', ']'])
# Matching failure !
# Matching failure !
# Matching failure !
# The match is successful ! The result is :
# The match is successful ! The result is : ]
Next, we will use some examples to further consolidate the concepts learned before .
The format of a hexadecimal color value is usually #XXXXXX, where each X is a digit or one of the letters A-F (lowercase letters are not considered here).
regex = '#[A-F0-9]{6}$'
colors = ['#00', '#FFFFFF', '#FFAAFF', '#00HH00', '#AABBCC', '#000000', '#FFFFFFFF']
match(regex, colors)
# Matching failure !
# The match is successful ! The result is : #FFFFFF
# The match is successful ! The result is : #FFAAFF
# Matching failure !
# The match is successful ! The result is : #AABBCC
# The match is successful ! The result is : #000000
# Matching failure !
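If lowercase letters should be accepted as well, one option is the re.I modifier described later in this article. The sketch below assumes that requirement, which goes beyond the example above; the function name is_hex_color is hypothetical.

```python
import re

def is_hex_color(text):
    # re.I lets [A-F] also accept a-f; the trailing $ rejects extra characters
    return re.match('#[A-F0-9]{6}$', text, flags=re.I) is not None

print(is_hex_color('#ffaa00'))    # True
print(is_hex_color('#FFAA00'))    # True
print(is_hex_color('#00HH00'))    # False
print(is_hex_color('#FFFFFFFF'))  # False
```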
Here we match the integers from 1 to 100. Leading zeros are not allowed: for numbers such as 8 and 35, forms like 08, 008 and 035 are rejected.
We just handle the three-digit, two-digit and one-digit cases separately:
regex = r'(100|[1-9]\d|[1-9])$'
numbers = ['0', '5', '05', '005', '12', '012', '89', '100', '101']
match(regex, numbers)
# Matching failure !
# The match is successful ! The result is : 5
# Matching failure !
# Matching failure !
# The match is successful ! The result is : 12
# Matching failure !
# The match is successful ! The result is : 89
# The match is successful ! The result is : 100
# Matching failure !
Suppose we run an email service whose domain is sky.com. When a new user registers, the user name may contain only digits, letters and underscores, cannot start with an underscore, and must be 6 to 18 characters long. How can we use a regular expression to check whether the address entered by the user meets this specification?
regex = r'[a-zA-Z0-9][\w]{5,17}@sky\.com$'
emails = [
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
]
match(regex, emails)
# Matching failure !
# Matching failure !
# Matching failure !
# Matching failure !
# The match is successful ! The result is : [email protected]
# Matching failure !
# Matching failure !
Suppose the sky email service lets users upgrade to VIP, after which their email domain becomes vip.sky.com, and both ordinary and VIP users count as sky email users. The regular expression then needs to be rewritten as:
regex = r'[a-zA-Z0-9][\w]{5,17}@(vip\.)?sky\.com$'
emails = [
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
]
match(regex, emails)
# The match is successful ! The result is : [email protected]
# The match is successful ! The result is : [email protected]
# Matching failure !
# Matching failure !
An IPv4 address usually has the format X.X.X.X, where each X is in the range 0-255. Leading zeros are again not considered here.
regex = r'((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$'
ipv4s = [
'0.0.0.0',
'0.0.255.0',
'255.0.0',
'127.0.0.1.0',
'256.0.0.123',
'255.255.255.255',
'012.08.0.0',
]
match(regex, ipv4s)
# The match is successful ! The result is : 0.0.0.0
# The match is successful ! The result is : 0.0.255.0
# Matching failure !
# Matching failure !
# Matching failure !
# The match is successful ! The result is : 255.255.255.255
# Matching failure !
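The pattern above repeats the same sub-pattern for a single number twice, which makes it hard to read. One way to tidy it up, sketched here, is to build the same pattern from a named piece. The names OCTET, IPV4 and is_ipv4 are made up for this example.

```python
import re

# One number in the range 0-255 without leading zeros
OCTET = r'(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
# Three "number plus dot" groups followed by a final number
IPV4 = r'(' + OCTET + r'\.){3}' + OCTET + r'$'

def is_ipv4(text):
    return re.match(IPV4, text) is not None

print(is_ipv4('255.255.255.255'))  # True
print(is_ipv4('256.0.0.123'))      # False
print(is_ipv4('012.08.0.0'))       # False
```

String concatenation is used instead of an f-string so that the braces in {3} do not need to be escaped.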
Modifiers are passed as the flags parameter of re.match. Commonly used modifiers are listed in the table below:
Modifier    Description
re.I        Ignore case when matching
re.S        Make . match any single character, including the line break \n
s = re.match('a', 'A', flags=re.I)
print(s.group())
# A
s = re.match('.', '\n', flags=re.S)
print(s)
# <re.Match object; span=(0, 1), match='\n'>
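Several modifiers can be combined with the bitwise or operator |. The short sketch below applies re.I and re.S together and compares the result with using re.I alone; the sample pattern and string are chosen only for illustration.

```python
import re

# With both flags, a matches A, . matches the line break, and b matches B
both = re.match('a.b', 'A\nB', flags=re.I | re.S)
# Without re.S, . cannot match the line break, so the match fails
only_ignore_case = re.match('a.b', 'A\nB', flags=re.I)
print(repr(both.group()))  # 'A\nB'
print(only_ignore_case)    # None
```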
re.match matches from the start of the string, so once the pattern has matched, the rest of the string is ignored. re.fullmatch, by contrast, must match the entire string.
The signature is:
re.fullmatch(pattern, string, flags=0)
For example:
print(re.match(r'\d', '3a'))
# <re.Match object; span=(0, 1), match='3'>
print(re.fullmatch(r'\d', '3a'))
# None
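re.fullmatch is close to, but not the same as, re.match with a trailing $. As noted in the table of special characters, $ also matches just before a line break at the end of the string, whereas re.fullmatch requires the whole string to be consumed. A small sketch of the difference:

```python
import re

with_dollar = re.match(r'\d$', '3\n')   # $ matches just before the final line break
full = re.fullmatch(r'\d', '3\n')       # the line break is left over, so this fails
print(with_dollar)  # <re.Match object; span=(0, 1), match='3'>
print(full)         # None
```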
re.search scans the entire string from left to right and returns the first successful match. If no match is found, it returns None.
The signature is:
re.search(pattern, string, flags=0)
For example:
print(re.search('ab', 'abcd'))
# <re.Match object; span=(0, 2), match='ab'>
print(re.search('cd', 'abcd'))
# <re.Match object; span=(2, 4), match='cd'>
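The Match object returned by re.search also exposes where the match was found through its start, end and span methods. The sketch below uses an arbitrary sample string in which the pattern occurs twice, to show that only the first occurrence is reported.

```python
import re

m = re.search('an', 'banana')
# Only the first occurrence is returned, even though an appears twice
print(m.group(), m.start(), m.end(), m.span())  # an 1 3 (1, 3)
```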
re.findall finds all substrings of a given string that match the regular expression and returns them as a list. The signature is:
re.findall(pattern, string, flags=0)
For example:
res = re.findall(r'\d+', 'ab 13 cd- 274 .]')
print(res)
# ['13', '274']
res = re.findall(r'(\w+):(\d+)', 'Xiaoming:16, Xiaohong:14')
print(res)
# [('Xiaoming', '16'), ('Xiaohong', '14')]
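What re.findall returns depends on the number of groups in the pattern: with no group it returns the whole matches, with one group it returns that group's text, and with several groups it returns tuples. The sketch below shows all three on the same input and, as one possible use, turns the tuples into a dictionary.

```python
import re

text = 'Xiaoming:16, Xiaohong:14'
whole = re.findall(r'\w+:\d+', text)        # no group: whole matches
ages_only = re.findall(r'\w+:(\d+)', text)  # one group: only the group's text
ages = {name: int(age) for name, age in re.findall(r'(\w+):(\d+)', text)}
print(whole)      # ['Xiaoming:16', 'Xiaohong:14']
print(ages_only)  # ['16', '14']
print(ages)       # {'Xiaoming': 16, 'Xiaohong': 14}
```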
re.sub replaces matched substrings with another string, working from left to right. The signature is:
re.sub(pattern, repl, string, count=0, flags=0)
repl is the replacement string, and count is the maximum number of replacements; 0 means replace all matches.
For example:
res = re.sub(' ', '', '1 2 3 4 5')
print(res)
# 12345
res = re.sub(' ', '', '1 2 3 4 5', count=2)
print(res)
# 123 4 5
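The repl argument can do more than insert a fixed string: it may refer to groups of the match with \1, \2 and so on, or it may be a function that receives the Match object. A short sketch of these variants, with sample inputs chosen only for illustration:

```python
import re

# Collapse every run of whitespace into a single space
collapsed = re.sub(r'\s+', ' ', '1   2 \t 3')
# Swap the two groups using backreferences in the replacement
swapped = re.sub(r'(\w+):(\d+)', r'\2:\1', 'Xiaoming:16, Xiaohong:14')
# Use a function to compute each replacement from the match
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b22')
print(collapsed)  # 1 2 3
print(swapped)    # 16:Xiaoming, 14:Xiaohong
print(doubled)    # a2 b44
```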
re.split splits the string at each match (from left to right) and returns a list. The signature is:
re.split(pattern, string, maxsplit=0, flags=0)
maxsplit is the maximum number of splits; 0 means split at every match.
For example:
res = re.split(' ', '1 2 3 4 5')
print(res)
# ['1', '2', '3', '4', '5']
res = re.split(' ', '1 2 3 4 5', maxsplit=2)
print(res)
# ['1', '2', '3 4 5']
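Unlike str.split, re.split accepts a full pattern as the separator, so several kinds of delimiters can be handled in one call. If the pattern contains a group, the matched separators are kept in the result as well. A minimal sketch with made-up inputs:

```python
import re

# Split on any run of commas, semicolons or whitespace
parts = re.split(r'[,;\s]+', 'a, b;c   d')
# A group in the pattern keeps the separators in the result
kept = re.split(r'(-)', '1-2-3')
print(parts)  # ['a', 'b', 'c', 'd']
print(kept)   # ['1', '-', '2', '-', '3']
```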
All quantifiers (*, +, ?, {m}, {m,}, {,n} and {m,n}) are greedy by default, that is, they match as much as possible while still allowing the match to succeed.
Take * as an example. Clearly \d* matches a run of digits of any length:
match(r'\d*', ['1234abc'])
# The match is successful ! The result is : 1234
By the greedy principle, the final result must be 1234.
Sometimes we want to match as little as possible rather than as much as possible. In that case we can append ? to the quantifier, which makes it non-greedy:
match(r'\d*?', ['1234abc'])
# The match is successful ! The result is :
In non-greedy mode, \d should appear 0 times (matching as little as possible), so an empty string is returned.
Let's look at the other quantifiers:
match(r'\d+?', ['1234abc'])
# The match is successful ! The result is : 1
match(r'\d??', ['1234abc'])
# The match is successful ! The result is :
match(r'\d{2,}?', ['1234abc'])
# The match is successful ! The result is : 12
match(r'\d{2,5}?', ['1234abc'])
# The match is successful ! The result is : 12
match(r'\d{,5}?', ['1234abc'])
# The match is successful ! The result is :
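A typical place where the difference between greedy and non-greedy matching matters is text with repeated delimiters. The sketch below uses a made-up HTML-like string: the greedy pattern runs to the last >, while the non-greedy one stops at the first.

```python
import re

html = '<b>bold</b>'
greedy = re.match('<.*>', html).group()   # .* grabs as much as possible
lazy = re.match('<.*?>', html).group()    # .*? stops at the first >
print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```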
regex101 can be used to practice regular expressions.