
Getting started with Python regular expressions

Editor: Python


Today we are going to study regular expressions in Python. Why learn this part? Because regular expressions make it very convenient to process text data, and they lay a foundation for getting into NLP later.

First, I recommend using a regular expression testing website to validate your expressions as you follow along.

Contents

- 1. Basics
- 2. Greedy and non-greedy modes
- 3. The use of backslashes
- 4. Usage of square brackets
- 5. Match start and end positions
- 6. Use of parentheses: group selection
- 7. Splitting strings with regular expressions
- Summary

1. Basics

Ordinary characters: an ordinary character in a regular expression matches itself literally.
Special characters: when these appear in a regular expression, they are not matched literally but carry a special meaning.

. matches any single character except the newline character

For example, the pattern 'is ...' matches 'is ' followed by any three characters:

# This shows how to use regular expressions in Python
import re  # regular expression library

content = '''
The apple is red
The banana is yellow
The leaf is green
The sky is blue
'''
# Compile the expression into a pattern object, which provides methods such as findall
p = re.compile(r'is ...')
for i in p.findall(content):
    print(i)

The result is as follows:

is red
is yel
is gre
is blu

* matches the preceding subexpression any number of times, including zero

For example, ',.*' matches a comma and all the characters that follow it on the same line.

Of course, * can also follow an ordinary character: 'o*' matches 'oo…'.
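A minimal sketch of the * quantifier in Python. The sample text is made up for illustration and is not from the original article.

```python
import re

# Hypothetical sample text (an assumption for this sketch)
content = '''apple, it is red
banana, it is yellow
sky'''

# ',.*' matches a comma and everything after it up to the end of the line,
# because '.' does not match the newline character
matches = re.findall(r',.*', content)
for m in matches:
    print(m)
```

The third line contains no comma, so it produces no match.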

+ matches the preceding subexpression one or more times, excluding zero

The only difference from * is that + does not allow zero occurrences:
+ requires at least one occurrence (zero is not allowed)

* accepts any number of occurrences (zero is allowed)
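The difference between the two quantifiers can be seen in a small comparison. This is only a sketch; the sample text and the patterns 'ab*c' and 'ab+c' are assumptions chosen for illustration.

```python
import re

text = 'ac abc abbc'  # hypothetical sample text

# 'ab*c' allows zero b's, so it also matches 'ac'
star_matches = re.findall(r'ab*c', text)

# 'ab+c' requires at least one b, so 'ac' is skipped
plus_matches = re.findall(r'ab+c', text)

print(star_matches)  # ['ac', 'abc', 'abbc']
print(plus_matches)  # ['abc', 'abbc']
```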

{ } matches the preceding character a specified number of times

For example, the expression 'o{2,4}' matches the letter o repeated at least 2 and at most 4 times.
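A small sketch of the {m,n} quantifier. The sample text and the single-letter pattern are assumptions made for illustration.

```python
import re

text = 'o oo ooo ooooo'  # hypothetical sample text

# 'o{2,4}' matches between 2 and 4 consecutive o's,
# taking as many as allowed (the quantifier is greedy)
matches = re.findall(r'o{2,4}', text)
print(matches)  # ['oo', 'ooo', 'oooo']
```

The single 'o' is too short to match, and the run of five yields one match of four; the leftover 'o' is again too short.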

2. Greedy and non-greedy modes

Suppose we want to match the tags in the following text one at a time:

<head><title>

With a pattern such as '<.*>', however, everything is matched at once. This is because the pattern only looks for the first '<' and the last '>' and treats everything in between as arbitrary characters. This is greedy mode: it matches as many characters as possible.

To switch to non-greedy mode, add a '?' after the '+' or '*'. Each tag is then matched separately.
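A minimal sketch comparing the two modes. The sample string and the patterns '<.*>' and '<.*?>' are assumptions chosen to illustrate the behavior described above.

```python
import re

source = '<html><head><title>Title</title>'  # hypothetical sample text

# Greedy: '.*' runs to the last '>', so the whole string is one match
greedy = re.findall(r'<.*>', source)

# Non-greedy: '.*?' stops at the first '>', so each tag is matched separately
non_greedy = re.findall(r'<.*?>', source)

print(greedy)      # ['<html><head><title>Title</title>']
print(non_greedy)  # ['<html>', '<head>', '<title>', '</title>']
```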

3. The use of backslashes

The backslash has several uses in regular expressions. One of them is escaping.

For example, to find everything up to and including a literal '.', we need '.*\.'. The backslash in '\.' tells the program that the next character is an ordinary character rather than a metacharacter.

A backslash can also be combined with certain characters to represent a class of characters, for example \d for any digit and \s for any whitespace character.
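Both uses of the backslash can be sketched as follows. The sample texts are made up for illustration.

```python
import re

# Hypothetical sample text (an assumption for this sketch)
content = '''apple.is green
orange.is orange
lemon.is yellow'''

# '.*' matches any characters; '\.' matches a literal dot
before_dot = re.findall(r'.*\.', content)
print(before_dot)  # ['apple.', 'orange.', 'lemon.']

# '\d' is a backslash combination that matches any digit
numbers = re.findall(r'\d+', 'order 66, room 101')
print(numbers)  # ['66', '101']
```

Writing the patterns as raw strings (the r prefix) keeps Python itself from interpreting the backslashes.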

4. Usage of square brackets

Square brackets match any one of the characters they contain: [0123] or [0-3] means the character can be 0, 1, 2, or 3.
They can also hold other characters or ranges, such as [rygb] or [a-z].

For example, 'is [rygb]' matches 'is ' followed by one of the letters r, y, g, or b.
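A minimal sketch of character sets. The sample text and the patterns 'is [rygb]' and '[0-3]' are assumptions for illustration.

```python
import re

# Hypothetical sample text (an assumption for this sketch)
content = '''The apple is red
The banana is yellow
The leaf is green
The sky is blue
The snow is white'''

# [rygb] matches exactly one character out of r, y, g, b
color_starts = re.findall(r'is [rygb]', content)
print(color_starts)  # ['is r', 'is y', 'is g', 'is b']

# [0-3] is a range: any one of 0, 1, 2, 3
small_digits = re.findall(r'[0-3]', 'a1b2c5')
print(small_digits)  # ['1', '2']
```

The last line is skipped because 'w' is not in the set.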

Note that some metacharacters lose their special meaning inside [] and become ordinary characters.
For example, . + * do not need to be escaped there.

For instance, '[.]' finds a literal '.' character.
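A short sketch showing that these metacharacters are treated literally inside square brackets. The sample text is hypothetical.

```python
import re

text = '1+1=2, 3*4=12, v1.0'  # hypothetical sample text

# Inside [], the characters '.', '+' and '*' are ordinary characters
symbols = re.findall(r'[.+*]', text)
print(symbols)  # ['+', '*', '.']

# Outside [], the dot must be escaped to be matched literally
dots = re.findall(r'\.', text)
print(dots)  # ['.']
```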

If ^ is the first character inside [], it negates the set.

For example, '[^\d]' matches any character that is not a digit.
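A minimal sketch of a negated set. The sample text and the pattern '[^\d]' are assumptions for illustration.

```python
import re

text = 'a1b22c'  # hypothetical sample text

# [^\d] matches any single character that is NOT a digit
non_digits = re.findall(r'[^\d]', text)
print(non_digits)  # ['a', 'b', 'c']
```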

5. Match start and end positions

^ matches the start position of the text, but its effect differs between modes.
Regular expressions have two modes that matter here: single-line mode and multi-line mode.
Single-line mode: the whole text is treated as one piece of data, and ^ matches only at the very beginning.
Multi-line mode: each line is treated as a separate piece of data, and ^ matches at the beginning of every line.

For example, given a text whose three lines begin with 001, 002, and 003, single-line matching finds only the 001 on the first line.

With multi-line matching, 001, 002, and 003 are all found.

So how do we choose between single-line and multi-line mode in Python?
Single-line behavior is the default. For multi-line mode, pass the flag re.M (or re.MULTILINE) to compile.
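A minimal sketch of ^ in both modes. The sample text is an assumption built around the 001, 002, 003 line prefixes mentioned above, and the pattern '^\d+' is chosen for illustration.

```python
import re

# Hypothetical sample text (an assumption for this sketch)
content = '''001-apple price-60
002-orange price-70
003-banana price-80'''

# Default (single-line) behavior: ^ matches only at the start of the whole text
single = re.compile(r'^\d+')
print(single.findall(content))  # ['001']

# Multi-line mode: ^ matches at the start of every line
multi = re.compile(r'^\d+', re.M)
print(multi.findall(content))  # ['001', '002', '003']
```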

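$ matches the end position of the text. Its usage is similar to ^, and it also behaves differently in the two modes.

Single-line mode: $ matches only at the end of the whole text (or just before a final trailing newline).

Multi-line mode: $ matches at the end of every line.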
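The same comparison for $ can be sketched like this. The sample text and the pattern '\d+$' are assumptions for illustration.

```python
import re

# Hypothetical sample text (an assumption for this sketch)
content = '''001-apple price-60
002-orange price-70
003-banana price-80'''

# Default (single-line) behavior: $ matches only at the end of the whole text
single_end = re.findall(r'\d+$', content)
print(single_end)  # ['80']

# Multi-line mode: $ matches at the end of every line
multi_end = re.findall(r'\d+$', content, re.MULTILINE)
print(multi_end)  # ['60', '70', '80']
```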

6. Use of parentheses: group selection

Group selection means picking out just the part we need from what the regular expression matched. For example, suppose we want the characters before a comma. We might write '.*,', but the matched text then includes the comma, which we do not want. In that case we use group selection: wrap the part we want in parentheses, as in '(.*),'.

In Python, when the pattern contains a group, findall returns only the text captured by the group.

If there are multiple groups, each result becomes a tuple, and you can extract the corresponding text by tuple index.
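A minimal sketch of group selection with re.findall. The sample text is made up for illustration, and the two-group pattern '(.*)(,)' is just one way to show the tuple behavior.

```python
import re

# Hypothetical sample text (an assumption for this sketch)
content = '''apple, it is red
banana, it is yellow'''

# No group: the whole match is returned, comma included
with_comma = re.findall(r'.*,', content)
print(with_comma)  # ['apple,', 'banana,']

# One group: only the text captured inside the parentheses is returned
names = re.findall(r'(.*),', content)
print(names)  # ['apple', 'banana']

# Two groups: each result is a tuple with one item per group
pairs = re.findall(r'(.*)(,)', content)
print(pairs[0][0])  # apple
```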

Let's do a little practice. Here is a set of data; extract each person's name and telephone number.

Apple, telephone 123131
Banana, telephone 234241
leaf, telephone 245363
sky, telephone 124234

In Python, this can be done with two groups: one for the name and one for the number.
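One possible solution to the exercise, as a sketch rather than the author's original code. The data is copied from the exercise with the spacing normalized, and the pattern is an assumption about the line format (name, comma, some text, digits at the end of the line).

```python
import re

# Exercise data, with spacing normalized
content = '''Apple, telephone 123131
Banana, telephone 234241
leaf, telephone 245363
sky, telephone 124234'''

# Group 1: the name, i.e. everything before the comma
# Group 2: the digits at the end of the line
p = re.compile(r'^(.+),.+?(\d+)$', re.MULTILINE)

records = p.findall(content)
for name, phone in records:
    print(name, phone)
```

Because the pattern has two groups, findall returns one (name, phone) tuple per line.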

7. Splitting strings with regular expressions

The string object's split() method is only suitable for very simple splitting. When you need more flexibility, use regular expressions.
For example:

# We have a set of data here
names = 'Guan Yu; Zhang Fei, d, The professor, Li Yuanfang Judge Dee'

How should this be split?
We can use re.split and specify the separators with a regular expression.
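A minimal sketch of re.split with mixed separators. It uses hypothetical single-word names, because splitting on whitespace would also break apart names that contain spaces; the separator pattern is likewise an assumption.

```python
import re

# Hypothetical names separated by ';', ',' or spaces (an assumption for this sketch)
names = 'alice; bob, carol,dave  erin'

# Separator: one ';', ',' or whitespace character,
# followed by any amount of further whitespace
parts = re.split(r'[;,\s]\s*', names)
print(parts)  # ['alice', 'bob', 'carol', 'dave', 'erin']
```

If the names themselves may contain spaces, a narrower separator such as '[;,]\s*' would be one option.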

Summary

This chapter covers the basics of regular expressions and works well as a short introductory tutorial. I will keep adding to it as I run into more complex usage.