Simple use of pandas (text processing)

Editor: Python

Create data for text content
Common methods
- startswith() endswith() contains()
- replace() split()
  - split()
Combining regular expressions with DataFrame string methods

API reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.capitalize.html

Create data for text content

import pandas as pd

# Build a small DataFrame whose columns all hold text
df = pd.DataFrame({
    " surname ": [" Zhao "," money "," Grandchildren ", " Li ", " Zhou "],
    " name ": [" wind "," rain "," Thunder ", " electric ", " 3、 ... and "],
    " Home address ": [" Zhejiang Province · Ningbo City ", " Zhejiang Province · hangzhou ", " Sichuan Province · Wuhou District ", " Hunan province · Yueyang Tower ", " Hunan province · Yiyang City "],
    " WeChat ID": ["Tomoplplplut1248781", "Smopopo857", "Adahuhuifhhjfj", "Tull1945121", "ZPWERERTFD599557"],
    " mailbox ": ["[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]"]
})
print(df)

Common methods

Python offers many methods for processing string data, and most of them can be applied to a DataFrame column through the .str accessor. For example, lower() and upper() change the case of letters.

# Convert letters to upper case
print(df[" WeChat ID"].str.upper())
# Compute the length of each string
print(df[" WeChat ID"].str.len())
# Strip whitespace from both ends of each string; lstrip() and rstrip() strip only the left or the right end.
# With an argument, e.g. str.lstrip('8'), the given characters are stripped from the left end instead of whitespace.
print(df[" Home address "].str.strip())
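As a minimal sketch of how these methods behave element by element, the snippet below applies them to a small made-up Series. The variable `ids` and its values are assumptions chosen for illustration and are not part of the article's data.

```python
import pandas as pd

# Hypothetical sample values, not taken from the article
ids = pd.Series(["  abc123 ", "Xy9", "88hello8"])

upper = ids.str.upper()          # upper-case every element
lengths = ids.str.len()          # length of each string, spaces included
stripped = ids.str.strip()       # remove whitespace from both ends
no_eights = ids.str.lstrip("8")  # remove leading '8' characters only

print(upper.tolist())
print(lengths.tolist())
print(stripped.tolist())
print(no_eights.tolist())
```

Each call returns a new Series and leaves `ids` unchanged, so the result has to be assigned back to a column if the change should persist.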

Slice like an ordinary string

# Keep the last 10 characters of each value
print(df[" mailbox "].str[-10:])
'''
0    [email protected]
1    [email protected]
2    [email protected]
3    [email protected]
4    @gmail.com
Name:  mailbox , dtype: object
'''
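The slicing syntax can be sketched on a tiny made-up Series. The addresses below are placeholders invented for this example; `str.slice()` is shown as the method form of the same operation.

```python
import pandas as pd

# Hypothetical placeholder addresses, not the article's data
s = pd.Series(["alice@example.com", "bob@example.org"])

tail = s.str[-4:]              # last four characters of each element
head = s.str[:3]               # first three characters of each element
same_head = s.str.slice(0, 3)  # method form of s.str[:3]

print(tail.tolist())
print(head.tolist())
```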

startswith() endswith() contains()

Starts with, ends with, contains.

First strip the surrounding whitespace:
df[" Home address "] = df[" Home address "].str.strip()

# Addresses that start with "Hunan"
print(df[df[" Home address "].str.startswith("Hunan")])
'''
   surname         name                    Home address         WeChat ID          mailbox
3       Li     electric  Hunan province · Yueyang Tower       Tull1945121  [email protected]
4     Zhou  3、 ... and    Hunan province · Yiyang City  ZPWERERTFD599557  [email protected]
'''

# Addresses that end with "City"
print(df[df[" Home address "].str.endswith("City")])
'''
   surname         name                     Home address            WeChat ID          mailbox
0     Zhao         wind  Zhejiang Province · Ningbo City  Tomoplplplut1248781  [email protected]
4     Zhou  3、 ... and     Hunan province · Yiyang City     ZPWERERTFD599557  [email protected]
'''

# Addresses that contain "yang"
print(df[df[" Home address "].str.contains("yang")])
'''
   surname         name                    Home address         WeChat ID          mailbox
3       Li     electric  Hunan province · Yueyang Tower       Tull1945121  [email protected]
4     Zhou  3、 ... and    Hunan province · Yiyang City  ZPWERERTFD599557  [email protected]
'''
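All three methods return a boolean Series that is then used as a row mask. The sketch below illustrates two details that can matter on real data: missing values and letter case. The Series `addresses` is made up for this example, and the `na`, `case` and `regex` arguments are optional pandas parameters that the article itself does not use.

```python
import pandas as pd

# Hypothetical addresses, including a missing value
addresses = pd.Series(
    ["Hunan · Yueyang", "Zhejiang · Ningbo", None, "hunan · Yiyang"]
)

# na=False turns missing values into False so the mask stays purely boolean
starts = addresses.str.startswith("Hunan", na=False)

# case=False ignores letter case; regex=False treats the pattern as plain text
contains_ci = addresses.str.contains("hunan", case=False, regex=False, na=False)

print(addresses[starts].tolist())
print(addresses[contains_ci].tolist())
```

Note that contains() interprets its pattern as a regular expression by default, so `regex=False` is the safer choice when the search text may include characters such as "." or "+".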

replace() split()

Replace and split.

First strip the surrounding whitespace:
df[" Home address "] = df[" Home address "].str.strip()

# Replace on a copy, so the original "·" separator is still available for the split() examples that follow
replaced = df.copy()
replaced[" Home address "] = replaced[" Home address "].str.replace("·", "--")
print(replaced)
'''
         surname         name                        Home address            WeChat ID          mailbox
0           Zhao         wind    Zhejiang Province -- Ningbo City  Tomoplplplut1248781  [email protected]
1          money         rain       Zhejiang Province -- hangzhou           Smopopo857  [email protected]
2  Grandchildren      Thunder  Sichuan Province -- Wuhou District       Adahuhuifhhjfj  [email protected]
3             Li     electric     Hunan province -- Yueyang Tower          Tull1945121  [email protected]
4           Zhou  3、 ... and       Hunan province -- Yiyang City     ZPWERERTFD599557  [email protected]
'''
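A short sketch of literal versus pattern-based replacement follows. The Series `s` is a made-up example. The `regex` argument is passed explicitly here because, as far as I know, its default differs between pandas versions, and spelling it out keeps the behaviour the same either way.

```python
import pandas as pd

# Hypothetical values for illustration
s = pd.Series(["a.b.c", "1.2"])

# regex=False: "." is an ordinary character
literal = s.str.replace(".", "-", regex=False)

# regex=True: r"\d" is a pattern that matches any digit
pattern = s.str.replace(r"\d", "#", regex=True)

print(literal.tolist())
print(pattern.tolist())
```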

print(df[" Home address "].str.split("·"))
'''
0      [Zhejiang Province , Ningbo City]
1         [Zhejiang Province , hangzhou]
2    [Sichuan Province , Wuhou District]
3       [Hunan province , Yueyang Tower]
4         [Hunan province , Yiyang City]
Name:  Home address , dtype: object
'''

split()

After split(), each element is a list. You can then use get() or [] to take an item from each list, for example:

# Both lines print the same result: the first item of each list
print(df[" Home address "].str.split("·").str.get(0))
print(df[" Home address "].str.split("·").str[0])
'''
0    Zhejiang Province
1    Zhejiang Province
2     Sichuan Province
3       Hunan province
4       Hunan province
Name:  Home address , dtype: object
'''

We can also pass expand=True to split(), which turns the lists above into the columns of a DataFrame:

print(df[" Home address "].str.split("·", expand=True))
'''
                   0               1
0  Zhejiang Province     Ningbo City
1  Zhejiang Province        hangzhou
2   Sichuan Province  Wuhou District
3     Hunan province   Yueyang Tower
4     Hunan province     Yiyang City
'''

Similarly, we can append [] to select the column we want:

print(df[" Home address "].str.split("·", expand=True)[1])
'''
0       Ningbo City
1          hangzhou
2    Wuhou District
3     Yueyang Tower
4       Yiyang City
Name: 1, dtype: object
'''
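One common use of expand=True is to store the pieces as new named columns. The sketch below assumes a made-up frame `df_demo` with a clean `address` column; the names `province` and `city` are illustrative. The optional `n=1` argument limits the split to the first separator.

```python
import pandas as pd

# Hypothetical frame with unpadded values
df_demo = pd.DataFrame({"address": ["Zhejiang·Ningbo", "Hunan·Yiyang"]})

# expand=True yields a two-column DataFrame, assigned to the new columns in order
df_demo[["province", "city"]] = df_demo["address"].str.split("·", n=1, expand=True)

print(df_demo)
```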

Get the mailbox provider

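# Take the part after "@", then keep the text before the first "."
print(df[" mailbox "].str.split("@").str.get(1).str.split(".").str.get(0))
# rstrip(".com") strips any trailing ".", "c", "o" or "m" characters, not the suffix ".com";
# removesuffix() (pandas 1.4+) removes only the exact suffix
print(df[" mailbox "].str.split("@").str[1].str.removesuffix(".com"))
'''
0      163
1       qq
2      126
3      139
4    gmail
Name:  mailbox , dtype: object
'''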
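The difference between stripping a set of characters and removing a suffix is easy to miss, so here is a small sketch of it. The domain names are invented for this example, and `str.removesuffix()` assumes pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical domains; "telecom.com" ends in characters that rstrip() also eats
domains = pd.Series(["gmail.com", "telecom.com"])

# rstrip treats ".com" as the character set {'.', 'c', 'o', 'm'}
by_rstrip = domains.str.rstrip(".com")

# removesuffix removes the exact trailing text ".com" once
by_suffix = domains.str.removesuffix(".com")

print(by_rstrip.tolist())
print(by_suffix.tolist())
```

With the five providers used in the article both approaches happen to give the same answer, which is why the problem does not show up there.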

Combining regular expressions with DataFrame string methods

Extract the letters and the digits from the " WeChat ID" column and separate the two:

# Two capture groups: leading letters, then the digits that follow
wq = r"([a-zA-Z]+)([0-9]+)"
print(df[" WeChat ID"].str.extract(wq, expand=True))
# Each part can also be extracted separately
print(df[" WeChat ID"].str.extract(r"([a-zA-Z]+)", expand=True))
print(df[" WeChat ID"].str.extract(r"([0-9]+)", expand=True))
# Output of the first print:
'''
              0        1
0  Tomoplplplut  1248781
1       Smopopo      857
2           NaN      NaN
3          Tull  1945121
4    ZPWERERTFD   599557
'''
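When the capture groups are named, extract() uses the group names as column labels instead of 0 and 1. The following is a minimal sketch on made-up IDs; the group names `letters` and `digits` are my own choice.

```python
import pandas as pd

# Hypothetical IDs; the second one has no digits
ids = pd.Series(["Tom123", "abcdef", "Z9"])

# Named groups (?P<name>...) become the column names of the result
parts = ids.str.extract(r"(?P<letters>[a-zA-Z]+)(?P<digits>[0-9]+)")

print(parts)
```

Rows that do not match the whole pattern, such as "abcdef" here, come back as NaN in every column.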

Regular expressions can also be used the other way round, to check whether the text data follows a given pattern:

wq = r"([a-zA-Z]+)([0-9]+)"
print(df[" WeChat ID"].str.match(wq))
'''
0     True
1     True
2    False
3     True
4     True
Name:  WeChat ID, dtype: bool
'''
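match() only requires the pattern to match at the start of each string. The sketch below contrasts it with fullmatch() and contains() on made-up IDs, assuming a pandas version that provides `str.fullmatch()` (1.1 or later, to my knowledge).

```python
import pandas as pd

# Hypothetical IDs chosen so the three methods disagree
ids = pd.Series(["abc123", "abc123!", "123abc4"])
pattern = r"[a-zA-Z]+[0-9]+"

at_start = ids.str.match(pattern)     # pattern must match from the first character
whole = ids.str.fullmatch(pattern)    # pattern must cover the entire string
anywhere = ids.str.contains(pattern)  # pattern may match at any position

print(at_start.tolist())
print(whole.tolist())
print(anywhere.tolist())
```

For validating that an ID consists of nothing but letters followed by digits, fullmatch() is probably the closest fit of the three.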

The third row is False because it does not follow the letters + digits pattern. Going one step further, we can use this result as a mask to select the rows that match; negating the mask with ~ selects the rows that do not match.

wq = r"([a-zA-Z]+)([0-9]+)"
mask = df[" WeChat ID"].str.match(wq)
# Rows that follow the pattern
print(df[mask])
# Rows that do not follow the pattern: negate the mask, not the filtered DataFrame
print(df[~mask])
'''
   surname         name                     Home address            WeChat ID          mailbox
0     Zhao         wind  Zhejiang Province · Ningbo City  Tomoplplplut1248781  [email protected]
1    money         rain     Zhejiang Province · hangzhou           Smopopo857  [email protected]
3       Li     electric   Hunan province · Yueyang Tower          Tull1945121  [email protected]
4     Zhou  3、 ... and     Hunan province · Yiyang City     ZPWERERTFD599557  [email protected]
'''
'''
         surname     name                       Home address       WeChat ID          mailbox
2  Grandchildren  Thunder  Sichuan Province · Wuhou District  Adahuhuifhhjfj  [email protected]
'''
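The mask-and-negate step can be sketched on a made-up frame that also contains a missing value. `df_demo` and its `id` column are assumptions for illustration; `na=False` is an optional argument that keeps the mask strictly boolean so that ~ behaves as expected.

```python
import pandas as pd

# Hypothetical frame; the last ID is missing
df_demo = pd.DataFrame({"id": ["abc123", "abcdef", None]})

# na=False makes missing values count as "does not match"
mask = df_demo["id"].str.match(r"[a-zA-Z]+[0-9]+", na=False)

matching = df_demo[mask]       # rows that follow the pattern
not_matching = df_demo[~mask]  # every other row, including the missing one

print(matching)
print(not_matching)
```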