API reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.capitalize.html
import pandas as pd
df = pd.DataFrame({
    " surname ": [" Zhao ", " money ", " Grandchildren ", " Li ", " Zhou "],
    " name ": [" wind ", " rain ", " Thunder ", " electric ", " 3、 ... and "],
    " Home address ": [" Zhejiang Province · Ningbo City ", " Zhejiang Province · Hangzhou City ", " Sichuan Province · Wuhou District ", " Hunan province · Yueyang Tower ", " Hunan province · Yiyang City "],
    " WeChat ID": ["Tomoplplplut1248781", "Smopopo857", "Adahuhuifhhjfj", "Tull1945121", "ZPWERERTFD599557"],
    " mailbox ": ["[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]"]
})
print(df)
Python offers many ways to process string data, and most of them can be applied to a DataFrame column through the .str accessor. For example, the lower() and upper() methods change the case of letters.
# Convert letters to upper case
print(df[" WeChat ID"].str.upper())
# Calculate character length
print(df[" WeChat ID"].str.len())
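The API reference linked at the top also covers capitalize(). As a minimal sketch, assuming a small made-up Series (the values below are illustrative and not part of the DataFrame above), the case-changing methods behave like this:

```python
import pandas as pd

# Hypothetical sample values, chosen only to show the difference between methods.
ids = pd.Series(["tomo1248781", "SMOPOPO857", "ada lovelace"])

lowered = ids.str.lower()            # every letter to lower case
capitalized = ids.str.capitalize()   # first character upper case, the rest lower case
titled = ids.str.title()             # first letter of every word upper case

print(pd.DataFrame({"lower": lowered, "capitalize": capitalized, "title": titled}))
```

Each method returns a new Series; the original column is left unchanged unless the result is assigned back.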
# strip() removes whitespace from both ends of each string; lstrip() and rstrip() remove it from the left or right end only.
# Passing an argument, e.g. str.lstrip('8'), removes the given characters from the left end instead of whitespace.
print(df[" Home address "].str.strip())
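The three strip variants can be compared side by side. This is a small sketch on made-up values (not the address column above), showing that the default removes whitespace and that an argument is treated as a set of characters to remove:

```python
import pandas as pd

# Hypothetical values with padding, a trailing newline and stray '8' characters.
s = pd.Series(["  Ningbo  ", "88Hangzhou8", "Yiyang\n"])

both = s.str.strip()        # whitespace (including the newline) removed from both ends
left8 = s.str.lstrip("8")   # leading '8' characters removed, everything else untouched
right8 = s.str.rstrip("8")  # trailing '8' characters removed, everything else untouched

print(both.tolist())
print(left8.tolist())
print(right8.tolist())
```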
Slice like an ordinary string
print(df[" mailbox "].str[-10:])
'''
0    [email protected]
1    [email protected]
2    [email protected]
3    [email protected]
4    @gmail.com
Name: mailbox, dtype: object
'''
Starts with, ends with, contains
First strip the surrounding whitespace
df[" Home address "] = df[" Home address "].str.strip()
# Rows whose home address is in Hunan
print(df[df[" Home address "].str.startswith("Hunan")])
'''
   surname  name        Home address                    WeChat ID         mailbox
3  Li       electric    Hunan province · Yueyang Tower  Tull1945121       [email protected]
4  Zhou     3、 ... and  Hunan province · Yiyang City    ZPWERERTFD599557  [email protected]
'''
# Rows whose home address ends with "City" (the trailing space was removed by strip())
print(df[df[" Home address "].str.endswith("City")])
'''
   surname  name        Home address                       WeChat ID            mailbox
0  Zhao     wind        Zhejiang Province · Ningbo City    Tomoplplplut1248781  [email protected]
1  money    rain        Zhejiang Province · Hangzhou City  Smopopo857           [email protected]
4  Zhou     3、 ... and  Hunan province · Yiyang City       ZPWERERTFD599557     [email protected]
'''
# Rows whose home address contains "yang" (as in Yueyang and Yiyang)
print(df[df[" Home address "].str.contains("yang")])
'''
   surname  name        Home address                    WeChat ID         mailbox
3  Li       electric    Hunan province · Yueyang Tower  Tull1945121       [email protected]
4  Zhou     3、 ... and  Hunan province · Yiyang City    ZPWERERTFD599557  [email protected]
'''
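When a column contains missing values or characters that are special in regular expressions, a few optional parameters of these methods become useful. The following is a sketch on made-up addresses (the Series and its values are assumptions for illustration): na=False makes missing entries count as non-matches so the result can be used directly as a Boolean mask, case=False ignores letter case, and regex=False treats the pattern as literal text.

```python
import pandas as pd

# Hypothetical addresses, including a missing value and a literal parenthesis.
addresses = pd.Series([
    "Hunan province · Yueyang Tower",
    "Zhejiang Province · Ningbo City",
    None,
    "Sichuan (Wuhou)",
])

# Case-insensitive search; the missing entry is reported as False instead of NaN.
has_yang = addresses.str.contains("YANG", case=False, na=False)

# "(" is a regex metacharacter, so search for it as plain text.
has_paren = addresses.str.contains("(", regex=False, na=False)

# startswith() and endswith() accept na= as well.
in_hunan = addresses.str.startswith("Hunan", na=False)

print(addresses[has_yang])
```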
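Replace and split
First strip the surrounding whitespace
df[" Home address "] = df[" Home address "].str.strip()
# Work on a copy so that the "·" separator stays in df for the split() examples that follow
replaced = df.copy()
replaced[" Home address "] = replaced[" Home address "].str.replace("·", "--", regex=False)
print(replaced)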
'''
   surname        name        Home address                        WeChat ID            mailbox
0  Zhao           wind        Zhejiang Province -- Ningbo City    Tomoplplplut1248781  [email protected]
1  money          rain        Zhejiang Province -- Hangzhou City  Smopopo857           [email protected]
2  Grandchildren  Thunder     Sichuan Province -- Wuhou District  Adahuhuifhhjfj       [email protected]
3  Li             electric    Hunan province -- Yueyang Tower     Tull1945121          [email protected]
4  Zhou           3、 ... and  Hunan province -- Yiyang City       ZPWERERTFD599557     [email protected]
'''
print(df[" Home address "].str.split("·"))
'''
0    [Zhejiang Province, Ningbo City]
1    [Zhejiang Province, Hangzhou City]
2    [Sichuan Province, Wuhou District]
3    [Hunan province, Yueyang Tower]
4    [Hunan province, Yiyang City]
Name: Home address, dtype: object
'''
After split(), each value becomes a list, and its elements can then be retrieved with the get() method or with [ ], for example:
print(df[" Home address "].str.split("·").str.get(0))
print(df[" Home address "].str.split("·").str[0])
'''
0    Zhejiang Province
1    Zhejiang Province
2    Sichuan Province
3    Hunan province
4    Hunan province
Name: Home address, dtype: object
'''
We can also pass expand=True to split(), which turns the lists produced by split() into the columns of a DataFrame
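One detail worth knowing about get() and [ ] on split results: when a row has fewer parts than the requested position, the result is NaN rather than an error. A minimal sketch, assuming made-up values where one row has no separator:

```python
import pandas as pd

# Hypothetical values; the last one contains no "·" separator.
s = pd.Series(["Zhejiang·Ningbo", "Hunan·Yiyang", "Sichuan"])

parts = s.str.split("·")
first = parts.str.get(0)    # first element of every list
second = parts.str.get(1)   # NaN where the list has no second element
last = parts.str[-1]        # negative positions count from the end

print(second)
```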
print(df[" Home address "].str.split("·", expand=True))
'''
   0                  1
0  Zhejiang Province  Ningbo City
1  Zhejiang Province  Hangzhou City
2  Sichuan Province   Wuhou District
3  Hunan province     Yueyang Tower
4  Hunan province     Yiyang City
'''
Similarly, we can append [ ] to select the column we want
print(df[" Home address "].str.split("·", expand=True)[1])
'''
0    Ningbo City
1    Hangzhou City
2    Wuhou District
3    Yueyang Tower
4    Yiyang City
Name: 1, dtype: object
'''
Get the mailbox provider
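The expanded result has integer column labels by default. As a sketch on made-up values (the column names "province" and "rest" are assumptions chosen for illustration), the n parameter limits the number of splits and the columns can be renamed afterwards:

```python
import pandas as pd

# Hypothetical values; the first one has two separators.
s = pd.Series(["Zhejiang·Ningbo·Haishu", "Hunan·Yiyang"])

# Without a limit, shorter rows are padded with missing values.
full = s.str.split("·", expand=True)

# n=1 splits only at the first separator, so there are always two columns here.
parts = s.str.split("·", n=1, expand=True)
parts.columns = ["province", "rest"]

print(full)
print(parts)
```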
print(df[" mailbox "].str.split("@").str.get(1).str.split(".").str.get(0))
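# rstrip(".com") would strip any trailing '.', 'c', 'o' or 'm' characters, so remove the exact ".com" suffix with a regex instead
print(df[" mailbox "].str.split("@").str[1].str.replace(r"\.com$", "", regex=True))
'''
0    163
1    qq
2    126
3    139
4    gmail
Name: mailbox, dtype: object
'''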
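A caveat when trimming a suffix: the argument of rstrip() is a set of characters, not a substring, so it can remove more than intended. The sketch below uses made-up domain names to show the difference from removing the exact suffix with an anchored regular expression:

```python
import pandas as pd

# Hypothetical domains; "mac.com" ends in characters that also occur in ".com".
domains = pd.Series(["gmail.com", "mac.com", "163.com"])

# Strips every trailing character that is one of '.', 'c', 'o', 'm'.
stripped = domains.str.rstrip(".com")

# Removes exactly one ".com" at the end of the string.
exact = domains.str.replace(r"\.com$", "", regex=True)

print(stripped.tolist())
print(exact.tolist())
```

Recent pandas versions also provide str.removesuffix() for this purpose; the regex form is used here only because it works on older versions too.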
Extract the letters and the digits from the "WeChat ID" column and separate the two
wq = "([a-zA-Z]+)([0-9]+)"
print(df[" WeChat ID"].str.extract(wq, expand=True))
# The letters and the digits can also be extracted separately
print(df[" WeChat ID"].str.extract("([a-zA-Z]+)", expand=True))
print(df[" WeChat ID"].str.extract("([0-9]+)", expand=True))
'''
   0             1
0  Tomoplplplut  1248781
1  Smopopo       857
2  NaN           NaN
3  Tull          1945121
4  ZPWERERTFD    599557
'''
Seen from another angle, a regular expression can also be used to check whether text data follows a given pattern
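The columns of the extracted DataFrame are numbered 0 and 1 by default. If the pattern uses named groups, extract() uses the group names as column labels. A minimal sketch with three of the IDs from above (the group names "letters" and "digits" are chosen here for illustration):

```python
import pandas as pd

ids = pd.Series(["Tull1945121", "Adahuhuifhhjfj", "Smopopo857"])

# (?P<name>...) gives each capture group a name, which becomes the column label.
pattern = r"(?P<letters>[a-zA-Z]+)(?P<digits>[0-9]+)"
parts = ids.str.extract(pattern)

print(parts)
```

Rows that do not match the whole pattern, such as the ID without digits, are filled with NaN in every column.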
wq = "([a-zA-Z]+)([0-9]+)"
print(df[" WeChat ID"].str.match(wq))
'''
0     True
1     True
2    False
3     True
4     True
Name: WeChat ID, dtype: bool
'''
In the match() result, the third row (index 2) is False because that ID contains only letters and does not follow the letters-then-digits pattern. Going one step further, the Boolean result can be used as a mask to select the rows that satisfy the pattern; negating the mask with ~ selects the rows that do not.
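Note that match() only requires the pattern to match at the start of the string, so trailing characters are ignored. The sketch below, on made-up IDs, compares it with fullmatch(), which requires the whole string to match, and contains(), which accepts a match anywhere:

```python
import pandas as pd

# Hypothetical IDs: a clean one, one with a trailing "!", one starting with digits, one without digits.
ids = pd.Series(["Tull1945121", "Tull1945121!", "12Tull34", "Adahuhuifhhjfj"])
pattern = r"[a-zA-Z]+[0-9]+"

at_start = ids.str.match(pattern)      # pattern must match from the first character
whole = ids.str.fullmatch(pattern)     # pattern must cover the entire string
anywhere = ids.str.contains(pattern)   # pattern may match at any position

print(pd.DataFrame({"id": ids, "match": at_start, "fullmatch": whole, "contains": anywhere}))
```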
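wq = "([a-zA-Z]+)([0-9]+)"
# Rows whose WeChat ID matches the pattern
print(df[df[" WeChat ID"].str.match(wq)])
# Rows that do not match: negate the Boolean mask, not the selected DataFrame
print(df[~df[" WeChat ID"].str.match(wq)])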
'''
   surname  name        Home address                       WeChat ID            mailbox
0  Zhao     wind        Zhejiang Province · Ningbo City    Tomoplplplut1248781  [email protected]
1  money    rain        Zhejiang Province · Hangzhou City  Smopopo857           [email protected]
3  Li       electric    Hunan province · Yueyang Tower     Tull1945121          [email protected]
4  Zhou     3、 ... and  Hunan province · Yiyang City       ZPWERERTFD599557     [email protected]
'''
'''
   surname        name     Home address                       WeChat ID       mailbox
2  Grandchildren  Thunder  Sichuan Province · Wuhou District  Adahuhuifhhjfj  [email protected]
'''
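The filtering step can be written more compactly by storing the mask once and reusing it. This is a small sketch on a made-up two-column frame (the column names "id" and "n" are assumptions for illustration), showing that ~ is applied to the Boolean Series before indexing:

```python
import pandas as pd

# Hypothetical frame with one matching and one non-matching ID.
frame = pd.DataFrame({"id": ["Tull1945121", "Adahuhuifhhjfj"], "n": [1, 2]})

mask = frame["id"].str.match(r"[a-zA-Z]+[0-9]+")  # Boolean Series, one value per row

matching = frame[mask]        # rows where the mask is True
non_matching = frame[~mask]   # rows where the mask is False

print(matching)
print(non_matching)
```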