您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Pandas text processing Encyclopedia

編輯：Python

author | Dongge takes off

source | Python Data Science

This time, let's introduce the common methods of text processing .

The two main types of text are string and object. Unless otherwise specified, the type is string, The text type is generally object.

The operation of text is mainly through accessor str To achieve , Very powerful , However, the following points should be paid attention to before use .

Accessors can only access Series Data structure use . In addition to regular column variables df.col outside , You can also change the index type df.Index and df.columns Use
Ensure that the object type accessed is a string str type . If not, first astype(str) Conversion type , Otherwise, an error will be reported
Accessors can be used with multiple connections . Such as df.col.str.lower().str.upper(), This and Dataframe One line operation in is a principle

The following formally introduces the various operations of the text , It can basically cover daily 95% Data cleaning needs , altogether 8 A scenario .

The following operations are based on the following data ：

import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi','Amei'],
                   'Age':[18, 30, 45, 23, 45, 62],
                   'level':['high','Low','M','L','middle',np.nan],
                   'Email':['[email protected]','[email protected]','[email protected]','[email protected]',np.nan,'[email protected]']})
--------------------------------------------
   name   Age   level    Email
0  jordon  18    high    [email protected]
1  MIKE    30     Low    [email protected]
2  Kelvin  45       M    [email protected]
3  xiaoLi  23       L    [email protected]
4  qiqi    45  middle    NaN
5  Amei    62     NaN    [email protected]

1、 Text format

Case change

#  All characters become lowercase
s.str.lower()
#  Characters are all uppercase
s.str.upper()
#  Capitalize the first letter of each word
s.str.title()
#  The first letter of the string is capitalized
s.str.capitalize()
#  Upper and lower case conversion
s.str.swapcase()

The above usage is relatively simple , Don't give examples one by one , Here's an example of columns Examples of lowercase .

df.columns.str.lower()
--------------------------------------------------------
Index(['name', 'age', 'level', 'email'], dtype='object')

Format judgment

The following are judgment operations , Therefore, the Boolean value is returned .

s.str.isalpha #  Is it a letter
s.str.isnumeric #  Is it a number 0-9
s.str.isalnum #  Whether it consists of letters and numbers
s.str.isupper #  Whether it is capitalized
s.str.islower #  Whether it is lowercase
s.str.isdigit #  Is it a number

alignment

#  Align center , Width is 8, Others ’*’ fill
s.str.center(, fillchar='*')
#  Align left , Width is 8, Others ’*’ fill
s.str.ljust(8, fillchar='*')
#  Right alignment , Width is 8, Others ’*’ fill
s.str.rjust(8, fillchar='*')
#  Custom alignment , Parameter adjustable width 、 Align the direction 、 Fill character
s.str.pad(width=8, side='both',fillchar='*')

#  give an example
df.name.str.center(8, fillchar='*')
-------------
0    *jordon*
1    **MIKE**
2    *Kelvin*
3    *xiaoLi*
4    **qiqi**
5    **Amei**

Counting and coding

s.str.count('b') #  A string that contains a specified number of letters
s.str.len() #  String length
s.str.encode('utf-8') #  Character encoding
s.str.decode('utf-8') #  Character decoding

2、 Text splitting

By using split Method can split text by using a specified character as a split point . among ,expand Parameter allows the split content to expand , Form a separate column ,n Parameter can specify the split position to control the formation of several columns .

Next email The variable follows @ To break up .

#  Usage method
s.str.split('x', expand=True, n=1)
#  give an example
df.Email.str.split('@')
----------------------------
0         [jordon, sohu.com]
1            [Mike, 126.cn]
2    [KelvinChai, gmail.com]
3          [xiaoli, 163.com]
4                        NaN
5             [amei, qq.com]

# expand You can expand the split content into a single column
df.Email.str.split('@' ,expand=True)
----------------------------
   0          1
0  jordon      sohu.com
1  Mike        126.cn
2  KelvinChai  gmail.com
3  xiaoli      163.com
4  NaN         NaN
5  amei        qq.com

More complex splitting can be done with the help of regular expressions , For example, I want to pass @ and . To break up , Then it can be realized in this way .

df.Email.str.split('\@|\.',expand=True)
----------------------------
   0           1      2
0  jordon      sohu   com
1  Mike        126    cn
2  KelvinChai  gmail  com
3  xiaoli      163    com
4  NaN         NaN    NaN
5  amei        qq     com

3、 Text substitution

There are several ways to replace text ：replace,slice_replace,repeat

replace Replace

replace Method is the most commonly used alternative , The parameters are as follows ：

pal： Is the replaced content string , It can also be a regular expression
repl： String for new content , It can also be a called function
regex： Used to set whether regular is supported , The default is True

#  take email Species com Replace with cn
df.Email.str.replace('com','cn')
------------------------
0         [email protected]
1            [email protected]
2    [email protected]
3          [email protected]
4                    NaN
5             [email protected]

More complicated , For example, write old content as Regular expressions .

# take @ Replace the previous names with xxx
df.Email.str.replace('(.*?)@','[email protected]')
------------------
0     [email protected]
1       [email protected]
2    [email protected]
3      [email protected]
4              NaN
5       [email protected]

Or write new content as Called function .

df.Email.str.replace('(.*?)@', lambda x:x.group().upper())
-------------------------
0         [email protected]
1             [email protected]
2    [email protected]
3          [email protected]
4                     NaN
5             [email protected]

Slice replacement

slice_replace The replacement is realized by slicing , The specified characters can be retained or deleted by slicing , The parameters are as follows .

start： The starting position
stop： End position
repl： New content to replace

Yes start After slice position and stop Replace before slice position , If not set stop, that start Then replace them all , Similarly, if it is not set start, that stop Replace all before .

df.Email.str.slice_replace(start=1,stop=2,repl='XX')
-------------------------
0         [email protected]
1             [email protected]
2    [email protected]
3          [email protected]
4                      NaN
5             [email protected]

Repeat replacement

repeat It can realize the function of repeated replacement , Parameters repeats Set the number of repetitions .

df.name.str.repeat(repeats=2)
-------------------------
0    jordonjordon
1        MIKEMIKE
2    KelvinKelvin
3    xiaoLixiaoLi
4        qiqiqiqi
5        AmeiAmei

4、 Text splicing

The text is spliced through cat Method realization , Parameters ：

others: Sequences that need to be spliced , If None Not set up , It will automatically splice the current sequence into a string
sep: Separator for splicing
na_rep: Null values are not processed by default , The replacement character of null value is set here .
join: Direction of splicing , Include left, right, outer, inner, The default is left

There are mainly the following splicing methods .

1. Splice a single sequence into a complete string

As mentioned above , When there is no setting ohters When parameters are , This method combines the current sequence into a new string .

df.name.str.cat()
-------------------------------
'jordonMIKEKelvinxiaoLiqiqiAmei'
#  Set up sep The separator is `-`
df.name.str.cat(sep='-')
-------------------------------
'jordon-MIKE-Kelvin-xiaoLi-qiqi-Amei'
#  Assign the missing value to `*`
df.level.str.cat(sep='-',na_rep='*')
-----------------------
'high-Low-M-L-middle-*'

2. Splicing sequence and other class list objects are new sequences

Let's start with name Column sum * Column splicing , then level Column splicing , Form a new sequence .

# str.cat Multi level connection realizes multi column splicing
df.name.str.cat(['*']*6).str.cat(df.level)
----------------
0    jordon*high
1       MIKE*Low
2       Kelvin*M
3       xiaoLi*L
4    qiqi*middle
5            NaN
#  You can also directly splice multiple columns
df.name.str.cat([df.level,df.Email],na_rep='*')
--------------------------------
0      [email protected]
1             [email protected]
2    [email protected]
3          [email protected]
4                    qiqimiddle*
5               Amei*[email protected]

Splice a sequence with multiple objects into a new sequence

5、 Text extraction

Text extraction is mainly through extract To achieve .

extract Parameters ：

pat: Through regular expressions to achieve an extraction of pattern
flags: Regular library re Logo in , such as re.IGNORECASE
expand: When regular extracts only one content , If expand=True The exhibition will return to a DataFrame, Otherwise, return one Series

#  extract email Two contents in
df.Email.str.extract(pat='(.*?)@(.*).com')
--------------------
   0          1
0  jordon      sohu
1  vMike      NaN
2  KelvinChai  gmail
3  xiaoli      163
4  NaN         NaN
5  amei        qq

6、 Text query

adopt find and findall Two ways to achieve .

find The parameters are simple , Directly enter the string to query , Returns the position in the original string , If no query result is found, return -1.

df['@position'] = df.Email.str.find('@')
df[['Email','@position']]
-------------------------------------
    Email                   @position
0   [email protected]         6.0
1   [email protected]             4.0
2   [email protected]    10.0
3   [email protected]          6.0
4   NaN                     NaN
5   [email protected]             4.0

The above example returns @ stay email Position in variable .

Another way to find it is findall

findall Parameters ：

pat: What to look for , regular expression
flag: Regular library re Logo in , such as re.IGNORECASE

findall and find The difference is that regular expressions are supported , And return the details . This method is a little similar to extract, It can also be used to extract , But not as good as extract convenient .

df.Email.str.findall('(.*?)@(.*).com')
--------------------------
0         [(jordon, sohu)]
1                       []
2    [(KelvinChai, gmail)]
3          [(xiaoli, 163)]
4                      NaN
5             [(amei, qq)]

The above example returns two parts of a regular lookup , And appear in the form of tuple list .

7、 The text contains

The text contains through contains Method realization , Returns a Boolean value , In general, and loc The query function is used in conjunction with , Parameters ：

pat: Match string , regular expression
case: Is it case sensitive ,True Express difference
flags: Regular library re Logo in , such as re.IGNORECASE
na: Fill in missing values
regex: Whether to support regular , Default True Support

df.Email.str.contains('jordon|com',na='*')
----------
0     True
1    False
2     True
3     True
4        *
5     True
# 
df.loc[df.Email.str.contains('jordon|com', na=False)]
------------------------------------------
   name    Age  level  Email                 @position
0  jordon  18   high   [email protected]        6.0
2  Kelvin  45   M      [email protected]   10.0
3  xiaoLi  23   L      [email protected]         6.0
5  Amei    62   NaN    [email protected]            4.0

There's a little bit of caution here , If and loc In combination with , Note that there must be no missing values , Otherwise, an error will be reported . Can be set by na=False Ignore missing values and complete the query .

8、 Dummy variable of text

get_dummies A column variable can be automatically generated into a dummy variable （ Dummy variable ）, This method is often used in feature derivation .

df.name.str.get_dummies()
-------------------------------
  Amei Kelvin MIKE jordon qiqi xiaoLi
0   0     0     0     1     0     0
1   0     0     1     0     0     0
2   0     1     0     0     0     0
3   0     0     0     0     0     1
4   0     0     0     0     1     0
5   1     0     0     0     0     0

That's what we're sharing .

Looking back

It's too voluminous ！AI High accuracy of math exam 81%

This Python Artifact can let you touch fish for a long time ！

2D Transformation 3D, Look at NVIDIA's AI“ new ” magic ！

How to use Python Realize the security system of the scenic spot ？

 Share
Point collection
A little bit of praise
Click to see

上一篇文章： Should I choose Python as the first programming language
下一篇文章： GPU practice how to use Python and AI to write automated text articles in 4 lines of code

Python

Python time conversion (timestamp, format time)

Background story In our daily

Using Python to compile student achievement management system (with source code)

In the usual games , We are Ex

python Locust個人學習記錄

Locust學習記錄一、什麼是locust？（一）安裝Loc

【Leetcode刷題Python】40. 組合總和 II

1 題目給定一個候選人編號的集合 candidates 和一

From developing your own first Python library, be a real programmer bilingual

Hello , Im yuechuang . I was i

python猴子選大王（從第q只猴子開始）

這個跟一般的有一點點變化，它不是從1開始的，而是從q開始的，

Python version 3.9 does not allow mixing byte text and non byte text

JSON object processing in Python

Python | file operation | read / write operation of text file

Python processing Excel

Python remote sensing image processing application (17): GDAL converts normalized CSV data into multi band remote sensing images

Python processing time record

Python crawler practice -- Chapter 2: movie data processing and visualization

Take you to learn Python automated testing (VII) -- warning box processing bit

[Python daily skill] print text with fixed width

Data processing tool: Pandas

熱門圖文

hdu 2117 （模擬除法運算） c++-static 與 constexpr作用 ASP FSO文件處理函數大全 ubuntu-Ubuntu 14.4 無法登陸圖形界面 php獲取當前網址url並替換參數或網址的方法 test-求大神幫幫忙，搞不懂了 mb-link如何建立超過100MB的二進制數組？是不是會浪費內存？ java-JAVA在執行程序時如何使用參數指定棧的大小

欄目導航