
A Complete Guide to Pandas Text Processing


author | Dongge takes off

source | Python Data Science

This article introduces the common methods for processing text in pandas.


Text has two main types: string and object. Unless the type is explicitly specified as string, text is generally stored as object.

Text operations are performed mainly through the str accessor, which is very powerful. However, the following points should be noted before using it.

  1. The accessor can only be used on Series data structures. Besides regular column variables such as df.col, it can also be used on Index types such as df.index and df.columns.

  2. Make sure the values being accessed are strings (str). If they are not, convert them first with astype(str); otherwise an error is raised.

  3. Accessor calls can be chained, for example df.col.str.lower().str.upper(). This follows the same principle as one-line method chaining on a DataFrame.
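
The three points above can be illustrated with a minimal sketch. The toy DataFrame and the variable names here are hypothetical and are not the article's data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Name": ["Ann", "BOB"], "Age": [18, 30]})

# Point 1: the accessor also works on an Index such as toy.columns
lower_cols = toy.columns.str.lower()

# Point 2: Age is numeric, so .str fails until the column is converted
try:
    toy["Age"].str.len()
    error_message = None
except AttributeError as err:
    error_message = str(err)
age_len = toy["Age"].astype(str).str.len()

# Point 3: accessor calls can be chained; each step returns a new Series
chained = toy["Name"].str.lower().str.capitalize()

print(list(lower_cols), age_len.tolist(), chained.tolist())
```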

The sections below introduce the various text operations, organized into 8 scenarios that cover roughly 95% of everyday data cleaning needs.

All of the operations below are based on the following data:

import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['jordon', 'MIKE', 'Kelvin', 'xiaoLi', 'qiqi','Amei'],
                   'Age':[18, 30, 45, 23, 45, 62],
                   'level':['high','Low','M','L','middle',np.nan],
                   'Email':['jordon@sohu.com','Mike@126.cn','KelvinChai@gmail.com','xiaoli@163.com',np.nan,'amei@qq.com']})
--------------------------------------------
   name   Age   level    Email
0  jordon  18    high    jordon@sohu.com
1  MIKE    30     Low    Mike@126.cn
2  Kelvin  45       M    KelvinChai@gmail.com
3  xiaoLi  23       L    xiaoli@163.com
4  qiqi    45  middle    NaN
5  Amei    62     NaN    amei@qq.com

1、 Text format

Case change

# Convert all characters to lowercase
s.str.lower()
# Convert all characters to uppercase
s.str.upper()
# Capitalize the first letter of each word
s.str.title()
# Capitalize the first letter of the string and lowercase the rest
s.str.capitalize()
# Swap uppercase and lowercase
s.str.swapcase()

These methods are straightforward, so they are not demonstrated one by one. Here is an example that converts the column names to lowercase.

df.columns.str.lower()
--------------------------------------------------------
Index(['name', 'age', 'level', 'email'], dtype='object')

Format checks

The following methods are checks, so each returns Boolean values.

s.str.isalpha()   # whether all characters are letters
s.str.isnumeric() # whether all characters are numeric (includes Unicode numerics such as ½)
s.str.isalnum()   # whether all characters are letters or numbers
s.str.isupper()   # whether all cased characters are uppercase
s.str.islower()   # whether all cased characters are lowercase
s.str.isdigit()   # whether all characters are digits
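
As a small sketch of how these checks behave, the following applies them to a hypothetical Series (not the article's data). It also shows one case where isnumeric and isdigit differ.

```python
import pandas as pd

# Hypothetical sample values chosen to exercise each check
s = pd.Series(["abc", "123", "abc123", "ABC", "½"])

is_alpha = s.str.isalpha()
is_digit = s.str.isdigit()
is_numeric = s.str.isnumeric()  # True for "½", unlike isdigit
is_upper = s.str.isupper()

print(pd.DataFrame({"value": s, "isalpha": is_alpha, "isdigit": is_digit,
                    "isnumeric": is_numeric, "isupper": is_upper}))
```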

Alignment

# Center, width 8, pad with '*'
s.str.center(8, fillchar='*')
# Left-align, width 8, pad with '*'
s.str.ljust(8, fillchar='*')
# Right-align, width 8, pad with '*'
s.str.rjust(8, fillchar='*')
# Custom alignment with adjustable width, side, and fill character
s.str.pad(width=8, side='both',fillchar='*')
# Example
df.name.str.center(8, fillchar='*')
-------------
0    *jordon*
1    **MIKE**
2    *Kelvin*
3    *xiaoLi*
4    **qiqi**
5    **Amei**
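
The relationship between pad and the three dedicated methods can be sketched as follows. The sample Series is hypothetical; note that side='left' pads on the left, which right-aligns the text.

```python
import pandas as pd

# Hypothetical sample values
s = pd.Series(["ab", "abcd"])

# side names the side that receives the fill characters
pad_left = s.str.pad(width=6, side="left", fillchar="*")    # like rjust
pad_right = s.str.pad(width=6, side="right", fillchar="*")  # like ljust
pad_both = s.str.pad(width=6, side="both", fillchar="*")    # like center

print(pad_left.tolist(), pad_right.tolist(), pad_both.tolist())
```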

Counting and encoding

s.str.count('b') # number of times the pattern 'b' occurs in each string
s.str.len() # length of each string
s.str.encode('utf-8') # encode strings to bytes
s.str.decode('utf-8') # decode bytes to strings
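
A minimal sketch of these four methods on a hypothetical Series. It assumes the usual round trip, where decode is applied to the bytes produced by encode.

```python
import pandas as pd

# Hypothetical sample values
s = pd.Series(["banana", "bob"])

counts = s.str.count("b")              # occurrences of the pattern per string
lengths = s.str.len()                  # number of characters per string
encoded = s.str.encode("utf-8")        # Series of bytes objects
decoded = encoded.str.decode("utf-8")  # back to the original strings

print(counts.tolist(), lengths.tolist(), encoded.tolist(), decoded.tolist())
```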

2、 Text splitting

The split method splits text using a specified character as the delimiter. The expand parameter spreads the split parts into separate columns, and the n parameter limits the number of splits, which controls how many columns are produced.

Next, the Email variable is split on @.

# Usage
s.str.split('x', expand=True, n=1)
# Example
df.Email.str.split('@')
----------------------------
0         [jordon, sohu.com]
1            [Mike, 126.cn]
2    [KelvinChai, gmail.com]
3          [xiaoli, 163.com]
4                        NaN
5             [amei, qq.com]
# expand=True spreads the split parts into separate columns
df.Email.str.split('@', expand=True)
----------------------------
   0          1
0  jordon      sohu.com
1  Mike        126.cn
2  KelvinChai  gmail.com
3  xiaoli      163.com
4  NaN         NaN
5  amei        qq.com
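
The effect of the n parameter is easier to see on strings with several delimiters. This is a small sketch on a hypothetical Series, not the article's data.

```python
import pandas as pd

# Hypothetical sample values with more than one delimiter
s = pd.Series(["a-b-c", "d-e"])

# Without n, every "-" is a split point, so three columns are produced
all_parts = s.str.split("-", expand=True)

# With n=1, only the first "-" is used, so two columns are produced
first_split = s.str.split("-", n=1, expand=True)

print(all_parts)
print(first_split)
```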

More complex splitting can be done with regular expressions. For example, to split on both @ and ., the following works.

df.Email.str.split(r'@|\.', expand=True)
----------------------------
   0           1      2
0  jordon      sohu   com
1  Mike        126    cn
2  KelvinChai  gmail  com
3  xiaoli      163    com
4  NaN         NaN    NaN
5  amei        qq     com

3、 Text substitution

There are several methods for replacing text: replace, slice_replace, and repeat.

replace

replace is the most commonly used replacement method. Its parameters are as follows:

  • pat: the content to be replaced, either a string or a regular expression

  • repl: the new content, either a string or a callable

  • regex: whether pat is treated as a regular expression. The default was True before pandas 2.0 and is False from pandas 2.0 onward

# Replace com in Email with cn
df.Email.str.replace('com','cn')
------------------------
0         jordon@sohu.cn
1            Mike@126.cn
2    KelvinChai@gmail.cn
3          xiaoli@163.cn
4                    NaN
5             amei@qq.cn
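
Because the default of regex differs between pandas versions, passing it explicitly avoids surprises. The sketch below uses a hypothetical Series to show how the same pattern behaves as a literal string and as a regular expression.

```python
import pandas as pd

# Hypothetical sample values
s = pd.Series(["a.b", "acb"])

# Literal match: only the exact text "a." is replaced
literal = s.str.replace("a.", "-", regex=False)

# Regex match: "." matches any character, so "ac" is replaced as well
as_regex = s.str.replace("a.", "-", regex=True)

print(literal.tolist(), as_regex.tolist())
```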

The old content can also be written as a regular expression.

# Replace the name before @ with xxx
df.Email.str.replace('(.*?)@', 'xxx@', regex=True)
------------------
0     xxx@sohu.com
1       xxx@126.cn
2    xxx@gmail.com
3      xxx@163.com
4              NaN
5       xxx@qq.com

The new content can also be a callable.

df.Email.str.replace('(.*?)@', lambda x: x.group().upper(), regex=True)
-------------------------
0         JORDON@sohu.com
1             MIKE@126.cn
2    KELVINCHAI@gmail.com
3          XIAOLI@163.com
4                     NaN
5             AMEI@qq.com

Slice replacement

slice_replace replaces a slice of each string. By choosing the slice, specified characters can be kept or removed. The parameters are as follows.

  • start: start position of the slice

  • stop: end position of the slice

  • repl: the new content

The characters from position start up to (but not including) position stop are replaced. If stop is not set, everything from start onward is replaced; likewise, if start is not set, everything before stop is replaced.

df.Email.str.slice_replace(start=1,stop=2,repl='XX')
-------------------------
0         jXXrdon@sohu.com
1             MXXke@126.cn
2    KXXlvinChai@gmail.com
3          xXXaoli@163.com
4                      NaN
5             aXXei@qq.com
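
The three combinations of start and stop can be compared on a single hypothetical string, as in this small sketch.

```python
import pandas as pd

# Hypothetical sample value
s = pd.Series(["abcdef"])

both = s.str.slice_replace(start=1, stop=3, repl="X")  # replaces "bc"
only_start = s.str.slice_replace(start=2, repl="X")    # replaces "cdef"
only_stop = s.str.slice_replace(stop=2, repl="X")      # replaces "ab"

print(both.tolist(), only_start.tolist(), only_stop.tolist())
```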

Repeat

repeat duplicates each string. The repeats parameter sets the number of repetitions.

df.name.str.repeat(repeats=2)
-------------------------
0    jordonjordon
1        MIKEMIKE
2    KelvinKelvin
3    xiaoLixiaoLi
4        qiqiqiqi
5        AmeiAmei
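
repeats also accepts a sequence, giving each element its own count. A minimal sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical sample values
s = pd.Series(["a", "b", "c"])

same = s.str.repeat(repeats=2)            # every string repeated twice
varied = s.str.repeat(repeats=[1, 2, 3])  # one count per element

print(same.tolist(), varied.tolist())
```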

4、 Text concatenation

Text concatenation is done with the cat method. Parameters:

  • others: the sequence(s) to concatenate. If it is not set (None), the current Series is concatenated into a single string

  • sep: the separator used when concatenating

  • na_rep: the replacement string for missing values. By default missing values are not replaced: they are skipped when others is None, and they make the result missing when others is given

  • join: how the indexes are aligned when concatenating, one of left, right, outer, inner; the default is left

The main ways of concatenating are as follows.

1. Concatenate a single Series into one string

As mentioned above, when the others parameter is not set, the method combines the current Series into a single new string.

df.name.str.cat()
-------------------------------
'jordonMIKEKelvinxiaoLiqiqiAmei'
# Set the separator sep to `-`
df.name.str.cat(sep='-')
-------------------------------
'jordon-MIKE-Kelvin-xiaoLi-qiqi-Amei'
# Replace missing values with `*`
df.level.str.cat(sep='-',na_rep='*')
-----------------------
'high-Low-M-L-middle-*'

2. Concatenate a Series with other list-like objects into a new Series

The following first concatenates the name column with a column of *, then with the level column, forming a new Series.

# Chained str.cat calls concatenate multiple columns
df.name.str.cat(['*']*6).str.cat(df.level)
----------------
0    jordon*high
1       MIKE*Low
2       Kelvin*M
3       xiaoLi*L
4    qiqi*middle
5            NaN
# Multiple columns can also be concatenated directly
df.name.str.cat([df.level,df.Email],na_rep='*')
--------------------------------
0      jordonhighjordon@sohu.com
1             MIKELowMike@126.cn
2    KelvinMKelvinChai@gmail.com
3          xiaoLiLxiaoli@163.com
4                    qiqimiddle*
5               Amei*amei@qq.com

The last example above concatenates one Series with multiple objects into a new Series.
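
The join parameter only matters when the indexes differ. The following sketch uses two hypothetical Series with partly overlapping indexes to show how left, inner, and outer alignment change the result.

```python
import pandas as pd

# Hypothetical Series whose indexes only partly overlap
s = pd.Series(["a", "b", "c"], index=[0, 1, 2])
t = pd.Series(["x", "y"], index=[1, 3])

left = s.str.cat(t, join="left", na_rep="-")    # keeps the index of s
inner = s.str.cat(t, join="inner", na_rep="-")  # keeps shared labels only
outer = s.str.cat(t, join="outer", na_rep="-")  # keeps all labels

print(left.to_dict(), inner.to_dict(), outer.to_dict())
```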

5、 Text extraction

Text extraction is done mainly with extract.

extract parameters:

  • pat: the regular expression pattern used for extraction

  • flags: flags from the re library, such as re.IGNORECASE

  • expand: when the pattern has only one capture group, expand=True returns a DataFrame and expand=False returns a Series

# Extract two parts from Email
df.Email.str.extract(pat=r'(.*?)@(.*)\.com')
--------------------
   0          1
0  jordon      sohu
1  NaN         NaN
2  KelvinChai  gmail
3  xiaoli      163
4  NaN         NaN
5  amei        qq
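
Two details of extract are worth a small sketch: named groups become column names, and expand decides the return type when there is a single group. The addresses and group names below are hypothetical.

```python
import pandas as pd

# Hypothetical sample addresses
s = pd.Series(["ann@example.com", "bob@example.cn"])

# Named groups become the column names; a non-matching row is all NaN
named = s.str.extract(r"(?P<user>.*?)@(?P<host>.*)\.com")

# With one capture group, expand controls the return type
as_frame = s.str.extract(r"(.*?)@", expand=True)    # DataFrame
as_series = s.str.extract(r"(.*?)@", expand=False)  # Series

print(named)
print(type(as_frame).__name__, type(as_series).__name__)
```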

6、 Text query

Text queries are done with two methods, find and findall.

find takes simple parameters: pass the string to search for, and it returns the position of its first occurrence in each original string, or -1 if it is not found.

df['@position'] = df.Email.str.find('@')
df[['Email','@position']]
-------------------------------------
    Email                   @position
0   jordon@sohu.com         6.0
1   Mike@126.cn             4.0
2   KelvinChai@gmail.com    10.0
3   xiaoli@163.com          6.0
4   NaN                     NaN
5   amei@qq.com             4.0

The example above returns the position of @ in the Email variable.
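
The -1 result for a missing substring, and how it differs from a missing value, can be seen in this minimal sketch on a hypothetical Series.

```python
import pandas as pd

# Hypothetical sample values, including a string without "@" and a missing value
s = pd.Series(["a@b.com", "no-at-sign", None])

pos = s.str.find("@")

# Found -> index of the first match, not found -> -1, missing -> NaN
print(pos.tolist())
```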

The other query method is findall.

findall parameters:

  • pat: the content to search for, as a regular expression

  • flags: flags from the re library, such as re.IGNORECASE

Unlike find, findall supports regular expressions and returns the matched content itself. It is somewhat similar to extract and can also be used for extraction, but it is less convenient than extract.

df.Email.str.findall(r'(.*?)@(.*)\.com')
--------------------------
0         [(jordon, sohu)]
1                       []
2    [(KelvinChai, gmail)]
3          [(xiaoli, 163)]
4                      NaN
5             [(amei, qq)]

The example above returns the two parts matched by the regular expression, as a list of tuples.
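
The flags parameter can be sketched with re.IGNORECASE on a hypothetical Series. Without capture groups, findall returns a list of matched strings rather than tuples.

```python
import re
import pandas as pd

# Hypothetical sample values
s = pd.Series(["Com com COM", "none"])

case_sensitive = s.str.findall("com")
ignore_case = s.str.findall("com", flags=re.IGNORECASE)

print(case_sensitive.tolist(), ignore_case.tolist())
```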

7、 Text containment

Containment checks are done with the contains method, which returns Boolean values and is generally used together with loc for filtering. Parameters:

  • pat: the string or regular expression to match

  • case: whether matching is case sensitive; True means case sensitive

  • flags: flags from the re library, such as re.IGNORECASE

  • na: the value used to fill missing values

  • regex: whether pat is treated as a regular expression; the default is True

df.Email.str.contains('jordon|com',na='*')
----------
0     True
1    False
2     True
3     True
4        *
5     True
# Use together with loc to filter rows
df.loc[df.Email.str.contains('jordon|com', na=False)]
------------------------------------------
   name    Age  level  Email                 @position
0  jordon  18   high   jordon@sohu.com       6.0
2  Kelvin  45   M      KelvinChai@gmail.com  10.0
3  xiaoLi  23   L      xiaoli@163.com        6.0
5  Amei    62   NaN    amei@qq.com           4.0

One caveat: when contains is combined with loc, the mask must not contain missing values, otherwise an error is raised. Setting na=False treats missing values as non-matches so that the query can complete.
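
The case and regex parameters interact with the pattern in ways that are easy to overlook. This sketch uses a hypothetical Series in which "." matters: as a regular expression it matches any character, and as a literal it matches only a dot.

```python
import pandas as pd

# Hypothetical sample values, including a missing value
s = pd.Series(["a.com", "A.COM", "abcom", None])

# Default regex=True: "." matches any character, so "abcom" also matches
regex_match = s.str.contains(".com", na=False)

# regex=False: only the literal text ".com" matches
literal_match = s.str.contains(".com", regex=False, na=False)

# case=False: ignore upper/lower case
ignore_case = s.str.contains(".com", case=False, regex=False, na=False)

# na=False yields a pure Boolean mask that is safe for filtering
print(s[literal_match].tolist())
```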

8、 Dummy variables from text

get_dummies automatically converts a column variable into dummy variables (indicator variables). This method is often used in feature engineering.

df.name.str.get_dummies()
-------------------------------
  Amei Kelvin MIKE jordon qiqi xiaoLi
0   0     0     0     1     0     0
1   0     0     1     0     0     0
2   0     1     0     0     0     0
3   0     0     0     0     0     1
4   0     0     0     0     1     0
5   1     0     0     0     0     0
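
get_dummies also takes a sep parameter (default '|'), which splits each string into several labels before building the indicator columns. A minimal sketch on a hypothetical multi-label Series:

```python
import pandas as pd

# Hypothetical multi-label values separated by "|"
tags = pd.Series(["a|b", "a", "b|c"])

# Each distinct label becomes its own 0/1 column
dummies = tags.str.get_dummies(sep="|")

print(dummies)
```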

That's all for this article.
