
Usage of pandas apply() function


Hello everyone, good to see you again. I'm your friend Quan Jun.

To understand pandas functions, you need some grasp of the concepts of functional programming. Functional programming, including the functional way of thinking, is of course a complicated topic, but for the apply() function introduced today it is enough to understand this: a function is an object, so it can be passed to other functions as an argument and can also be returned from a function.

Treating functions as objects can greatly change the style of code. For example, suppose a variable of type list holds the numbers from 1 to 10, and we need to find all the numbers divisible by 3. The traditional way:

def can_divide_by_three(number):
    if number % 3 == 0:
        return True
    else:
        return False

selected_numbers = []
for number in range(1, 11):
    if can_divide_by_three(number):
        selected_numbers.append(number)

The loop is unavoidable. Because can_divide_by_three() is used only once, consider simplifying it with a lambda expression:

divide_by_three = lambda x: True if x % 3 == 0 else False
selected_numbers = []
for number in range(1, 11):
    if divide_by_three(number):
        selected_numbers.append(number)

The above is the traditional programming mindset; functional programming thinks quite differently. We can look at it this way: to pick out of a list the numbers that satisfy a specific rule, can we focus only on setting the rule and leave the looping to the programming language? Certainly. When programmers only care about the rule (which may be a condition or may be defined by some function), the code becomes much simpler and more readable.

Python provides the filter() function, with the following syntax:

filter(function, sequence)

filter() calls function(item) on each item of sequence in turn and keeps the items for which the result is true. In Python 2 it returned a list, string, or tuple depending on the type of sequence; in Python 3 it returns a lazy iterator, so wrap it in list() to get a list. With this function, the code above can be simplified to:

divide_by_three = lambda x: True if x % 3 == 0 else False
selected_numbers = list(filter(divide_by_three, range(1, 11)))

Putting the lambda expression inline reduces the code to one statement:

selected_numbers = list(filter(lambda x: x % 3 == 0, range(1, 11)))
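As a minimal sketch of the filter() behaviour described above, assuming Python 3 (where filter() returns a lazy iterator rather than a list), the following compares it with an equivalent list comprehension. The variable names are illustrative only.

```python
# Assumes Python 3: filter() returns a lazy iterator, so list() is
# needed to materialize the selected values.
numbers = range(1, 11)

lazy = filter(lambda x: x % 3 == 0, numbers)
selected_numbers = list(lazy)

# The same rule written as a list comprehension, for comparison.
selected_by_comprehension = [x for x in numbers if x % 3 == 0]

print(selected_numbers)  # [3, 6, 9]
```

Both forms state only the rule and leave the looping to Python.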

Series.apply()

Back to the topic. The pandas apply() function can act on a Series or on a whole DataFrame. It traverses the Series or DataFrame automatically: Series.apply() runs the specified function on each element, and DataFrame.apply() runs it on each column or row.

For example, here is a set of student test results:

Name   Nationality  Score
Zhang  Han          400
Li     Hui          450
Wang   Han          460

If a student is not of Han nationality, 5 bonus points are added to the test score. We will do this calculation in pandas by adding a column to the DataFrame. If you only want the result, numpy.where() is simpler; the point here is to demonstrate Series.apply().

import pandas as pd
df = pd.read_csv("studuent-score.csv")
df['ExtraScore'] = df['Nationality'].apply(lambda x: 5 if x != 'Han' else 0)
df['TotalScore'] = df['Score'] + df['ExtraScore']

For the Nationality column, pandas traverses each value, runs the lambda on it, and returns the results in a new Series. In a Jupyter notebook, the code above displays the following result:

   Name   Nationality  Score  ExtraScore  TotalScore
0  Zhang  Han          400    0           400
1  Li     Hui          450    5           455
2  Wang   Han          460    0           460
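The numpy.where() alternative mentioned earlier can be sketched as follows. This is only an illustration: the DataFrame is rebuilt inline instead of being read from the CSV file, and the nationality values are assumed to be written as "Han" and "Hui".

```python
import numpy as np
import pandas as pd

# Sample data rebuilt inline (assumed values) so the sketch does not
# depend on the CSV file.
df = pd.DataFrame({
    "Name": ["Zhang", "Li", "Wang"],
    "Nationality": ["Han", "Hui", "Han"],
    "Score": [400, 450, 460],
})

# Element-wise version with Series.apply().
extra_apply = df["Nationality"].apply(lambda x: 5 if x != "Han" else 0)

# Vectorized version: np.where(condition, value_if_true, value_if_false).
extra_where = np.where(df["Nationality"] != "Han", 5, 0)

df["ExtraScore"] = extra_where
df["TotalScore"] = df["Score"] + df["ExtraScore"]
print(df)
```

Both versions give the same bonus column. The numpy.where() version evaluates the condition on the whole column at once instead of calling a Python function per value.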

apply() can of course also run Python built-in functions. For example, to get the number of characters in each value of the Name column:

df['NameLength'] = df['Name'].apply(len)

Passing a function with extra parameters to apply()

According to the pandas documentation for Series.apply (pandas.Series.apply, pandas 1.3.1 documentation), the function passed in can receive additional positional or keyword arguments. The syntax is:

Series.apply(func, convert_dtype=True, args=(), **kwargs)

The function passed as func must take the value being processed as its first parameter. Any parameters of func() after the first are treated as additional parameters and are supplied through args or keyword arguments. Continuing the example above, suppose every nationality other than Han receives bonus points, and we want to pass the number of bonus points as a parameter of the function. We define an add_extra() function:

def add_extra(nationality, extra):
    if nationality != "Han":
        return extra
    else:
        return 0

Add a new column to df:

df['ExtraScore'] = df.Nationality.apply(add_extra, args=(5,))

Positional arguments are passed through args=(), which must be a tuple. You can also pass the value as a keyword argument:

df['ExtraScore'] = df.Nationality.apply(add_extra, extra=5)

The result is:

   Name   Nationality  Score  ExtraScore
0  Zhang  Han          400    0
1  Li     Hui          450    5
2  Wang   Han          460    0
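As a small check that the two calling styles are equivalent, here is a sketch that applies add_extra() both ways to a standalone Series. The Series contents ("Han", "Hui", "Han") are assumed sample values.

```python
import pandas as pd

def add_extra(nationality, extra):
    # Return the bonus for every nationality other than "Han".
    return extra if nationality != "Han" else 0

# Assumed sample values standing in for df.Nationality.
nationality = pd.Series(["Han", "Hui", "Han"])

# Positional: args must be a tuple, so a single value needs a trailing comma.
by_position = nationality.apply(add_extra, args=(5,))

# Keyword: the name must match the parameter name in add_extra().
by_keyword = nationality.apply(add_extra, extra=5)

print(by_position.equals(by_keyword))  # True
```

Note that args=(5) without the comma is just the integer 5, not a tuple, so the trailing comma matters.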

The same logic written as a lambda:

df['Extra'] = df.Nationality.apply(lambda n, extra: extra if n != 'Han' else 0, args=(5,))

Next, keyword arguments. Suppose different nationalities receive different bonus points. Define an add_extra2() function:

def add_extra2(nationality, **kwargs):
    # Look up the bonus for this nationality among the keyword arguments.
    return kwargs[nationality]

df['Extra'] = df.Nationality.apply(add_extra2, Han=0, Hui=10, Tibetan=5)

The running result is ：

   Name   Nationality  Score  Extra
0  Zhang  Han          400    0
1  Li     Hui          450    10
2  Wang   Han          460    0

Compare this with the signature of apply() above and it is not hard to understand.
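One caveat with looking up keyword arguments directly is that a value with no matching keyword raises a KeyError. A possible variant, sketched here under the assumption that unknown values should get a default bonus, uses dict.get(). The function add_extra_safe() and its default parameter are hypothetical, not part of the original example.

```python
import pandas as pd

def add_extra_safe(nationality, default=0, **kwargs):
    # Hypothetical variant of add_extra2(): fall back to `default` when
    # no keyword argument matches the nationality value.
    return kwargs.get(nationality, default)

# Assumed sample values; "Manchu" has no matching keyword argument.
nationality = pd.Series(["Han", "Hui", "Manchu"])

extra = nationality.apply(add_extra_safe, Han=0, Hui=10, Tibetan=5)
print(extra.tolist())  # [0, 10, 0]
```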

DataFrame.apply()

DataFrame.apply() passes each column (or each row) of the DataFrame to the specified function as a Series. Consider the following example:

import pandas as pd
import numpy as np

matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abc'))
df.apply(np.square)

After applying np.square() to df, all the elements are squared:

    x   y   z
a   1   4   9
b  16  25  36
c  49  64  81

If you want apply() to act only on specific rows or columns, you can test the name attribute of the row or column. For example, the following squares column x:

df.apply(lambda x : np.square(x) if x.name=='x' else x)

The following squares columns x and y:

df.apply(lambda x : np.square(x) if x.name in ['x', 'y'] else x)

The following squares the first row (the row labeled a):

df.apply(lambda x : np.square(x) if x.name == 'a' else x, axis=1)

By default axis=0, which applies the function to each column; axis=1 applies it to each row.
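To make the effect of axis concrete, here is a short sketch using the same 3x3 matrix. It sums along each axis and also returns the name attribute that the function sees for each column or row.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=list("xyz"),
    index=list("abc"),
)

# axis=0 (default): the function receives each column as a Series.
col_sums = df.apply(lambda s: s.sum())
col_names = df.apply(lambda s: s.name)

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda s: s.sum(), axis=1)
row_names = df.apply(lambda s: s.name, axis=1)

print(col_sums.to_dict())  # {'x': 12, 'y': 15, 'z': 18}
print(row_sums.to_dict())  # {'a': 6, 'b': 15, 'c': 24}
```

With axis=0 the name is the column label, and with axis=1 it is the row label, which is why the earlier examples can test x.name against 'x' or 'a'.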

Example: calculating a date difference with apply()

Date calculations are common, for example calculating the interval between two dates. Take the following start and end dates for a set of WBS items:

 wbs date_from date_to
job1 2019-04-01 2019-05-01
job2 2019-04-07 2019-05-17
job3 2019-05-16 2019-05-31
job4 2019-05-20 2019-06-11

Suppose we want the number of days between the start and end dates. The simpler way is to subtract the two columns (as datetime type):

import pandas as pd
import datetime as dt

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
# The parentheses let the expression continue on the next line.
df['elapsed'] = (df['date_to'].apply(pd.to_datetime) -
                 df['date_from'].apply(pd.to_datetime))

apply() converts the date_from and date_to columns to datetime type. Printing df gives:

 wbs date_from date_to elapsed
0 job1 2019-04-01 2019-05-01 30 days
1 job2 2019-04-07 2019-05-17 40 days
2 job3 2019-05-16 2019-05-31 15 days
3 job4 2019-05-20 2019-06-11 22 days

The interval has been calculated, but each value carries the unit days. That is because subtracting two datetime values yields a timedelta64 value. To get only the number, use the days attribute of the timedelta:

elapsed = (df['date_to'].apply(pd.to_datetime) -
           df['date_from'].apply(pd.to_datetime))
df['elapsed'] = elapsed.apply(lambda x: x.days)
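For comparison, the same day counts can be obtained without apply() at all. The sketch below is an alternative to the code above, not the article's method: it calls pd.to_datetime() on whole columns and reads the days through the .dt accessor.

```python
import pandas as pd

df = pd.DataFrame({
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"],
})

# pd.to_datetime() accepts a whole Series, so no per-element apply() is needed.
elapsed = pd.to_datetime(df["date_to"]) - pd.to_datetime(df["date_from"])

# The .dt accessor exposes timedelta components for the whole column.
df["elapsed"] = elapsed.dt.days

print(df["elapsed"].tolist())  # [30, 40, 15, 22]
```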

To do the same thing with DataFrame.apply(), we first define a function get_interval_days(). Its first parameter is a Series; with axis=1, it receives each row of the DataFrame in turn.

import pandas as pd
import datetime as dt

def get_interval_days(arrLike, start, end):
    # arrLike is one row of the DataFrame; start and end are column names.
    start_date = dt.datetime.strptime(arrLike[start], '%Y-%m-%d')
    end_date = dt.datetime.strptime(arrLike[end], '%Y-%m-%d')
    return (end_date - start_date).days

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
df['elapsed'] = df.apply(
    get_interval_days, axis=1, args=('date_from', 'date_to'))

References

Pandas apply function: the most useful function in Pandas
pandas.Series.apply, pandas 1.3.1 documentation

Publisher: Full Stack Programmer. When reprinting, please credit the source: https://javaforall.cn/152100.html Original link: https://javaforall.cn