Hello everyone, good to see you again. I'm your friend, Quan Jun.
To understand the pandas apply() function, it helps to have some concept of functional programming. Functional programming, including the functional way of thinking, is of course a complicated topic. For today's introduction to apply(), though, you only need to understand one thing: a function is an object, so it can be passed to other functions as an argument and can also be used as the return value of a function.
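As a minimal sketch of that idea, the snippet below passes one function to another and returns a function from a function. The names apply_twice and make_adder are made up for this illustration and do not come from pandas or the standard library.

```python
def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))


def make_adder(n):
    """Return a new function that adds n to its argument."""
    def adder(x):
        return x + n
    return adder


add_three = make_adder(3)            # a function used as a return value
result = apply_twice(add_three, 10)  # a function passed as an argument
print(result)  # 16
```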
Treating functions as objects can change the style of code considerably. For example, suppose there is a variable of type list containing the numbers from 1 to 10, and we need to find all the numbers that are divisible by 3. In the traditional way:
def can_divide_by_three(number):
    if number % 3 == 0:
        return True
    else:
        return False

selected_numbers = []
for number in range(1, 11):
    if can_divide_by_three(number):
        selected_numbers.append(number)
The loop is still required. Because the can_divide_by_three() function is used only once, consider simplifying it with a lambda expression:
divide_by_three = lambda x: True if x % 3 == 0 else False

selected_numbers = []
for number in range(1, 11):
    # Use the loop variable "number" consistently
    if divide_by_three(number):
        selected_numbers.append(number)
The above is the traditional way of thinking about programming. The functional way of thinking is completely different. We can look at it like this: to pick the numbers that satisfy a specific rule out of a list, can we focus only on setting the rule and leave the loop to the programming language? Certainly. When programmers only care about the rule (a rule may be a condition, or may be defined by some function), the code becomes much simpler and more readable.
Python provides the filter() function. The syntax is as follows:
filter(function, sequence)
The filter() function calls function(item) on each item in sequence in turn and keeps the items for which the result is True. In Python 2 the kept items come back as a list, string, or tuple (depending on the type of sequence). In Python 3, filter() returns a lazy iterator instead, so wrap it in list() when you need a list. With this function, the code above can be simplified as:
divide_by_three = lambda x: True if x % 3 == 0 else False
# list() materializes the iterator that filter() returns in Python 3
selected_numbers = list(filter(divide_by_three, range(1, 11)))
Putting the lambda expression directly into the statement reduces the code to a single line:
selected_numbers = list(filter(lambda x: x % 3 == 0, range(1, 11)))
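For comparison, here is a small self-contained sketch of the same selection written two ways: with filter() and with a list comprehension, which many Python programmers use for the same purpose. The variable names are chosen only for this illustration.

```python
numbers = range(1, 11)

# filter() is lazy in Python 3; list() collects the matching items.
with_filter = list(filter(lambda x: x % 3 == 0, numbers))

# An equivalent list comprehension: the rule is stated, the loop is implicit.
with_comprehension = [x for x in numbers if x % 3 == 0]

print(with_filter)         # [3, 6, 9]
print(with_comprehension)  # [3, 6, 9]
```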
Back to the topic. The pandas apply() function can act on a Series or on a whole DataFrame. Its job is likewise to traverse the data automatically: Series.apply() runs the specified function on each element, while DataFrame.apply() runs it on each column (or each row) of the DataFrame.
For example, here is a set of data containing students' test results:
Name   Nationality  Score
Zhang  han          400
Li     hui          450
Wang   han          460
If the student's nationality (ethnic group) is not Han, the total score is the test score plus 5 points. We now use pandas to do this calculation by adding a column to the DataFrame. Of course, if the goal were only to get the result, the numpy.where() function would be simpler. The point here is mainly to demonstrate how Series.apply() is used.
import pandas as pd

df = pd.read_csv("studuent-score.csv")
# 5 bonus points for every nationality other than 'han', otherwise 0
df['ExtraScore'] = df['Nationality'].apply(lambda x: 5 if x != 'han' else 0)
df['TotalScore'] = df['Score'] + df['ExtraScore']
For the Nationality column, pandas traverses each value, executes the anonymous lambda function on it, and returns the results in a new Series. In a Jupyter notebook, the code above displays the following result:
   Name   Nationality  Score  ExtraScore  TotalScore
0  Zhang  han          400    0           400
1  Li     hui          450    5           455
2  Wang   han          460    0           460
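For reference, the numpy.where() alternative mentioned earlier might look like the sketch below. It builds the sample data inline instead of reading the CSV file, and the spellings 'han' and 'hui' for the Nationality values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Inline stand-in for the CSV file; the value spellings are assumed.
df = pd.DataFrame({
    "Name": ["Zhang", "Li", "Wang"],
    "Nationality": ["han", "hui", "han"],
    "Score": [400, 450, 460],
})

# Vectorized condition: 5 where Nationality is not 'han', otherwise 0.
df["ExtraScore"] = np.where(df["Nationality"] != "han", 5, 0)
df["TotalScore"] = df["Score"] + df["ExtraScore"]

print(df["TotalScore"].tolist())  # [400, 455, 460]
```

numpy.where() evaluates the condition on the whole column at once, so no Python function is called per value.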
Of course, apply() can also run Python built-in functions. For example, to get the number of characters of each value in the Name column with apply():
df['NameLength'] = df['Name'].apply(len)
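As a side note, pandas also has a vectorized string accessor that gives the same lengths. The sketch below compares the two on a small Series whose values are assumed for this illustration.

```python
import pandas as pd

names = pd.Series(["Zhang", "Li", "Wang"], name="Name")

via_apply = names.apply(len)  # calls the built-in len() on each value
via_str = names.str.len()     # vectorized string method with the same result

print(via_apply.tolist())  # [5, 2, 4]
print(via_str.tolist())    # [5, 2, 4]
```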
According to the pandas documentation (pandas.Series.apply, pandas 1.3.1 documentation), this function can receive positional or keyword arguments. The syntax is as follows:
Series.apply(func, convert_dtype=True, args=(), **kwargs)
As for the func parameter, the first parameter in the function's definition is required because it receives each value of the Series. Any parameters of func beyond the first are treated as additional parameters and must be passed in through args or as keyword arguments. Continuing with the earlier example, suppose that every nationality other than Han gets bonus points, and we make the bonus a parameter of the function. So let's define an add_extra() function:
def add_extra(nationality, extra):
    if nationality != "han":
        return extra
    else:
        return 0
Add a new column to df:
df['ExtraScore'] = df.Nationality.apply(add_extra, args=(5,))
Positional arguments are passed through args=(), whose type is tuple. You can also make the call with a keyword argument:
df['ExtraScore'] = df.Nationality.apply(add_extra, extra=5)
The result after running is:
   Name   Nationality  Score  ExtraScore
0  Zhang  han          400    0
1  Li     hui          450    5
2  Wang   han          460    0
Writing add_extra as a lambda function:
df['Extra'] = df.Nationality.apply(lambda n, extra: extra if n != 'han' else 0, args=(5,))
Let's continue with keyword arguments. Suppose we can give different bonus points to different nationalities. Define an add_extra2() function:
def add_extra2(nationality, **kwargs):
    # Look up the bonus for this nationality among the keyword arguments
    return kwargs[nationality]

df['Extra'] = df.Nationality.apply(add_extra2, han=0, hui=10, zang=5)
The result after running is:
   Name   Nationality  Score  Extra
0  Zhang  han          400    0
1  Li     hui          450    10
2  Wang   han          460    0
Compared against the syntax of the apply() function, this is not hard to understand.
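When the rule is a plain lookup like this, Series.map() with a dictionary is a possible alternative to apply(). This is a different technique from the one shown above, sketched here only for comparison; the dictionary name and the value spellings are assumptions made for this illustration.

```python
import pandas as pd

nationality = pd.Series(["han", "hui", "han"], name="Nationality")

# Hypothetical bonus table with the same role as the keyword arguments.
bonus_by_nationality = {"han": 0, "hui": 10, "zang": 5}

# map() looks each value up in the dictionary.
extra = nationality.map(bonus_by_nationality)

print(extra.tolist())  # [0, 10, 0]
```

One difference worth knowing: a value missing from the dictionary becomes NaN with map(), whereas kwargs[nationality] raises a KeyError.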
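The DataFrame.apply() function passes each column (or each row) of the DataFrame to the specified function as a Series. When the function works element by element, as np.square() does, the effect is that every element is processed. Consider the following example: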
import pandas as pd
import numpy as np
matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]
df = pd.DataFrame(matrix, columns=list('xyz'), index=list('abc'))
df.apply(np.square)
After the square() function is applied to df, all the elements are squared:
    x   y   z
a   1   4   9
b  16  25  36
c  49  64  81
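To make it visible that DataFrame.apply() hands the function a whole column or row rather than a single element, here is a small sketch using aggregating functions on the same matrix. The lambda bodies are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=list("xyz"),
    index=list("abc"),
)

# Default axis=0: the lambda receives one column (a Series) at a time.
column_range = df.apply(lambda col: col.max() - col.min())
print(column_range.tolist())  # [6, 6, 6]

# axis=1: the lambda receives one row (a Series) at a time.
row_sum = df.apply(lambda row: row.sum(), axis=1)
print(row_sum.tolist())  # [6, 15, 24]
```

Because each call returns a single number, the result is a Series with one entry per column (or per row) instead of a DataFrame.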
If you only want apply() to act on specific rows or columns, you can use the name attribute of the row or column to restrict it. For example, the following squares the x column:
df.apply(lambda x : np.square(x) if x.name=='x' else x)
    x  y  z
a   1  2  3
b  16  5  6
c  49  8  9
The following example squares the x and y columns:
df.apply(lambda x : np.square(x) if x.name in ['x', 'y'] else x)
    x   y  z
a   1   4  3
b  16  25  6
c  49  64  9
The following example squares the first row (the row labeled a):
df.apply(lambda x : np.square(x) if x.name == 'a' else x, axis=1)
The default, axis=0, applies the function to each column; axis=1 applies it to each row.
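The row-wise call can be checked with a short self-contained sketch. It rebuilds the same 3x3 DataFrame and prints the values after squaring only the row labeled a.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=list("xyz"),
    index=list("abc"),
)

# With axis=1 each row arrives as a Series whose name is the index label.
result = df.apply(lambda row: np.square(row) if row.name == "a" else row, axis=1)

print(result.values.tolist())  # [[1, 4, 9], [4, 5, 6], [7, 8, 9]]
```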
Date calculations come up often, for example computing the interval between two dates. The following data holds the start and end dates of a set of wbs items:
wbs   date_from   date_to
job1  2019-04-01  2019-05-01
job2  2019-04-07  2019-05-17
job3  2019-05-16  2019-05-31
job4  2019-05-20  2019-06-11
Suppose you want to calculate the number of days between the start and end dates. The simpler way is to subtract the two columns (as datetime type):
import pandas as pd

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
# The parentheses let the subtraction continue on the next line
df['elapsed'] = (df['date_to'].apply(pd.to_datetime)
                 - df['date_from'].apply(pd.to_datetime))
The apply() calls convert the date_from and date_to columns to datetime type. Printing df gives:
   wbs   date_from   date_to     elapsed
0  job1  2019-04-01  2019-05-01  30 days
1  job2  2019-04-07  2019-05-17  40 days
2  job3  2019-05-16  2019-05-31  15 days
3  job4  2019-05-20  2019-06-11  22 days
The date interval has been calculated, but each value carries the unit "days". That is because subtracting two datetime values produces a timedelta64 value. If you want only the number, you also need to convert it with the days attribute of timedelta.
elapsed = (df['date_to'].apply(pd.to_datetime)
           - df['date_from'].apply(pd.to_datetime))
# Keep only the integer number of days from each timedelta
df['elapsed'] = elapsed.apply(lambda x: x.days)
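The same result can probably be reached without apply() at all, since pd.to_datetime() accepts a whole column and a timedelta Series exposes its day counts through the .dt accessor. The following is a sketch of that variant on the same sample data, not a replacement for the approach described above.

```python
import pandas as pd

df = pd.DataFrame({
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"],
})

# Convert each column in one call, then subtract column from column.
elapsed = pd.to_datetime(df["date_to"]) - pd.to_datetime(df["date_from"])

# .dt.days extracts the whole-day component of every timedelta.
df["elapsed"] = elapsed.dt.days

print(df["elapsed"].tolist())  # [30, 40, 15, 22]
```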
To do the same thing with the DataFrame.apply() function, we first need to define a function, get_interval_days(). The first parameter of the function is a variable of type Series; when the function is executed, it receives each row of the DataFrame in turn.
import pandas as pd
import datetime as dt

def get_interval_days(arrLike, start, end):
    # arrLike is one row of the DataFrame; start and end are column names
    start_date = dt.datetime.strptime(arrLike[start], '%Y-%m-%d')
    end_date = dt.datetime.strptime(arrLike[end], '%Y-%m-%d')
    return (end_date - start_date).days

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
df['elapsed'] = df.apply(
    get_interval_days, axis=1, args=('date_from', 'date_to'))
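Just as with Series.apply(), the extra parameters can also be given to DataFrame.apply() as keyword arguments instead of through args. The sketch below repeats the example in that form; the parameter name row is a renaming made for this illustration.

```python
import datetime as dt

import pandas as pd


def get_interval_days(row, start, end):
    """Return the number of days between the dates in row[start] and row[end]."""
    start_date = dt.datetime.strptime(row[start], "%Y-%m-%d")
    end_date = dt.datetime.strptime(row[end], "%Y-%m-%d")
    return (end_date - start_date).days


df = pd.DataFrame({
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"],
})

# Keyword arguments are forwarded to get_interval_days() for every row.
df["elapsed"] = df.apply(
    get_interval_days, axis=1, start="date_from", end="date_to")

print(df["elapsed"].tolist())  # [30, 40, 15, 22]
```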
References: "The pandas apply function: the most useful function in pandas"; pandas.Series.apply, pandas 1.3.1 documentation
Publisher: Full stack programmer stack length. When reprinting, please credit the source: https://javaforall.cn/152100.html Link to the original text: https://javaforall.cn