您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Save pandas plan (19) -- calculate the similarity of two columns using a custom method

編輯：Python

save pandas plan （19）—— Use a custom method to calculate the similarity of two columns

Recently, I found that many friends around me are not happy to use pandas, Switch to other data operation Libraries , As a data worker , Basically open your mouth pandas, Closed mouth pandas 了 , So I wrote this series to make more friends fall in love with pandas.

Series article description ：

Series name （ Serial number of series articles ）—— This series of articles specifically address the needs

platform ：

windows 10
python 3.8
pandas >=1.2.4

/ Data requirement

Recently, I came across a very interesting... While surfing the Internet pandas Example training topic , The title has been published for a long time , Someone has already answered this question , say ： There's a set of data , Each column records each index of the sample , To calculate each index of each data , Find the corresponding samples with the smallest sum of the absolute value of the index difference between each sample , Such as ：a sample ：[10.0, 9.0, 7.5, 8.6],b sample ：[11.2, 8.7, 6.4, 5.5] The similarity between them is calculated as ：|10.0-11.2|+|9.0-8.7|+|7.5-6.4|+|8.6-5.5|.

The absolute value is subtracted from the corresponding position and then added , Calculate the remaining samples for each sample , Get the least similarity of all samples 5 Other samples .

As described above , Construct a set of random data ：

import numpy as np
import pandas as pd
np.random.seed(2022)
data = np.clip(np.random.normal(loc=70, scale=15, size=600).round(2), 0, 100).reshape(60, 10)
df = pd.DataFrame(data, columns=list('abcdefghij'), index=[f' test _{i}' for i in range(1, 61)])

Data size is ：60*10, Floating point type .

/ Demand processing

Method 1 ： The cartesian product

First, calculate the cross data of each sample , Then group to get the top in the group 5 Data .

This method is the method of the solver at the bottom of the problem , The way it uses Cartesian products is very interesting .

df = df.reset_index().rename(columns={'index': ' name '})  
#  Set auxiliary columns
df['one'] = 1
# merge Oneself , With one List as key Parameter merge data frame
df_merge = pd.merge(left=df, right=df, left_on='one', right_on='one')

The original data size is 60*10, After the product is 3600 That's ok , The memory usage is much larger than that before processing , If the data is too large , In terms of processing , The time consumption will increase significantly .

Let's not consider the impact of data volume , To continue processing .

#  Remove the name column and one Column
columns = list(df.columns)
columns.remove(' name ')
columns.remove('one')
def sim_fun(row):
    sim_value = 0.0
    for col in columns:
        sim_value += abs(round(row[col+'_x'] - row[col+'_y'], 2))
    return round(sim_value, 2)
#  Find the similarity between each sample
df_merge['sim'] = df_merge.apply(sim_fun, axis=1)
df_merge = df_merge[df_merge[' name _x'] != df_merge[' name _y']].copy()
#  Obtain the lowest value of each sample and its similarity 5 Samples
def get_top_sims(df_sub):
    df_sort = df_sub.sort_values('sim').head(5)
    names = ','.join(df_sort[' name _y'])
    sims = ','.join(df_sort['sim'].astype(str))
    return pd.Series({'names': names, 'sims': sims})
df_result = df_merge.groupby(' name _x').apply(get_top_sims)

Answer the code of the boss by repeating the above , Successfully returned the data frame with the top five similarity . You can see in the 60 The Cartesian product is used in the data , It is acceptable to spend time .

Method 2 ： Directly subtract each row of data from the original data , utilize pandas Characteristic summation

In method 1, it can be noted that the data processing is slightly cumbersome , Expand the data capacity for further processing , And the difference of data is calculated one by one . Understand the data carefully , You can use the broadcast feature of the array to process .

def get_sims(s):
    #  Be careful ： Here df Is a global variable
    df_sim = (df - s).abs().sum(axis=1).round(2)
    df_sim.pop(s.name)
    df_sim = df_sim.sort_values(ascending=False).head(5).astype(str)
    return pd.Series({'names': ','.join(df_sim.index), 'sims': ','.join(df_sim.values)})
# apply Traverse the data by row , Make other rows calculate data with this row
df_result = df.apply(get_sims, axis=1)

The function in method 1 is optimized to get_sims function , It also greatly improves the computational efficiency .

/ summary

By understanding the purpose of the topic and the ideas of other solvers , According to what you know pandas Skill to analyze and deal with the problem , Reduce the redundancy of data calculation to a certain extent , Meet data needs , If you do not understand this article , Do your best to express your ideas .

It is much joyful to share the joy than enjoy alone. .

Made on June 21, 2002