您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

8 Python data cleaning codes, ready to use

編輯：Python

Hello everyone ！

Whether you admit it or not , Data cleaning is really not a simple task , In most cases, this work is very time-consuming and tedious , But it is very important .

If you've been through data cleansing , You'll see what I mean . And that's the purpose of writing this article —— Make it easier for readers to clean data .

in fact , I realized not long ago , During data cleaning , Some data have similar patterns . And it was from then on , I began to sort out and compile some data cleaning code （ See below ）, I think this code can also be used in other common scenarios .

Because these common scenarios involve different types of data sets , Therefore, this article focuses more on showing and explaining what these codes can be used to do , So that readers can use them more conveniently .

Data cleaning toolbox

In the following code snippet , The data cleaning code is encapsulated in some functions , The purpose of the code is very intuitive . You can use the code directly , There is no need to embed them in functions that require a small amount of parameter modification .

1. Delete multiple columns of data

def drop_multiple_col(col_names_list, df): 
    '''
    AIM    -> Drop multiple columns based on their column names 
    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns 
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df

Sometimes , Not all columns of data are useful for our data analysis . therefore ,「df.drop」 You can easily delete the column you selected .

2. transformation Dtypes

def change_dtypes(col_int, col_float, df): 
    '''
    AIM    -> Changing dtypes to save memory
    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory  
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')

When we face larger data sets , We need to be right about 「dtypes」 convert , To save memory .

3. Convert a classification variable to a numeric variable

def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {'col_1' : {'YES':1, 'NO':0},
                  'col_2'  : {'WON':1, 'LOSE':0, 'DRAW':0}}  
    df.replace(num_encode, inplace=True)

Some machine learning models require variables to exist in numerical form . At this time , We need to convert classified variables into numerical variables, and then use them as the input of the model . For data visualization tasks , I suggest you keep categorical variables , So that the visualization results can be explained more clearly , Easy to understand .

4. Check for missing data

def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)

If you want to check how much data is missing in each column , This is probably the fastest way . This method allows you to know more clearly which columns have more missing data , Help you decide what action to take next in data cleaning and data analysis .

5. Delete the string in the column

def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'].replace('\n', '', regex=True, inplace=True)
    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'].replace(' &#.*', '', regex=True, inplace=True)

Sometimes you may see a new line of characters , Or see some strange symbols in the string column . You can use it easily df['col_1'].replace To deal with the problem , among 「col_1」 It's data frames df One of the columns .

6. Delete spaces in columns

def remove_col_white_space(df):
    # remove white space at the beginning of string 
    df[col] = df[col].str.lstrip()

When the data is chaotic , A lot of unexpected things happen . It is common to have some spaces at the beginning of a string . therefore , When you want to delete the space at the beginning of the string in the column , This method is very practical .

7. Put two columns of string data （ under certain conditions ） Splice up

def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    col_new.replace('pil', ' ', regex=True, inplace=True)  # replace the 'pil' with emtpy space

When you want to combine two columns of string data under certain conditions , This method is very useful . for example , You want when the first column ends with some specific letter , Splice the first and second columns of data together . According to your needs , You can also delete the ending letter after the splicing work is completed .

8. Convert timestamps （ Convert from string type to date 「DateTime」 Format ）

def convert_str_datetime(df): 
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)
    INPUT  -> df
    OUTPUT -> updated df with new datetime format 
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

When processing time series data , You may encounter timestamp columns in string format . This means that we may have to convert the data in string format to the date specified according to our needs 「datetime」 Format , In order to use these data for meaningful analysis and display .