Hello everyone !
Whether you admit it or not , Data cleaning is really not a simple task , In most cases, this work is very time-consuming and tedious , But it is very important .
If you've been through data cleansing , You'll see what I mean . And that's the purpose of writing this article —— Make it easier for readers to clean data .
in fact , I realized not long ago , During data cleaning , Some data have similar patterns . And it was from then on , I began to sort out and compile some data cleaning code ( See below ), I think this code can also be used in other common scenarios .
Because these common scenarios involve different types of data sets , Therefore, this article focuses more on showing and explaining what these codes can be used to do , So that readers can use them more conveniently .
Data cleaning toolbox
In the following code snippet , The data cleaning code is encapsulated in some functions , The purpose of the code is very intuitive . You can use the code directly , There is no need to embed them in functions that require a small amount of parameter modification .
1. Delete multiple columns of data
def drop_multiple_col(col_names_list, df): ''' AIM -> Drop multiple columns based on their column names INPUT -> List of column names, df OUTPUT -> updated df with dropped columns ------ ''' df.drop(col_names_list, axis=1, inplace=True) return df
Sometimes , Not all columns of data are useful for our data analysis . therefore ,「df.drop」 You can easily delete the column you selected .
2. transformation Dtypes
def change_dtypes(col_int, col_float, df): ''' AIM -> Changing dtypes to save memory INPUT -> List of column names (int, float), df OUTPUT -> updated df with smaller memory ------ ''' df[col_int] = df[col_int].astype('int32') df[col_float] = df[col_float].astype('float32')
When we face larger data sets , We need to be right about 「dtypes」 convert , To save memory .
3. Convert a classification variable to a numeric variable
def convert_cat2num(df): # Convert categorical variable to numerical variable num_encode = {'col_1' : {'YES':1, 'NO':0}, 'col_2' : {'WON':1, 'LOSE':0, 'DRAW':0}} df.replace(num_encode, inplace=True)
Some machine learning models require variables to exist in numerical form . At this time , We need to convert classified variables into numerical variables, and then use them as the input of the model . For data visualization tasks , I suggest you keep categorical variables , So that the visualization results can be explained more clearly , Easy to understand .
4. Check for missing data
def check_missing_data(df): # check for any missing data in the df (display in descending order) return df.isnull().sum().sort_values(ascending=False)
If you want to check how much data is missing in each column , This is probably the fastest way . This method allows you to know more clearly which columns have more missing data , Help you decide what action to take next in data cleaning and data analysis .
5. Delete the string in the column
def remove_col_str(df): # remove a portion of string in a dataframe column - col_1 df['col_1'].replace('\n', '', regex=True, inplace=True) # remove all the characters after &# (including &#) for column - col_1 df['col_1'].replace(' &#.*', '', regex=True, inplace=True)
Sometimes you may see a new line of characters , Or see some strange symbols in the string column . You can use it easily df['col_1'].replace To deal with the problem , among 「col_1」 It's data frames df One of the columns .
6. Delete spaces in columns
def remove_col_white_space(df): # remove white space at the beginning of string df[col] = df[col].str.lstrip()
When the data is chaotic , A lot of unexpected things happen . It is common to have some spaces at the beginning of a string . therefore , When you want to delete the space at the beginning of the string in the column , This method is very practical .
7. Put two columns of string data ( under certain conditions ) Splice up
def concat_col_str_condition(df): # concat 2 columns with strings if the last 3 letters of the first column are 'pil' mask = df['col_1'].str.endswith('pil', na=False) col_new = df[mask]['col_1'] + df[mask]['col_2'] col_new.replace('pil', ' ', regex=True, inplace=True) # replace the 'pil' with emtpy space
When you want to combine two columns of string data under certain conditions , This method is very useful . for example , You want when the first column ends with some specific letter , Splice the first and second columns of data together . According to your needs , You can also delete the ending letter after the splicing work is completed .
8. Convert timestamps ( Convert from string type to date 「DateTime」 Format )
def convert_str_datetime(df): ''' AIM -> Convert datetime(String) to datetime(format we want) INPUT -> df OUTPUT -> updated df with new datetime format ------ ''' df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
When processing time series data , You may encounter timestamp columns in string format . This means that we may have to convert the data in string format to the date specified according to our needs 「datetime」 Format , In order to use these data for meaningful analysis and display .