Most surveys show that data scientists and data analysts spend 70-80% of their time cleaning and preparing data for analysis.
For many data workers, cleaning and preparing data is the least favorite part of the job, so they spend the other 20-30% of their time complaining about it. That is a joke, of course, but it captures the special place data cleaning holds in data analysis.
In everyday work, data always contains inconsistencies, missing entries, irrelevant information, duplicates, or outright errors. This is especially true when data comes from different sources, each with its own quirks, challenges, and irregularities. Messy data is useless and can even be counterproductive, which is why data scientists spend most of their time making sense of it all.
Cleaning and preparing data is tedious and laborious, but the cleaner and better organized our data is, the faster, easier, and more efficient everything afterwards becomes.
This article shares a selection of 15 of the most useful Python libraries for data cleaning. I hope they make your data analysis journey a little easier!
NumPy is a fast, easy-to-use open source Python library for scientific computing. It is also a foundational library of the data science ecosystem, because Pandas, Matplotlib, and many other popular Python libraries are built on top of it.
Besides serving as the foundation for other powerful libraries, NumPy has many features of its own that make it an integral part of data analysis in Python. Thanks to its speed and versatility, NumPy's concepts of vectorization, indexing, and broadcasting have become the de facto standard for array computing, and NumPy is especially good at handling multidimensional arrays. It also provides a comprehensive numerical toolbox, including linear algebra routines, Fourier transforms, and more.
NumPy can do many things for many people. Its high-level syntax lets programmers of any background or experience level use its powerful data processing capabilities. For example, NumPy was used to produce the first-ever image of a black hole and to confirm the existence of gravitational waves, and it continues to play an important role in all kinds of scientific research.
A library that covers everything from sports to space can also help us manage and clean data. NumPy really is amazing.
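To make the ideas of vectorization, indexing, and broadcasting a little more concrete, here is a minimal sketch of a cleaning step on a small array. The readings are made up for illustration, and filling gaps with the column mean is just one possible choice.

```python
import numpy as np

# Hypothetical readings: rows are samples, columns are features.
# np.nan marks a missing value.
readings = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 220.0],
    [4.0, 240.0],
])

# Vectorized column means that ignore missing values.
col_means = np.nanmean(readings, axis=0)

# Boolean mask plus broadcasting: each NaN is replaced by its column mean.
missing = np.isnan(readings)
cleaned = np.where(missing, col_means, readings)

# Broadcasting again: center every column on its mean in one expression.
centered = cleaned - cleaned.mean(axis=0)

print(cleaned)
```

No explicit Python loop is needed: the one-dimensional `col_means` array is stretched across every row automatically.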
Pandas is a library built on NumPy, and it is the most widely used data analysis and manipulation library in Python.
Pandas is fast and easy to use, and its syntax is very user-friendly. Combined with its incredible flexibility in manipulating DataFrames, this makes it an indispensable tool for analyzing, manipulating, and cleaning data.
This powerful Python library handles not only numeric data but also text and date data. It lets us join, merge, concatenate, or copy DataFrames, add columns or rows easily, and remove them with the drop() function.
In short, Pandas combines speed, ease of use, and flexible functionality into a very powerful tool that makes data manipulation and analysis fast and simple.
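As a small illustration of what this looks like in practice, the sketch below cleans a made-up DataFrame with a duplicate row, a missing value, and numbers and dates stored as text. The column names and values are hypothetical, and the order of the steps is only one reasonable choice.

```python
import pandas as pd

# Hypothetical messy records: a duplicate row, a missing score,
# numbers stored as text, and dates stored as strings.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "joined": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-15"],
    "score": ["10", "12", "12", None],
    "notes": ["", "", "", ""],
})

clean = (
    df.drop_duplicates()                  # remove the repeated "Bob" row
      .drop(columns=["notes"])            # drop() removes an irrelevant column
      .assign(
          joined=lambda d: pd.to_datetime(d["joined"]),  # text -> datetime
          score=lambda d: pd.to_numeric(d["score"]),     # text -> number
      )
      .dropna(subset=["score"])           # remove rows that have no score
      .reset_index(drop=True)
)

print(clean)
```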
Understanding our data is a key part of the cleaning process, and the whole purpose of cleaning data is to make it easy to understand. But before we have nice, clean data, we first need to understand the problems in the messy data, such as their type and extent, so that we can clean it effectively. A large part of this depends on presenting the data accurately and visually.
Matplotlib is known for its impressive data visualizations, which makes it a valuable tool for data cleaning. It is the go-to library for generating graphs, charts, and other 2D data visualizations in Python.
In data cleaning, we can use Matplotlib to generate distribution plots that help us see where the data falls short.
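As a rough sketch of such a distribution plot, the code below draws a histogram of a made-up "age" column into which a few impossible values have been injected. The data, the bin count, and the `Agg` backend (chosen only so the script runs without a display) are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical age column: mostly plausible values plus a few bad entries.
rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(35, 8, 500), [-1, 250, 999]])

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages, bins=40)
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Distribution of a hypothetical age column")

# The x-axis stretches far beyond any realistic age, which is a visual
# hint that a few extreme values need attention.
print("largest bin edge:", bin_edges[-1])
plt.close(fig)
```

In an interactive session you would typically call `plt.show()` or `fig.savefig(...)` instead of closing the figure.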
Datacleaner is a third-party library that works on Pandas DataFrames. Although Datacleaner is relatively new and not as popular as Pandas, it takes a distinctive approach: it bundles and automates several typical data cleaning functions, which saves us valuable time and energy.
With Datacleaner, we can easily replace missing values with the mode or median on a column-by-column basis, encode categorical variables, and drop rows with missing values.
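To show what these steps amount to, here is a sketch that performs them by hand with plain Pandas. It does not use Datacleaner's own API; the DataFrame is made up, and the rule "median for numeric columns, mode for everything else" is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Oslo", "Rome", None, "Rome"],
})

filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: replace missing values with the median.
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Non-numeric column: replace missing values with the mode ...
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
        # ... then encode the categories as integer codes.
        filled[col] = filled[col].astype("category").cat.codes

# Alternative strategy: simply drop every row that has a missing value.
dropped = df.dropna()

print(filled)
```

A library that automates this saves us from rewriting the same loop for every dataset.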
Dora uses Scikit-learn, Pandas, and Matplotlib for exploratory analysis or, more specifically, for automating the least enjoyable parts of exploratory analysis. In addition to handling feature selection, extraction, and visualization, Dora also streamlines and automates data cleaning.
Dora saves us valuable time and energy with a number of data cleaning functions, such as imputing missing values, reading data with missing or poorly scaled values, and scaling the values of input variables.
In addition, Dora provides a simple interface for saving snapshots of the data as we transform it, and its data versioning feature sets it apart from other Python packages.
Earlier, we discussed the importance of visualizing data to reveal defects and inconsistencies. Before we can fix the problems in our data, we need to know what they are and where they are, and data visualization is the best way to find out. For many Python users, Matplotlib is the go-to library for data visualization, but some find it limiting when it comes to customizing visualization options. That is where Seaborn comes in.
Seaborn is a data visualization package built on top of Matplotlib. It produces attractive and informative statistical graphics while offering customizable visualizations.
It also works more efficiently with Pandas DataFrames and integrates more closely with Pandas, which makes exploratory analysis and data cleaning more enjoyable.
An important aspect of improving data quality is creating uniformity and consistency across the entire DataFrame. For Python developers trying to create that uniformity when dealing with dates and times, the process can be difficult. Often, after countless hours and countless lines of code, the particular difficulties of date and time formatting still remain.
Arrow is a Python library designed to handle these difficulties and create data consistency. Its time-saving features include time zone conversion; automatic string formatting and parsing; support for pytz, dateutil, and ZoneInfo tzinfo objects; and generation of ranges, floors, ceilings, and time spans for time frames ranging from microseconds to years.
Arrow is timezone-aware (unlike the standard library's datetime, which is naive by default) and uses UTC by default. It gives users more capable date and time manipulation with less code and fewer imports. This means we can bring greater consistency to our data while spending less time watching the clock.
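The following sketch shows a few of these features on a single, made-up timestamp: parsing, time zone conversion, formatting, and floor/ceiling of a day. The timestamp and the target time zone are arbitrary choices for illustration.

```python
import arrow

# Parse an ISO 8601 string; the result is timezone-aware (UTC here).
utc = arrow.get("2023-05-11T21:23:58+00:00")

# Convert to another time zone without changing the instant in time.
tokyo = utc.to("Asia/Tokyo")

# Format using Arrow's token syntax.
label = tokyo.format("YYYY-MM-DD HH:mm")

# Floor and ceiling of a time frame, and a shifted timestamp.
start_of_day = utc.floor("day")
end_of_day = utc.ceil("day")
next_week = utc.shift(weeks=+1)

print(label)
print(start_of_day, end_of_day, next_week)
```

Normalizing every timestamp to one zone and one format in this way is a simple route to consistent date columns.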
Scrubadub is a favorite of data scientists working with financial and medical data. It is a Python library designed to remove personally identifiable information (PII) from free text.
This simple, free, open source package makes it easy to remove sensitive personal information from our data, protecting the privacy and security of the people involved.
Scrubadub currently lets users scrub several kinds of personal information from their data.
With a single function call, Tabulate can turn our data into small, attractive tables. Thanks to features such as number formatting, headers, and column alignment by decimal point, these tables are highly readable.
This open source library also lets users work with tabular data in other tools and languages, by outputting data in other formats they are comfortable with, such as HTML, PHP, or Markdown Extra.
Handling missing values is one of the main aspects of data cleaning, and that is what the Missingno library was made for. It identifies and visualizes missing values in a DataFrame column by column, so users can see the state of their data.
Visualizing a problem is the first step to solving it, and Missingno is an easy-to-use library that does this job well.
As mentioned above, Pandas is already a fast library, but Modin takes Pandas to a whole new level. Modin improves Pandas performance by distributing the data and the computation.
Modin users benefit from its seamless compatibility with Pandas syntax and its unobtrusive integration, and it can reportedly speed up Pandas by up to 400%!
Ftfy was born for one simple task: turning bad Unicode and useless characters into relevant, readable text.
For example:
“quoteâ€\x9d = “quote”
Ã¼ = ü
&lt;3 = <3
Instead of spending a lot of time wrangling text data, you can use Ftfy to quickly make sense of the nonsense.
SciPy is not just a library; it is also an entire data science ecosystem.
The SciPy ecosystem also includes many specialized tools, one of which is Scikit-learn, whose "preprocessing" package is well suited to data cleaning and dataset standardization.
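As a small sketch of cleaning and standardizing with Scikit-learn, the code below fills a missing value and then standardizes each column. The array is made up; note that `SimpleImputer` lives in `sklearn.impute` rather than in the preprocessing package, and mean imputation is just one possible strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with one missing value.
X = np.array([
    [1.0, 100.0],
    [2.0, np.nan],
    [3.0, 300.0],
])

# Replace the missing value with the mean of its column.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_filled)

print(X_filled)
print(X_scaled)
```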
A core engineer of the scikit-learn project developed Dabl as a data analysis library that simplifies data exploration and preprocessing.
Dabl has a complete pipeline that detects certain data types and quality problems in a dataset and automatically applies the appropriate preprocessing.
It can handle missing values, convert categorical variables to numeric values, and even has built-in visualization options for quick data exploration.
The last library on our list is Imbalanced-learn (imblearn for short). It relies on Scikit-learn and provides tools for Python users facing classification problems with imbalanced classes.
Using a preprocessing technique called undersampling, imblearn reduces the number of samples in the over-represented classes so that the classes are more evenly balanced. Some of its undersampling methods also clean the dataset by removing noisy or borderline samples.
Our data analysis models are only as good as the data we feed them, and the cleaner our data is, the simpler it is to process, analyze, and visualize. Making good use of the right tools makes our work easier and more pleasant.
The tools summarized above may not cover every data cleaning tool out there, but we only need to pick the ones that suit us. I hope today's article helps you!