The most useful Python libraries for data cleaning in 2021

Editor: Python

Most surveys show that data scientists and data analysts spend 70-80% of their time cleaning and preparing data for analysis.

For many data workers, cleaning and preparing data is the least favorite part of the job, so they spend the other 20-30% of their time complaining about it. That is a joke, but it captures the special place data cleaning holds in data analysis.

In everyday work, data always contains inconsistencies, missing entries, irrelevant information, duplicates, or outright errors. This is especially true when the data comes from different sources, each with its own quirks, challenges, and irregularities. Messy data is useless and can even be misleading, which is why data scientists spend most of their time making sense of it.

Cleaning and preparing data is tedious and laborious, but the cleaner and better organized our data is, the faster, easier, and more efficient all later work becomes.

This article presents 15 of the most useful Python libraries for data cleaning. I hope they make your data analysis work a little easier!

NumPy
Pandas
Matplotlib
Datacleaner
Dora
Seaborn
Arrow
Scrubadub
Tabulate
Missingno
Modin
Ftfy
SciPy
Dabl
Imblearn

NumPy

NumPy is a fast, easy-to-use open source Python library for scientific computing. It is also the foundation of the data science ecosystem, because Pandas, Matplotlib, and many other popular Python libraries are built on top of it.

Besides serving as the basis for other powerful libraries, NumPy has many features of its own that make it an integral part of data analysis in Python. Thanks to its speed and versatility, NumPy's concepts of vectorization, indexing, and broadcasting have become the de facto standard for array computing, and NumPy is especially good at handling multidimensional arrays. It also provides a comprehensive numerical toolbox, including linear algebra routines and Fourier transforms.

NumPy can do many things for many people. Its high-level syntax lets programmers of any background or experience level use its powerful data processing capabilities. For example, NumPy was used to generate the first image of a black hole and to confirm the existence of gravitational waves, and it continues to play an important role in many kinds of scientific research.

A library that is used for everything from sports to space can also help us manage and clean data, which makes NumPy a remarkable tool.
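
As a rough illustration of how vectorization, indexing, and broadcasting help with cleaning, the sketch below fills missing readings with the mean of their column. The array values and variable names are made up for this example.

```python
import numpy as np

# Hypothetical sensor readings; np.nan marks a missing value.
readings = np.array([
    [1.0, 200.0],
    [np.nan, 220.0],
    [3.0, np.nan],
    [5.0, 240.0],
])

# Column means computed while ignoring missing values (vectorized).
col_means = np.nanmean(readings, axis=0)

# Boolean mask of the missing cells (same shape as the array).
missing = np.isnan(readings)

# Broadcasting stretches the 1-D col_means across every row,
# so each missing cell receives the mean of its own column.
cleaned = np.where(missing, col_means, readings)
```

No explicit Python loop is needed: the mask, the means, and the replacement are all whole-array operations.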

Pandas

Pandas is a library built on NumPy, and it is the most widely used data analysis and manipulation library in Python.

Pandas is fast and easy to use. Its syntax is very user-friendly, and combined with its remarkable flexibility in working with DataFrames, this makes it an indispensable tool for analyzing, manipulating, and cleaning data.

This powerful library handles not only numeric data but also text and date data. It lets us join, merge, concatenate, or copy DataFrames, and easily delete columns or rows with the drop() function.

In short, Pandas combines speed, ease of use, and flexible functionality into a very powerful tool that makes data manipulation and analysis fast and simple.
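
A minimal sketch of a typical cleaning pass, assuming a small made-up table with an empty column, a duplicated row, untrimmed text, and numbers and dates stored as strings. The column names are invented for the example.

```python
import pandas as pd

# Hypothetical raw data with common defects.
raw = pd.DataFrame({
    "name": [" Alice", "Bob", "Bob", "Carol "],
    "age": ["24", "19", "19", "31"],
    "joined": ["2021-01-05", "2021-02-17", "2021-02-17", "2021-03-01"],
    "notes": ["", "", "", ""],
})

cleaned = (
    raw.drop(columns=["notes"])        # drop() removes the useless column
       .drop_duplicates()              # remove the repeated "Bob" row
       .assign(
           name=lambda d: d["name"].str.strip(),          # text data
           age=lambda d: pd.to_numeric(d["age"]),         # numeric data
           joined=lambda d: pd.to_datetime(d["joined"]),  # date data
       )
       .reset_index(drop=True)
)
```

Chaining the steps keeps the original DataFrame untouched, which makes it easy to compare the data before and after cleaning.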

Matplotlib

Understanding our data is a key part of the cleaning process, and the purpose of cleaning data is to make it easy to understand. But before we have clean, tidy data, we first need to understand the problems in the messy data, such as their types and extent. Only then can we clean it effectively, and a large part of this work depends on an accurate visual presentation of the data.

Matplotlib is known for its impressive data visualizations, which makes it a valuable tool for data cleaning. It is the go-to library for generating graphs, charts, and other 2D data visualizations in Python.

In data cleaning, we can use Matplotlib to generate distribution plots that help us see where the data falls short.
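
For instance, a histogram can expose implausible values at a glance. The sketch below uses synthetic ages with two deliberately wrong entries; the data and labels are invented, and the "Agg" backend is selected only so the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages plus two clearly implausible values (250 and 300).
rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(35, 8, 500), [250, 300]])

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages, bins=30)
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Distribution of the age column")
# fig.savefig("age_distribution.png") would write the plot to disk.
```

In the resulting plot almost all of the mass sits in the first few bins, and the long empty stretch up to 300 points directly at the bad entries.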

Datacleaner

Datacleaner is a third-party library built around Pandas DataFrames. Although Datacleaner has been around for less time and is less popular than Pandas, it takes a unique approach: it combines and automates several typical data cleaning functions, which saves us valuable time and energy.

With Datacleaner, we can easily replace missing values with the mode or median on a column-by-column basis, encode categorical variables, and drop rows with missing values.
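
To make the idea concrete, here is a pandas-only sketch of the kind of column-by-column cleaning described above. It is not Datacleaner's own code or API; the helper name simple_autoclean and the sample data are hypothetical.

```python
import pandas as pd

def simple_autoclean(df):
    """Hypothetical helper: fill numeric columns with the median,
    fill other columns with the mode, then encode them as integers."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            # Integer codes follow the sorted order of the categories.
            out[col] = out[col].astype("category").cat.codes
    return out

df = pd.DataFrame({
    "height": [1.6, None, 1.8, 1.7],
    "color": ["red", "blue", None, "red"],
})
cleaned = simple_autoclean(df)
```

A library that automates these steps mainly saves us from rewriting this loop for every new dataset.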

Dora

The Dora library uses Scikit-learn, Pandas, and Matplotlib for exploratory analysis, or more specifically, to automate the least pleasant parts of exploratory analysis. In addition to handling feature selection, extraction, and visualization, Dora also streamlines and automates data cleaning.

Dora saves us valuable time and energy with its many data cleaning functions, such as imputing missing values, reading data with missing and poorly scaled values, and scaling the values of input variables.

Dora also provides a simple interface for saving snapshots of the data as we transform it, and its data versioning feature sets it apart from other Python packages.

Seaborn

Earlier we discussed the importance of visualizing data to reveal defects and inconsistencies. Before we can fix the problems in our data, we need to know what they are and where they are, and data visualization is the best way to find out. Although Matplotlib is the go-to visualization library for many Python users, some find its options for customizing visualizations limited. That is where Seaborn comes in.

Seaborn is a data visualization package built on top of Matplotlib. It generates attractive and informative statistical graphics while offering customizable visualizations.

It also works more efficiently with Pandas DataFrames and integrates more closely with Pandas, which makes exploratory analysis and data cleaning more pleasant.
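
As a small sketch of that Pandas integration, Seaborn plotting functions accept a DataFrame and column names directly. The example below draws a box plot per group, which is a common way to spot outliers; the table and its column names are invented, and the "Agg" backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import pandas as pd
import seaborn as sns

# Hypothetical prices for two cities; 95 is a suspicious entry.
df = pd.DataFrame({
    "city": ["A"] * 5 + ["B"] * 5,
    "price": [10, 11, 12, 11, 10, 12, 13, 11, 12, 95],
})

# Column names are passed as strings; Seaborn reads them from the DataFrame
# and uses them as axis labels.
ax = sns.boxplot(data=df, x="city", y="price")
```

The value 95 is drawn as an isolated point far above the box for city B, which marks it as a candidate for review.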

Arrow

An important part of improving data quality is creating uniformity and consistency across the whole DataFrame. For Python developers trying to achieve that uniformity with dates and times, this can be a difficult process. Even after countless hours and countless lines of code, the particular difficulties of date and time formatting often remain.

Arrow is a Python library designed to deal with these difficulties and create consistent data. Its time-saving features include time zone conversion; automatic string formatting and parsing; support for pytz, dateutil objects, and ZoneInfo tzinfo; and generation of ranges, floors, ceilings, and time spans for time frames from microseconds to years.

Arrow is time zone aware (unlike the standard Python library, whose datetime objects are naive by default) and defaults to UTC. It lets users handle dates and times with less code and less typing. This means we can bring greater consistency to our data while spending less time on it.
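
A minimal sketch of using Arrow to bring differently formatted timestamps into one representation. The helper to_utc_iso, the sample strings, and their assumed source time zones are made up for illustration.

```python
import arrow

def to_utc_iso(text, fmt, zone="UTC"):
    """Hypothetical helper: parse `text` with an Arrow format string,
    interpret it as local time in `zone`, and return ISO 8601 in UTC."""
    parsed = arrow.get(text, fmt)           # parsed as UTC by default
    local = parsed.replace(tzinfo=zone)     # reinterpret in the source zone
    return local.to("UTC").isoformat()      # convert and format

# Two records of the same instant, written in different conventions.
raw = [
    ("2021-03-04 12:30:00", "YYYY-MM-DD HH:mm:ss", "UTC"),
    ("04/03/2021 13:30", "DD/MM/YYYY HH:mm", "Europe/Berlin"),
]
cleaned = [to_utc_iso(text, fmt, zone) for text, fmt, zone in raw]
```

After normalization both entries compare equal as plain strings, which makes duplicates and ordering problems much easier to detect.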

Scrubadub

Scrubadub is a favorite among financial and medical data scientists. It is a Python library designed to remove personally identifiable information (PII) from free text.

This simple, free, and open source package makes it easy to remove sensitive personal information from our data, protecting the privacy and security of the people involved.

Scrubadub currently lets users scrub the following information from their data:

Email addresses
URLs
Names
Skype usernames
Phone numbers
Password/username combinations
Social security numbers

Tabulate

With a single function call, Tabulate turns our data into small, attractive tables. Thanks to features such as number formatting, headers, and column alignment by decimal point, these tables are highly readable.

This open source library also lets users work with tabular data in other tools and languages, because it can output the data in other formats such as HTML or PHP Markdown Extra.

Missingno

Dealing with missing values is one of the main aspects of data cleaning, and the Missingno library was created for exactly that. It identifies and visualizes the missing values in a DataFrame column by column, so users can see the state of their data.

Visualizing a problem is the first step toward solving it, and Missingno is an easy-to-use library that does this job well.
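
The numbers behind such a visualization can be sketched with Pandas alone. The example below computes a column-by-column missing-value report for an invented table; the trailing comment shows how the same DataFrame would be handed to Missingno.

```python
import pandas as pd

# Hypothetical table with gaps in two columns.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.org", None, None, "d@x.org"],
    "age": [30, None, 25, 41],
})

# Count and share of missing values per column.
missing_report = pd.DataFrame({
    "missing": df.isna().sum(),
    "share": df.isna().mean(),
})

# With Missingno installed, the same DataFrame can be drawn directly:
#   import missingno as msno
#   msno.matrix(df)
```

The report answers "how much is missing", while the matrix plot additionally shows where in the rows the gaps occur.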

Modin

As mentioned above, Pandas is already a fast library, but Modin takes it to a whole new level. Modin improves Pandas performance by distributing the data and the computation.

Modin users benefit from its close fit with Pandas syntax and its unobtrusive integration, and it can speed up Pandas by up to 400%!

Ftfy

Ftfy was created for one simple task: turning broken Unicode and useless characters into relevant, readable text.

For example:

â€œquoteâ€\x9d = "quote"
uÌˆ = ü
&lt;3 = <3

Instead of spending a lot of time processing text data by hand, we can use Ftfy to make sense of garbled content quickly.

SciPy

SciPy is not just a library; it is also an entire data science ecosystem.

That ecosystem includes many specialized toolkits. One of them is Scikit-learn, whose "preprocessing" package is well suited to data cleaning and dataset standardization.
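
As a brief sketch of dataset standardization with that package, StandardScaler rescales every column to zero mean and unit variance. The feature values below are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
features = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then applies (x - mean) / std to every value.
scaled = scaler.fit_transform(features)
```

After scaling, both columns carry the same weight, which matters for models that are sensitive to the magnitude of their inputs.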

Dabl

Dabl was developed by one of the core engineers of the scikit-learn project as a data analysis library that simplifies data exploration and preprocessing.

Dabl has a complete process for detecting certain data types and quality problems in a dataset and automatically applying the appropriate preprocessing.

It can handle missing values and convert categorical variables to numeric values, and it even has built-in visualization options for quick data exploration.

Imblearn

The last library we want to introduce is Imbalanced-learn (abbreviated Imblearn). It relies on Scikit-learn and provides tools for Python users who face classification problems with imbalanced classes.

Using a preprocessing technique called "undersampling", Imblearn goes through the data and removes samples from the over-represented class, so that the classes in the dataset become more balanced.

Summary

Our data analysis models are only as good as the data we feed them. The cleaner our data is, the simpler it is to process, analyze, and visualize, and making good use of tools makes our work easier and more pleasant.

The tools summarized above do not cover every data cleaning tool, but we only need to choose the ones that suit us. I hope this overview helps!