So far in this book, I have covered the Python programming foundations needed for data analysis. Because data analysts and scientists spend so much of their time on data wrangling and preparation, the book has focused on mastering those techniques.
Which library you choose for developing models depends on the application. Many statistical problems can be solved with simple methods such as ordinary least squares regression, while others may call for more complex machine learning methods. Fortunately, Python has become one of the preferred languages for implementing these analytical methods, so there are many tools you can explore after finishing this book.
In this chapter, I will review some pandas features that may come in handy when you move back and forth between data wrangling in pandas and model fitting and scoring. I will then briefly introduce two popular modeling toolkits, statsmodels and scikit-learn. Each of these projects deserves a book of its own, so I will not attempt a comprehensive introduction. Instead, I recommend studying the online documentation of both projects, along with other Python-based books on data science, statistics, and machine learning.
A common workflow for model development is to use pandas for data loading and cleaning, and then switch to a modeling library to build the model itself. An important part of model development is what machine learning calls "feature engineering". The term can describe any data transformation or analysis that extracts information from a raw dataset that may be useful in modeling. The data aggregation and GroupBy tools are often used in feature engineering.
The details of good feature engineering are beyond the scope of this book, but I will show some methods that make it easier to switch between data manipulation in pandas and modeling.
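As a rough illustration of how GroupBy can feed feature engineering, here is a minimal sketch. The DataFrame and its column names (`customer`, `amount`) are made up for this example and do not come from the book's datasets.

```python
import pandas as pd

# Hypothetical raw data: one row per purchase (illustrative only)
raw = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b", "c"],
    "amount": [10.0, 20.0, 5.0, 5.0, 20.0, 7.0],
})

# Aggregate per customer to derive candidate model features
features = raw.groupby("customer").agg(
    n_purchases=("amount", "count"),
    avg_amount=("amount", "mean"),
    total_amount=("amount", "sum"),
)

# Broadcast a group-level statistic back onto each original row
customer_mean = raw.groupby("customer")["amount"].transform("mean")
raw["amount_vs_customer_mean"] = raw["amount"] - customer_mean

print(features)
print(raw)
```

The first step yields one row per customer, which suits models that predict something about each customer. The `transform` step keeps the original row count, which suits models that work on individual purchases.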
The point of contact between pandas and other analysis libraries is usually the NumPy array. To convert a DataFrame to a NumPy array, use the .values attribute:
In [10]