Resource download: https://download.csdn.net/download/sheziqiong/85705774
This is an introductory project for learning about:
Text feature engineering
Image feature engineering
The basic data cleaning process
The project modeling process
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20050 entries, 0 to 20049
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   gender         19953 non-null  object
 1   description    16306 non-null  object
 2   link_color     20050 non-null  object
 3   profileimage   20050 non-null  object
 4   sidebar_color  20050 non-null  object
 5   text           20050 non-null  object
dtypes: object(6)
memory usage: 940.0+ KB
None
The dataset has 20,050 rows and 6 columns:
gender: the user's gender, which is the prediction target
description: the user's self-description
link_color: the user's theme (link) color
profileimage: link to the user's Twitter avatar
sidebar_color: the user's sidebar color
text: content the user published on Twitter
1.1 Filter the data by the 'gender' column
1.2 Filter out rows whose 'description' column is empty
1.3 Filter out rows whose 'link_color' or 'sidebar_color' column is not a valid hexadecimal value
1.4 Clean the text data
1.5 Use the profileimage link to determine whether the avatar image is valid
1.6 Replace male->0, female->1
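The cleaning steps above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the helper names clean_text and clean_users are hypothetical, the label strings 'male' and 'female' are inferred from step 1.6, and the text-cleaning rules (dropping URLs and punctuation) are an assumption. Step 1.5 (checking the avatar link) is left out because it needs network access.

```python
import re

import pandas as pd

# Assumption: a valid color is exactly six hexadecimal digits, e.g. "0084B4".
HEX_PATTERN = r"[0-9a-fA-F]{6}"


def clean_text(text: str) -> str:
    """Step 1.4 (assumed rules): drop URLs and punctuation, lowercase, squeeze spaces."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return " ".join(text.lower().split())


def clean_users(df: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 1.1, 1.2, 1.3 and 1.6 to the raw user table."""
    # 1.1 Keep only rows labelled male or female.
    df = df[df["gender"].isin(["male", "female"])]
    # 1.2 Drop rows whose description is missing or blank.
    df = df[df["description"].fillna("").str.strip() != ""]
    # 1.3 Keep only rows where both color columns are valid hex strings.
    for col in ("link_color", "sidebar_color"):
        df = df[df[col].astype(str).str.fullmatch(HEX_PATTERN, na=False)]
    # 1.6 Encode the label: male -> 0, female -> 1.
    df = df.assign(gender=df["gender"].map({"male": 0, "female": 1}))
    return df.reset_index(drop=True)
```

Filtering first and encoding the label last keeps the mapping free of missing values, since every remaining row is either male or female.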
Split the dataset, tokenize the text, and remove stop words
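A minimal sketch of this step, under several assumptions: the split ratio, the regex tokenizer, and the use of scikit-learn's built-in English stop word list are all illustrative choices, and the sample texts are made up. The original project may well use a different tokenizer or stop word list.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split


def tokenize(text: str) -> list:
    """Lowercase, split into word tokens, and drop English stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


# Made-up sample data: 8 texts with balanced 0/1 labels.
texts = [
    "The cat is on the mat", "I love football", "Coffee and books",
    "Watching the game tonight", "New shoes today", "Fixing my car",
    "Baking a cake", "Playing video games",
]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

# Hold out 25% of the rows for testing, keeping the class balance.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

train_tokens = [tokenize(t) for t in train_texts]
test_tokens = [tokenize(t) for t in test_texts]
```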
Feature Engineering
3.1 Training data feature extraction
3.1.1 Text data
Extract TF-IDF features from the description text
Extract TF-IDF features from the text column
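One way to build the two TF-IDF blocks is with scikit-learn's TfidfVectorizer, using one vectorizer per column. This is a sketch only: the sample strings are made up and max_features=1000 is an arbitrary assumption, not a value taken from the project.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training rows: one description and one tweet per user.
train_desc = ["coffee lover and writer", "football fan", "mother of two"]
train_text = ["great game last night", "new blog post today", "coffee time"]

# A separate vectorizer per column, so each keeps its own vocabulary.
desc_vectorizer = TfidfVectorizer(max_features=1000)
text_vectorizer = TfidfVectorizer(max_features=1000)

X_desc = desc_vectorizer.fit_transform(train_desc)
X_text = text_vectorizer.fit_transform(train_text)

# Place the two sparse blocks side by side: one row per user.
X_train_text = hstack([X_desc, X_text]).tocsr()
```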
3.1.2 Image data
RGB features of the link color
RGB histogram features of the avatar image
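The two image features could look something like the sketch below. The function names hex_to_rgb and rgb_histogram are hypothetical, the bin count of 8 is an assumption, and the avatar is taken to be already decoded into an H x W x 3 uint8 array (downloading and decoding the image is not shown).

```python
import numpy as np


def hex_to_rgb(hex_color: str) -> np.ndarray:
    """Convert a 6-digit hex color such as '0084B4' to RGB values in [0, 1]."""
    value = hex_color.lstrip("#")
    channels = [int(value[i:i + 2], 16) for i in (0, 2, 4)]
    return np.array(channels, dtype=float) / 255.0


def rgb_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenate normalized per-channel histograms of an H x W x 3 image."""
    hists = []
    for channel in range(3):
        values = image[..., channel]
        counts, _ = np.histogram(values, bins=bins, range=(0, 256))
        # Divide by the pixel count so each channel's histogram sums to 1.
        hists.append(counts / values.size)
    return np.concatenate(hists)
```

Normalizing each histogram by the pixel count makes avatars of different sizes comparable.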
Combine text features and image features
Feature range normalization
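Combining and normalizing might be done as in the sketch below. It assumes dense NumPy feature blocks with made-up values and uses MinMaxScaler to map each column to [0, 1]; the source does not say which scaler the project uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature blocks for 3 users.
text_features = np.array([[0.0, 0.5], [0.2, 0.1], [0.9, 0.3]])
image_features = np.array([[10.0, 200.0, 30.0],
                           [50.0, 20.0, 90.0],
                           [255.0, 128.0, 0.0]])

# Concatenate column-wise: each row is still one user.
X_train = np.hstack([text_features, image_features])

# Rescale every column to the range [0, 1].
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
```

Scaling matters here because raw pixel-based values would otherwise dwarf the TF-IDF weights.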
3.2 Test data feature extraction: same as for the training set
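"Same as the training set" usually means reusing the objects fitted on the training data rather than fitting new ones, so that the test features share the same columns. A small sketch with made-up documents, assuming a TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["love coffee and books", "football game tonight"]
test_docs = ["coffee tonight", "unseen words only"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training data only.
X_train = vectorizer.fit_transform(train_docs)
# Reuse the fitted vectorizer: call transform, not fit_transform.
X_test = vectorizer.transform(test_docs)
```

Words that never appeared in the training data are simply ignored, so the second test document becomes an all-zero row.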
3.3 PCA dimensionality reduction
Using the features without PCA
Using the features after PCA
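The PCA variant could be sketched as below. The random data stands in for the real feature matrix, and n_components=3 is an arbitrary assumption; the source does not state how many components the project keeps.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the normalized feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 10))
X_test = rng.normal(size=(20, 10))

# Fit the projection on the training data, then apply it to both sets.
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Fraction of the training variance kept by the 3 components.
kept_variance = pca.explained_variance_ratio_.sum()
```

Training one model on X_train and another on X_train_pca is a simple way to compare the two settings listed above.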
Model: lr_model = LogisticRegression()
Model test
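Training and testing the model might look like the following sketch. Only the lr_model = LogisticRegression() line comes from the text above; the synthetic two-cluster data, the 80/20 split, and the use of accuracy as the metric are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two well-separated clusters labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 4)),
               rng.normal(loc=2.0, size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Default settings, as in the model line above.
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Evaluate on the held-out rows.
y_pred = lr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```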
Delete the extracted data to free up disk space
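The cleanup can be as simple as removing the extraction directory. In this sketch the helper name remove_extracted_data is hypothetical, and a temporary directory stands in for wherever the archive was actually extracted.

```python
import shutil
import tempfile
from pathlib import Path


def remove_extracted_data(directory: Path) -> None:
    """Delete the extracted data directory and everything in it, if present."""
    if directory.is_dir():
        shutil.rmtree(directory)


# Stand-in for the real extraction directory.
extracted_dir = Path(tempfile.mkdtemp())
(extracted_dir / "sample.csv").write_text("gender,text\n")

remove_extracted_data(extracted_dir)
```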