Resource download address: https://download.csdn.net/download/sheziqiong/85836603
Topic selection: design and implement an automatic text classification method based on text content / sentiment.
Specific goal: implement a classifier that performs a simple binary classification of microblog text, labeling each post as positive or negative.
Microblog text differs from formal text: as text produced in an online community, it is non-standard, colloquial, and mixes symbols with words. Its characteristics can be summarized in the following four points:
Points 2 and 4 make processing easier; in particular, textual emoticons can raise the weight of sentiment words. Points 1 and 3 pose challenges: the corpus should be as large as possible so that it covers more language phenomena.
As noted in the interim report, because sentences contain text unrelated to emotional expression, applying a classifier directly does not work well on long text; such text needs to be filtered out. Moreover, traditional classification methods ignore the semantic relationships between words, which inflates the required size of the training set, so good results are hard to achieve without a large microblog corpus.
This project therefore combines word vectors with traditional classification methods. A word vector table is trained on a large-scale corpus; the vector of a whole microblog post is then obtained by averaging the word vectors of the words it contains, and this vector is used as the classifier's input. At the same time, since the input no longer has to be the complete text, filtering out unwanted text becomes possible: deleting conjunctions and prepositions does not affect the key information in a sentence. The word vector model captures semantic relationships, which fits the fact that emotion is essentially a semantic phenomenon. Traditional classification methods also train faster than deep learning methods and consume fewer computing resources, with no decisive difference in effectiveness, so they are suitable for running on a personal computer.
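As a minimal sketch of this idea (the names below are illustrative, not taken from the project code), the feature vector of a post can be computed as the mean of the vectors of its words, looked up in a trained gensim KeyedVectors table:

import numpy as np

def text_vector(words, wv, dim=400):
    # Average the vectors of the words found in the table `wv` (gensim KeyedVectors);
    # words missing from the table are skipped, and an all-zero vector is the fallback.
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)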
Wikipedia Chinese corpus: used to train the word vector table
Chinese microblog sentiment analysis sample data from the Second CCF Conference on Natural Language Processing and Chinese Computing: the source of this project's microblog corpus
Chinese stop word list StopWord.txt, published by the Chinese natural language processing open platform of the Institute of Computing Technology, Chinese Academy of Sciences: used to filter out useless text
First, the word vector table is trained on the Wikipedia Chinese corpus with the help of gensim, an open-source Python module that provides methods for training word vectors on the Wikipedia corpus.
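The following is a hedged sketch of that step, assuming a downloaded zhwiki dump with the usual file name (the original does not specify it): gensim's WikiCorpus strips the wiki markup, jieba segments the extracted Chinese text, and Word2Vec trains the 400-dimensional table.

import jieba
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec

DUMP = 'zhwiki-latest-pages-articles.xml.bz2'   # assumed dump file name

class SegmentedWiki:
    # Iterate over Wikipedia articles and segment each one into Chinese words.
    # A restartable iterable is needed because Word2Vec passes over the data several times.
    def __init__(self, dump_path):
        self.wiki = WikiCorpus(dump_path, dictionary={})  # dictionary={} skips dictionary building
    def __iter__(self):
        for tokens in self.wiki.get_texts():
            yield list(jieba.cut(''.join(tokens)))

# Train the 400-dimensional word vector table (gensim 4.x parameter names).
model = Word2Vec(SegmentedWiki(DUMP), vector_size=400, window=5, min_count=5, workers=4)
model.save('wiki.zh.word2vec.model')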
The microblog corpus is then preprocessed. The downloaded sample data is not a simple positive/negative binary classification; its categories are anger, disgust, fear, happiness, like, sadness, surprise, and none (neutral). Anger, disgust, fear, and sadness are grouped as negative, and happiness and like as positive, yielding a balanced set of 1029 positive and 1029 negative sentences, saved as pos.txt and neg.txt.
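A hedged sketch of this relabeling, assuming the sample data has already been flattened into lines of the form "label<TAB>sentence" (the actual file format is not described in the original):

NEGATIVE = {'anger', 'disgust', 'fear', 'sadness'}
POSITIVE = {'happiness', 'like'}

with open('samples.txt', encoding='utf-8') as fin, \
     open('pos.txt', 'w', encoding='utf-8') as pos, \
     open('neg.txt', 'w', encoding='utf-8') as neg:
    for line in fin:
        label, sentence = line.rstrip('\n').split('\t', 1)
        if label in POSITIVE:
            pos.write(sentence + '\n')
        elif label in NEGATIVE:
            neg.write(sentence + '\n')
        # 'surprise' and 'none' are left out of the binary task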
The positive and negative corpora are segmented with the help of jieba (a Chinese word segmentation module); the stop word list is used to clean the text; finally, each sentence's feature vector is obtained by looking its words up in the word vector table and averaging. The resulting feature vectors have 400 dimensions. To save computing resources and speed up training, PCA analysis shows that about 100 dimensions retain almost all of the information, so the vectors are reduced to 100 dimensions.
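A sketch of this preprocessing under stated assumptions: `wv` is the word vector table trained above, and the file names follow the description.

import jieba
import numpy as np
from sklearn.decomposition import PCA

with open('StopWord.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

def corpus_vectors(path, wv, dim=400):
    # Segment each sentence, drop stop words, and average the remaining word vectors.
    feats = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            words = [w for w in jieba.cut(line.strip()) if w not in stopwords]
            vecs = [wv[w] for w in words if w in wv]
            feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.array(feats)

X = np.vstack([corpus_vectors('pos.txt', wv), corpus_vectors('neg.txt', wv)])
y = np.array([1] * 1029 + [0] * 1029)               # 1 = positive, 0 = negative

X_reduced = PCA(n_components=100).fit_transform(X)  # 400 -> 100 dimensions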
The data are split randomly into training and validation sets at a ratio of 95:5. The validation set contains 102 sentences, of which 48 are negative and 54 are positive.
Three classification models are compared: an SVM, a BP neural network, and a random forest. Each model is trained on the training set and then evaluated on the validation set, as sketched below.
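A minimal sketch of this comparison with scikit-learn, assuming X_reduced and y from the previous step; the original hyperparameters are not given, so the settings below are placeholders.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# 95:5 random split of the reduced feature vectors.
X_train, X_val, y_train, y_val = train_test_split(X_reduced, y, test_size=0.05, random_state=42)

models = {
    'SVM': SVC(kernel='rbf'),
    'BP neural network': MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
    'Random forest': RandomForestClassifier(n_estimators=100),
}

predictions = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)                 # train on the training set
    predictions[name] = clf.predict(X_val)    # predict on the validation set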
1. Evaluation metrics
Table 1: Confusion matrix

                       Predicted positive    Predicted negative
Actually positive      TP                    FN
Actually negative      FP                    TN
TP and TN are the numbers of texts whose predicted class agrees with the actual label (true positives and true negatives); FP and FN are the numbers of texts whose predicted class disagrees with the actual label (false positives and false negatives).
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
F1 score: F1 = 2PR / (P + R)
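These metrics can be computed directly from the confusion matrix; in the sketch below, y_val and the per-model predictions are assumed to come from the comparison sketched earlier.

from sklearn.metrics import confusion_matrix

for name, pred in predictions.items():
    # For labels {0, 1}, ravel() returns TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_val, pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print('%s: P=%.3f  R=%.3f  F1=%.3f' % (name, precision, recall, f1))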
2. Experimental results
The results on the validation set are shown in the table below:
Table 2: Validation results
3. Analysis of experimental results