您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Chinese emotion of Hotel Comments Based on Python analysis

編輯：Python

Resource download address ：https://download.csdn.net/download/sheziqiong/85820698
Resource download address ：https://download.csdn.net/download/sheziqiong/85820698

1 Development environment preparation

1.1 Python Environmental Science

stay python Official website https://www.python.org/downloads/ Download the corresponding python edition , I use Python2.7.13 Version of .

1.2 Third-party module

The implementation of this example code uses several well-known third-party modules , The main modules are as follows ：

1）Jieba
Currently the most widely used Chinese word segmentation component . Download address ：https://pypi.python.org/pypi/jieba/
2）Gensim
For theme models 、 Document index and large corpus similarity index python library , It is mainly used in natural language processing （NLP） And information retrieval （IR）. Download address ：https://pypi.python.org/pypi/gensim
This module is required for wiki Chinese corpus processing and Chinese word vector model construction in this example .
3）Pandas
For efficient processing of large datasets 、 Perform data analysis tasks python library , Is based on Numpy The toolkit . Download address ：https://pypi.python.org/pypi/pandas/0.20.1
4）Numpy
Toolkit for storing and processing large matrices . Download address ：https://pypi.python.org/pypi/numpy
5）Scikit-learn
For machine learning python tool kit ,python The module reference name is sklearn, Before installation, you also need Numpy and Scipy Two Python library . Official website address ：http://scikit-learn.org/stable/
6）Matplotlib
Matplotlib It's a python The graphic framework of , Used to draw two-dimensional graphics . Download address ：https://pypi.python.org/pypi/matplotlib
7）Tensorflow
Tensorflow Is an open source software library that uses data flow diagrams for numerical calculations , Used in the field of artificial intelligence .
Official website address ：http://www.tensorfly.cn/
Download address ：https://pypi.python.org/pypi/tensorflow/1.1.0

2 Data acquisition

2.1 A dictionary of stop words

This paper uses the Chinese stop words list published by the Chinese naturallanguageprocessing open platform of the Institute of computing, Chinese Academy of Sciences , Contains 1208 Stop words . Download address ：http://www.hicode.cc/download/view-software-13784.html

2.2 Positive and negative corpora

Text from http://www.datatang.com/data/11936 download “ Hotel comment corpus about Chinese emotion mining ” As training set and test set , This corpus contains 4 Language material subset , In this paper, positive and negative 1000 A balanced corpus of （ChnSentiCorp_htl_ba_2000） Analysis as a data set .

3 Data preprocessing

3.1 Positive and negative corpus preprocessing

Download and unzip ChnSentiCorp_htl_ba_2000.rar file , The resulting folder contains neg（ Negative corpus ） and pos（ Positive corpus ） Two folders , Each comment in the folder is a txt file , In order to facilitate the later operation , It is necessary to normalize the positive and negative comments to the corresponding one respectively txt In file , That is, the collection document of forward corpus （ Name it 2000_pos.txt） And negative corpus （ Name it 2000_neg.txt）.
Specifically Python The implementation code is as follows ：

After the operation is completed, you will get 2000_pos.txt and 2000_neg.txt Two text files , Store positive and negative comments respectively , Each comment is on one line . The screenshot of the document section is shown below ：

3.2 Chinese text segmentation

In this paper Stuttering participle Word segmentation is performed on positive and negative corpus respectively . Particular attention , Before executing the code, you need to put txt Source files are manually converted to UTF-8 Format , Otherwise, the Chinese code error will be reported . Before the participle , You need to remove numbers from the text 、 Handling of letters and special symbols , Use python Self contained string and re Modules can implement , among string Module is used to handle string operations ,re Module for regular expression processing .
The specific implementation code is as follows ：

After processing , obtain 2000_pos_cut.txt and 2000_neg_cut.txt Two txt file , Store the results of word segmentation of positive and negative corpus respectively . The screenshot of the word segmentation result is shown below ：

3.3 To stop using words

When the participle is finished , You can read the stop words in the stop words list , Match the positive and negative corpus after word segmentation and remove the stop words . The process of removing stop words is very simple , There are two main ones ：

1） Read the stoplist ;
2） Traverse the sentence after the participle , Drop each word into this table to match , If the stoplist exists, it will be replaced with empty .

The specific implementation code is as follows ：

According to the code , The acquisition of the stoplist uses python Peculiar radio broadcast form , One sentence of code is enough ：

stopkey = [w.strip() for w in codecs.open('data\stopWord.txt', 'r', encoding='utf-8').readlines()]

Every stop word read out must be de signed, that is w.strip(), Because the stop words read out also contain line breaks and tabs , If it is not processed, it will not match . After the code is executed , obtain 2000_neg_cut_stopword.txt and 2000_pos_cut_stopword.txt Two txt file .

Because the step of removing the stop word is performed after the sentence word segmentation , Therefore, it is usually carried out in the same code segment as the word segmentation operation , That is, after the sentence word segmentation operation is completed, directly call the function to stop the word , And get the result after removing the stop word , Then write to the result file . This article is to facilitate the understanding of the steps and separate them into two code files for execution , You can adjust it according to your own needs .

3.4 Get feature word vector

According to the above steps, we get the characteristic word text of the positive and negative corpus , The input of the model must be numerical data , Therefore, each sentence composed of words needs to be transformed into a numerical vector . Common conversion algorithms are Bag of Words(BOW)、TF-IDF、Word2Vec. In this paper Word2Vec Word vector model converts corpus into word vector .

Because the extraction of feature word vector is based on the trained word vector model , and wiki Chinese corpus is recognized as a large-scale Chinese corpus , This paper intends to start from wiki From the word vectors generated by Chinese corpus, the feature word vectors of this corpus are extracted .Wiki Of Chinese corpus Word2vec A previous article on model training “ utilize Python Realization wiki Of Chinese corpus word2vec model building ” It is described in detail in , I won't go into details here . That is, from the end of the article wiki.zh.text.vector Extracting feature word vector as input of model .

The main steps to obtain the feature word vector are as follows ：

1） Read the model word vector matrix ;
2） Traverse every word in the statement , Extract the numerical vector of the current word from the model word vector matrix , One statement can get a two-dimensional matrix , The number of lines is the number of words , The number of columns is the dimension set for the model ;
3） Calculate according to the obtained matrix Matrix mean As the feature word vector of the current sentence ;
4） After all statements are calculated , The values represented by the concatenation statement category , write in csv In file .

The main code is shown in the figure below ：

After the code is executed , Get one called 2000_data.csv The file of , The first column is the value corresponding to the category （1-pos, 0-neg）, The second column starts with a numeric vector , Each line represents a comment . A partial screenshot of the results is shown below ：

3.5 Dimension reduction

Word2vec The model sets 400 The dimension of training , The resulting word vector is 400 dimension , In this paper PCA The algorithm reduces the dimension of the result . The specific implementation code is as follows ：

Run code , According to the result graph, before 100 Dimension can better contain most of the original data , So before selecting 100 Dimension as input to model .

4 Classification model building

This paper uses support vector machine （SVM） As the Chinese text classification model of this experiment , Other classification models use the same analysis process , I won't go into details here .

Support vector machine （SVM） It's a supervised machine learning model . Firstly, this paper adopts the classical machine learning algorithm SVM As a classifier algorithm , By calculating the prediction accuracy and ROC Curve to verify the effectiveness of the classifier , Generally speaking ROC The area of the curve （AUC） The larger the model, the better .

use first SVM As a classifier algorithm , Then use matplotlib and metric Library to build ROC curve . Specifically python The code is as follows ：

Run code , obtain Test Accuracy: 0.88, That is to say, the prediction accuracy of the experimental test set is 88%,ROC The curve is shown in the figure below .

Resource download address ：https://download.csdn.net/download/sheziqiong/85820698
Resource download address ：https://download.csdn.net/download/sheziqiong/85820698