Predict whether hotel reviews are positive or negative:
1. The room is great, it's very close to the main road, and very convenient. Pretty good. (Positive, 0.9954)
2. The room is a little dirty, the toilet is still leaking, and the air conditioner fails to cool the room. I'll never come again. (Negative, 0.99)
3. The floor is not very clean and the TV has no signal, but the air conditioner is OK. Anyway, it's acceptable. (Positive, 0.56)
pip3 install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip3 install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple/
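jieba is installed above for Chinese word segmentation, although the examples that follow tokenize English text with NLTK. A minimal jieba sketch for reference; the Chinese sentence is only an illustrative assumption:
import jieba
# A hypothetical Chinese sentence, used only for illustration
text = "这家酒店的房间很干净，服务也不错。"
# lcut returns the segmented words as a Python list
words = jieba.lcut(text)
print(words)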
import nltk.tokenize as tk
# Split the text into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# Split the text into words; word_list: list of words
word_list = tk.word_tokenize(text)
# Split the text into words, separating punctuation; punctTokenizer: tokenizer object
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text)
doc = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."
# Separate sentences
sents = tk.sent_tokenize(doc)
for i in range(len(sents)):
    print(i+1, ':', sents[i])
""" 1 : Are you curious about tokenization? 2 : Let's see how it works! 3 : We need to analyze a couple of sentences with punctuations to see it in action. """
# Word segmentation
words = tk.word_tokenize(doc)
for i in range(len(words)):
    print(i+1, ':', words[i])
""" 1 : Are 2 : you 3 : curious 4 : about 5 : tokenization 6 : ? 7 : Let 8 : 's 9 : see 10 : how 11 : it 12 : works 13 : ! 14 : We 15 : need 16 : to 17 : analyze 18 : a 19 : couple 20 : of 21 : sentences 22 : with 23 : punctuations 24 : to 25 : see 26 : it 27 : in 28 : action 29 : . """
tokenizer = tk.WordPunctTokenizer()
words = tokenizer.tokenize(doc)
for i in range(len(words)):
    print(i+1, ':', words[i])
""" 1 : Are 2 : you 3 : curious 4 : about 5 : tokenization 6 : ? 7 : Let 8 : ' 9 : s 10 : see 11 : how 12 : it 13 : works 14 : ! 15 : We 16 : need 17 : to 18 : analyze 19 : a 20 : couple 21 : of 22 : sentences 23 : with 24 : punctuations 25 : to 26 : see 27 : it 28 : in 29 : action 30 : . """
This hotel is very bad. The toilet in this hotel smells bad. The environment of this hotel is very good.
This hotel is very bad.
The toilet in this hotel smells bad.
The environment of this hotel is very good.
import sklearn.feature_extraction.text as ft
# Build a bag-of-words model object
cv = ft.CountVectorizer()
# Train the model: every distinct word in the sentences becomes a feature name,
# each sentence is a sample, and the number of times a word appears in a sentence is the feature value
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Get all feature names
words = cv.get_feature_names()
import sklearn.feature_extraction.text as ft
sents = ['This hotel is very bad.',
         'The toilet in this hotel smells bad.',
         'The environment of this hotel is very good.']
cv = ft.CountVectorizer()
bow = cv.fit_transform(sents)
print(bow)
print(bow.toarray())
print(cv.get_feature_names())
""" (0, 9) 1 (0, 3) 1 (0, 5) 1 (0, 11) 1 (0, 0) 1 (1, 9) 1 (1, 3) 1 (1, 0) 1 (1, 8) 1 (1, 10) 1 (1, 4) 1 (1, 7) 1 (2, 9) 1 (2, 3) 1 (2, 5) 1 (2, 11) 1 (2, 8) 1 (2, 1) 1 (2, 6) 1 (2, 2) 1 [[1 0 0 1 0 1 0 0 0 1 0 1] [1 0 0 1 1 0 0 1 1 1 1 0] [0 1 1 1 0 1 1 0 1 1 0 1]] ['bad', 'environment', 'good', 'hotel', 'in', 'is', 'of', 'smells', 'the', 'this', 'toilet', 'very'] """
This hotel is great, the decoration is great, the breakfast is great, and the environment is great. → 1
This hotel sucks, it's rotten through and through, it really sucks. → 0
The hotel is well decorated, but the service is poor. → ?
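To tie such labeled samples back to the bag-of-words features above, here is a minimal sketch of a word-count sentiment classifier. It reuses the English hotel sentences from the earlier example as stand-ins, and the 0/1 labels are hypothetical:
import sklearn.feature_extraction.text as ft
import sklearn.linear_model as lm

# Hypothetical labels for the hotel sentences used earlier (1 = positive, 0 = negative)
train_sents = ['This hotel is very bad.',
               'The toilet in this hotel smells bad.',
               'The environment of this hotel is very good.']
train_labels = [0, 0, 1]

# Bag-of-words features + a linear classifier
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_sents)
model = lm.LogisticRegression()
model.fit(train_bow, train_labels)

# Score an unseen review; predict_proba gives the class probabilities
test_bow = cv.transform(['The hotel is very good.'])
print(model.predict(test_bow), model.predict_proba(test_bow))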
Document frequency: $DF = \frac{\text{number of document samples containing a given word}}{\text{total number of document samples}}$ (negatively correlated with the word's contribution to the sample's semantics)
Inverse document frequency: $IDF = \log\left(\frac{\text{total number of samples}}{1 + \text{number of samples containing a given word}}\right)$ (positively correlated with the word's contribution to the sample's semantics)
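As a rough illustration of these two formulas, here is a minimal numpy sketch that computes DF and IDF for the bag-of-words matrix printed earlier (hard-coded here so the snippet is self-contained). Note that sklearn's TfidfTransformer uses a smoothed IDF variant, so its values will differ:
import numpy as np

# Dense bag-of-words matrix from the earlier hotel example (samples x words)
bow = np.array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
                [1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
                [0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]])

n_samples = bow.shape[0]
# Number of documents containing each word
doc_count = (bow > 0).sum(axis=0)
# Document frequency and inverse document frequency as defined above
df = doc_count / n_samples
idf = np.log(n_samples / (1 + doc_count))
print(np.round(df, 3))
print(np.round(idf, 3))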
# Build a bag-of-words model object
cv = ft.CountVectorizer()
# Train the model: every distinct word in the sentences becomes a feature name,
# each sentence is a sample, and the number of times a word appears in a sentence is the feature value
bow = cv.fit_transform(sentences).toarray()
# Get a TF-IDF transformer and compute the TF-IDF matrix from the bag-of-words matrix
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
# Example: compute the TF-IDF matrix for the hotel sentences vectorized above
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
print(np.round(tfidf, 3))
print(cv.get_feature_names())
""" [[0.488 0. 0. 0.379 0. 0.488 0. 0. 0. 0.379 0. 0.488] [0.345 0. 0. 0.268 0.454 0. 0. 0.454 0.345 0.268 0.454 0. ] [0. 0.429 0.429 0.253 0. 0.326 0.429 0. 0.326 0.253 0. 0.326]] ['bad', 'environment', 'good', 'hotel', 'in', 'is', 'of', 'smells', 'the', 'this', 'toilet', 'very'] """
import numpy as np
import pandas as pd
import sklearn.datasets as sd
import sklearn.model_selection as ms
import sklearn.linear_model as lm
import sklearn.metrics as sm
# Load data set
data = sd.load_files('20news', encoding='latin1')
len(data.data) # 2968 Samples
""" 2968 """
import sklearn.feature_extraction.text as ft
# Organize the input and output sets: use TF-IDF to turn each document into a feature vector
cv = ft.CountVectorizer()
bow = cv.fit_transform(data.data)
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
# tfidf.shape # (2968,40605)
# Split test set and training set
train_x, test_x, train_y, test_y = ms.train_test_split(tfidf, data.target, test_size=0.1, random_state=7)
# Cross validation
model = lm.LogisticRegression()
scores = ms.cross_val_score(model, tfidf, data.target, cv=5, scoring='f1_weighted')
# f1 score
print(scores.mean()) # 0.9597980963781605
# Training models
model.fit(train_x, train_y)
# test model , Evaluation model
pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))
""" 0.9597980963781605 precision recall f1-score support 0 0.81 0.96 0.88 57 1 0.97 0.89 0.93 65 2 1.00 0.95 0.97 61 3 1.00 1.00 1.00 54 4 1.00 0.95 0.97 60 accuracy 0.95 297 macro avg 0.96 0.95 0.95 297 weighted avg 0.96 0.95 0.95 297 """
# Prepare a group of new samples to test the model
test_data = ["In the last game, the spectator was accidentally hit by a baseball injury and has been hospitalized.",
             "Recently, Lao Wang is studying asymmetric encryption algorithms.",
             "The two-wheeled car is pretty good on the highway."]
# Convert the samples into a TF-IDF matrix in the same way as during training, so they can be handed to the model for prediction
bow = cv.transform(test_data)
test_data = tt.transform(bow)
pred_test_y = model.predict(test_data)
print(pred_test_y)
print(data.target_names)
""" [2 0 1] ['misc.forsale', 'rec.motorcycles', 'rec.sport.baseball', 'sci.crypt', 'sci.space'] """
data.target_names
""" ['misc.forsale', 'rec.motorcycles', 'rec.sport.baseball', 'sci.crypt', 'sci.space'] """
Probability reflects how likely a random event is to occur. A random event is an event that, under the same conditions, may or may not happen. For example:
(1) Flipping a coin: it may land heads up or tails up; this is a random event. The likelihood of heads/tails landing up is called its probability.
(2) Rolling a die: the number that comes up is a random event. The likelihood of each number appearing is called its probability.
(3) A batch of goods contains both good and defective items. Drawing one at random, getting a good/defective item is a random event. After a large number of trials, the observed defective rate approaches a constant, and that constant is the probability (see the short simulation sketch below).
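A minimal numpy sketch of the idea in (3): the observed frequency of a simulated event settles toward its underlying probability as the number of trials grows (the 0.1 defective rate used here is an arbitrary assumption):
import numpy as np

rng = np.random.default_rng(7)
p_defective = 0.1  # assumed true defective rate

for n in (100, 10_000, 1_000_000):
    # Simulate n random draws; each item is defective with probability p_defective
    draws = rng.random(n) < p_defective
    print(n, 'trials -> observed defective rate:', draws.mean())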
We can denote random events as A or B; then P(A) and P(B) denote the probability that event A or event B occurs.
Joint probability refers to the probability that several conditions are involved and all of them hold at the same time. It is written as $P(A,B)$, $P(AB)$, or $P(A \cap B)$.
Given that event B has occurred, the probability that event A occurs is called the conditional probability, written as $P(A|B)$.
For example: P(rain | overcast).
If event A does not affect whether event B occurs, the two events are called independent, written as:
$P(AB) = P(A)P(B)$
Because A and B do not affect each other, it also holds that:
$P(A|B) = P(A)$
This can be understood as: whether or not B is given as a condition, the probability of A stays the same.
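A small worked example (assuming a fair coin, so each toss lands heads with probability 0.5 and the tosses are independent):
$P(\text{heads on toss 1 and heads on toss 2}) = P(\text{heads}) \times P(\text{heads}) = 0.5 \times 0.5 = 0.25$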
A prior probability is a probability obtained from previous experience and analysis. For example: with no other information, when guessing the surname of a stranger, guessing "Li" has the highest probability of being right (because Li is the most common surname in the country); this is a prior probability.
A posterior probability is the probability revised after certain conditions or information become known. For example: knowing that the person opposite comes from "Niujia Village", guessing that his surname is Niu now has the highest probability, although surnames such as Yang or Li are not ruled out; this is a posterior probability.
For something that has not yet happened, estimating how likely it is to happen is a prior probability (reasoning from cause to effect). For something that has already happened, estimating how likely it is that it was caused by a particular factor is a posterior probability (reasoning from effect back to cause). Prior and posterior probabilities are inseparable: the calculation of a posterior probability is based on a prior probability.
Bayes' theorem was proposed by the English mathematician Thomas Bayes. It is used to describe the relationship between two conditional probabilities, and is stated as:
$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$
where $P(A)$ and $P(B)$ are the probabilities of events A and B occurring, and $P(A|B)$ is a conditional probability: the probability that event A occurs given that event B has occurred. Derivation:
$P(A,B) = P(B)P(A|B)$
$P(B,A) = P(A)P(B|A)$
Here $P(A,B)$ is the joint probability: the probability that event B occurs multiplied by the probability that event A occurs given that B has occurred. Since $P(A,B) = P(B,A)$, we have:
$P(B)P(A|B) = P(A)P(B|A)$
Dividing both sides by $P(B)$ yields the expression of Bayes' theorem. Here $P(A)$ is the prior probability, and $P(A|B)$ is the conditional probability of A given that B has occurred, also called the posterior probability.
Suppose a school has 60% boys and 40% girls. Among the girls, the number wearing trousers equals the number wearing skirts, while all the boys wear trousers. A man randomly looks at a student in the distance and sees that the student is wearing trousers. What is the probability that this student is a girl?
P(girl) = 0.4
P(trousers | girl) = 0.5
P(trousers) = P(trousers | girl)P(girl) + P(trousers | boy)P(boy) = 0.5 × 0.4 + 1.0 × 0.6 = 0.8
P(girl | trousers) = P(girl)P(trousers | girl) / P(trousers) = 0.4 × 0.5 / 0.8 = 0.25
which is exactly Bayes' theorem applied: $P(A|B) = \frac{P(A)P(B|A)}{P(B)}$
import sklearn.naive_bayes as nb
# Create a Gaussian naive Bayes classifier object
model = nb.GaussianNB()
# Or create a multinomial naive Bayes classifier object
model = nb.MultinomialNB()
# Train the model
model.fit(x, y)
# Predict on new samples
result = model.predict(samples)
sklearn provides three commonly used naive Bayes classifiers, namely:
GaussianNB: better suited to training samples whose features follow a Gaussian (normal) distribution
MultinomialNB: better suited to training samples whose features follow a multinomial distribution (e.g. word counts)
BernoulliNB: better suited to training samples with binary (boolean) features
In the following example, the feature values of the samples are continuous and approximately normally distributed, so the GaussianNB model is used. The code is as follows:
# Naive Bayesian classification example
import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp
# Input and output lists
x, y = [], []
# Read data file
with open("../data/multiple1.txt", "r") as f:
for line in f.readlines():
data = [float(substr) for substr in line.split(",")]
x.append(data[:-1]) # The input samples : Take from the first column to the penultimate column
y.append(data[-1]) # The output samples : Take the last column
x = np.array(x)
y = np.array(y, dtype=int)
# Create Gaussian naive Bayesian classifier object
model = nb.GaussianNB()
model.fit(x, y) # Training
# Calculate the display range
left = x[:, 0].min() - 1
right = x[:, 0].max() + 1
bottom = x[:, 1].min() - 1
top = x[:, 1].max() + 1
grid_x, grid_y = np.meshgrid(np.arange(left, right, 0.01),
                             np.arange(bottom, top, 0.01))
mesh_x = np.column_stack((grid_x.ravel(), grid_y.ravel()))
mesh_z = model.predict(mesh_x)
mesh_z = mesh_z.reshape(grid_x.shape)
mp.figure('Naive Bayes Classification', facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x, grid_y, mesh_z, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
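The example above only visualizes the decision boundary. A minimal sketch of how the same model could also be scored with cross-validation, reusing the x and y arrays loaded above:
import sklearn.model_selection as ms

# 5-fold cross-validation on the same data, reporting a weighted F1 score
scores = ms.cross_val_score(nb.GaussianNB(), x, y, cv=5, scoring='f1_weighted')
print(scores.mean())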
import numpy as np
import pandas as pd
import sklearn.datasets as sd
import sklearn.model_selection as ms
import sklearn.linear_model as lm
import sklearn.metrics as sm
# Load data set
data = sd.load_files('20news', encoding='latin1')
import sklearn.feature_extraction.text as ft
# Organize the input and output sets: use TF-IDF to turn each document into a feature vector
cv = ft.CountVectorizer()
bow = cv.fit_transform(data.data)
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
# tfidf.shape # (2968,40605)
# Split test set and training set
train_x, test_x, train_y, test_y = ms.train_test_split(tfidf, data.target, test_size=0.1, random_state=7)
# Cross validation
# Using naive Bayes
import sklearn.naive_bayes as nb
model = nb.MultinomialNB()
scores = ms.cross_val_score(model, tfidf, data.target, cv=5, scoring='f1_weighted')
# f1 score
print(scores.mean())
# Training models
model.fit(train_x, train_y)
# test model , Evaluation model
pred_test_y = model.predict(test_x)
print(sm.classification_report(test_y, pred_test_y))
""" 0.9458384770112502 precision recall f1-score support 0 1.00 0.84 0.91 57 1 0.94 0.94 0.94 65 2 0.95 0.97 0.96 61 3 0.90 1.00 0.95 54 4 0.97 1.00 0.98 60 accuracy 0.95 297 macro avg 0.95 0.95 0.95 297 weighted avg 0.95 0.95 0.95 297 """
# Prepare a group of new samples to test the model
test_data = ["In the last game, the spectator was accidentally hit by a baseball injury and has been hospitalized.",
             "Recently, Lao Wang is studying asymmetric encryption algorithms.",
             "The two-wheeled car is pretty good on the highway.",
             "Next year, China will explore Mars."]
# Convert the samples into a TF-IDF matrix in the same way as during training, so they can be handed to the model for prediction
bow = cv.transform(test_data)
test_data = tt.transform(bow)
pred_test_y = model.predict(test_data)
print(pred_test_y)
print(data.target_names)
""" [2 3 1 4] ['misc.forsale', 'rec.motorcycles', 'rec.sport.baseball', 'sci.crypt', 'sci.space'] """
① Advantages
② Disadvantages