
Machine Learning - Training a Naive Bayes Classifier with Python


The Naive Bayes algorithm

Naive Bayes is a generative model. From the given sample data it estimates the probability of each feature appearing in each class, computes the joint probability from those estimates, and picks the class that is most likely. It can be used for multi-class classification problems.

How Naive Bayes works

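It is based on conditional probability (Bayes' theorem), together with the "naive" assumption that the features are conditionally independent given the class.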
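For reference, the rule can be written out as below. The notation is chosen here for illustration and does not come from the original article: $c$ is a class, and $w_1, \dots, w_n$ are the words of the document being classified.

```latex
% Bayes' theorem for a class c and a document (w_1, ..., w_n)
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}

% Naive assumption: words are conditionally independent given the class
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)

% Predicted class: the one with the largest score
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator is the same for every class, so it can be dropped when only the most likely class is needed.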

Training data and test data

The training data and the test data are stored in TextData.txt and ceshiData.txt, respectively.

TextData.txt

C and J are the class labels.

ceshiData.txt

Solution process

The documents labeled C contain 8 words in total: Chinese 5 times, Beijing once, Shanghai once, and Macao once.

The documents labeled J contain 3 words in total: Chinese once, Tokyo once, and Japan once.

So the features across all classes in the data are Chinese, Beijing, Shanghai, Macao, Tokyo, and Japan: 6 features in total.

Next, count how many times the label C, the label J, and each of the 6 features appear in the whole data set.
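A minimal sketch of this counting step follows. The contents of TextData.txt are not reproduced in this article, so the `rows` list is a hypothetical data set, chosen only so that its word counts match the totals stated above; the row layout (an ID, then the words, then the label) is inferred from how the training code indexes each row.

```python
from collections import Counter

# HYPOTHETICAL rows in the layout [id, word, ..., word, label].
# They are an assumption, built to match the word counts given in the text.
rows = [
    ["1", "Chinese", "Beijing", "Chinese", "C"],
    ["2", "Chinese", "Chinese", "Shanghai", "C"],
    ["3", "Chinese", "Macao", "C"],
    ["4", "Tokyo", "Japan", "Chinese", "J"],
]

# Number of documents carrying each label
label_counts = Counter(row[-1] for row in rows)

# For each label, how often each word occurs (column 0 is the ID, last is the label)
word_counts = {}
for row in rows:
    word_counts.setdefault(row[-1], Counter()).update(row[1:-1])

# Every distinct word seen in any class
vocabulary = sorted({word for counter in word_counts.values() for word in counter})

print(label_counts)
print(word_counts["C"])
print(word_counts["J"])
print(vocabulary)
```

`Counter` is used here only for brevity; the article's own code reaches the same counts with plain lists and `list.count`.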

Then use Laplace (add-one) smoothing to compute the probability of each feature under each class, so that a word that never occurs in a class does not end up with probability zero.
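As a worked sketch, the smoothed estimate is (count of the word in the class + 1) / (total words in the class + number of features). The counts below are the ones given in the walkthrough; the helper name `smoothed` is made up for this example, and `Fraction` is used only to keep the results exact.

```python
from fractions import Fraction

# Word counts per class, as stated in the walkthrough
counts = {
    "C": {"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1},
    "J": {"Chinese": 1, "Tokyo": 1, "Japan": 1},
}
vocabulary = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]

def smoothed(word, label):
    """Add-one (Laplace) estimate of P(word | label)."""
    total = sum(counts[label].values())  # 8 words for C, 3 words for J
    return Fraction(counts[label].get(word, 0) + 1, total + len(vocabulary))

for label in counts:
    for word in vocabulary:
        print(label, word, smoothed(word, label))
```

For example, P(Chinese | C) = (5 + 1) / (8 + 6) = 3/7, while Tokyo, which never occurs under C, still gets (0 + 1) / (8 + 6) = 1/14. The smoothed probabilities within each class sum to 1.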

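Finally, make the prediction: for each class, multiply the class prior by the smoothed probabilities of the words in the test document, and choose the class with the largest product.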
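The prediction step can be sketched as follows. Two inputs are assumptions, because the article does not state them: the class priors (taken here as 3 documents labeled C and 1 labeled J) and the test document itself (the contents of ceshiData.txt are not shown). Only the per-class word counts come from the text.

```python
from fractions import Fraction

# Word counts per class, as stated in the walkthrough
counts = {
    "C": {"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1},
    "J": {"Chinese": 1, "Tokyo": 1, "Japan": 1},
}
vocab_size = 6

# ASSUMPTION: 3 training documents labeled C and 1 labeled J
priors = {"C": Fraction(3, 4), "J": Fraction(1, 4)}
# ASSUMPTION: a hypothetical test document
test_words = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

def score(label):
    """Prior times the Laplace-smoothed probability of every test word."""
    total = sum(counts[label].values())
    result = priors[label]
    for word in test_words:
        result *= Fraction(counts[label].get(word, 0) + 1, total + vocab_size)
    return result

scores = {label: score(label) for label in counts}
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

Under these assumptions the score for C is 3/4 * (3/7)^3 * (1/14)^2 = 81/268912 (about 0.0003) and the score for J is 1/4 * (2/9)^5 = 8/59049 (about 0.00014), so the sketch predicts C.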

Code

Data loading function:

def loadDataSetx(fileName):
    dataMat=[]
    # Read the file line by line; 'with' closes the file when done
    with open(fileName) as fr:
        for line in fr:
            # Split the current row on whitespace into a list of tokens
            curLine=line.strip().split()
            # Skip blank lines; store every other row in dataMat
            if curLine:
                dataMat.append(curLine)
    return dataMat

Training function:

def train(dataMat):
    # Collect the class label (last column) of every training row
    allType_yuan=[]
    for i in range(len(dataMat)):
        allType_yuan.append(dataMat[i][-1])
    # Remove duplicate labels
    allType=list(set(allType_yuan))
    # Prior probability of each class: its share of the training rows
    allType_pinlv=[]
    for i in range(len(allType)):
        allType_pinlv.append(float(allType_yuan.count(allType[i])/len(allType_yuan)))
    # print(allType_pinlv)
    # All distinct words in the sample (column 0 is skipped, the last column is the label)
    attribute=[]
    for i in range(len(dataMat)):
        for j in range(1,len(dataMat[i])-1):
            attribute.append(dataMat[i][j])
    attribute=list(set(attribute))
    # Group the raw data by class: each entry is [label, word, word, ...]
    type_dataMat=[]
    for i in range(len(allType)):
        data=[]
        data.append(allType[i])
        for j in range(len(dataMat)):
            if dataMat[j][-1]==allType[i]:
                for k in range(1,len(dataMat[j])-1):
                    data.append(dataMat[j][k])
        type_dataMat.append(data)
    # print(type_dataMat)
    # Frequency of each word in each class, with Laplace smoothing:
    # (count + 1) / (number of distinct words + number of words in the class)
    attr_pinlv=[]
    for i in range(len(allType)):
        attr_pinlv_1={}
        attr_pinlv_1['Type']=allType[i]
        # Element 0 is the label, so count words from index 1 onward
        words=type_dataMat[i][1:]
        for j in range(len(attribute)):
            attr_pinlv_1[attribute[j]]=float((words.count(attribute[j])+1)/(len(attribute)+len(words)))
        attr_pinlv.append(attr_pinlv_1)
    return allType_pinlv,attr_pinlv

Test function:

# Classify the test data
def ceshi(fileName,allType_pinlv,attr_pinlv):
    ceshi=loadDataSetx(fileName)
    ceshi_result=[]
    # The i-th test row
    for i in range(len(ceshi)):
        ceshi_result_1={}
        for j in range(len(allType_pinlv)):
            # Start from the prior probability of class j
            pinlv=1
            # print(allType_pinlv[j])
            pinlv*=allType_pinlv[j]
            # print(pinlv)
            yangben_data=attr_pinlv[j]
            # The k-th word of the test row (column 0 is skipped)
            for k in range(1,len(ceshi[i])):
                # Skip words never seen in training; indexing them would raise a KeyError
                if ceshi[i][k] in yangben_data:
                    pinlv*=yangben_data[ceshi[i][k]]
            ceshi_result_1[yangben_data['Type']]=pinlv
        ceshi_result.append(ceshi_result_1)
    print(ceshi_result)
    # For each test row, report the class with the highest score
    for i in range(len(ceshi_result)):
        maxPinlv=max(ceshi_result[i].values())
        for key,value in ceshi_result[i].items():
            if value==maxPinlv:
                print('Predicted class for test row '+str(i+1)+': '+key)
                break
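Multiplying many probabilities, as the test function does, can underflow to 0.0 for long documents (for instance, `0.01 ** 200` evaluates to `0.0` in floating point). A common alternative, sketched here as a variant rather than as the article's method, is to add logarithms instead. The function name `log_score` and the sample values are hypothetical; the word probabilities are the Laplace-smoothed values implied by the counts in the walkthrough, and the priors 0.75 and 0.25 are assumed.

```python
import math

def log_score(prior, word_probs, words):
    """Return log(prior) plus the sum of log P(word | class) over known words."""
    total = math.log(prior)
    for word in words:
        if word in word_probs:  # skip words never seen in training
            total += math.log(word_probs[word])
    return total

# Smoothed probabilities implied by the walkthrough counts (only the words needed here)
probs_c = {"Chinese": 3 / 7, "Tokyo": 1 / 14, "Japan": 1 / 14}
probs_j = {"Chinese": 2 / 9, "Tokyo": 2 / 9, "Japan": 2 / 9}

# ASSUMPTION: hypothetical test document and priors
words = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
scores = {
    "C": log_score(0.75, probs_c, words),
    "J": log_score(0.25, probs_j, words),
}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would win with the plain product.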

Calling the functions:

dataMat=loadDataSetx('TextData.txt')
allType_pinlv,attr_pinlv=train(dataMat)
print(allType_pinlv)
print(attr_pinlv)
ceshi('ceshiData.txt',allType_pinlv,attr_pinlv)