您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

使用Python相關技術實現對一本中文小說（自選）進行詞頻分析，字數不低於10萬字，顯示小說中出現率前50的中文詞組，並用圖表展示。

編輯：Python

將此題分為兩個步驟：

找出高頻詞匯

首先我們需要使用python中的jieba庫；目前最好的 Python 中文分詞組件，它主要有以下 3 種特性：

支持 3 種分詞模式：精確模式、全模式、搜索引擎模式
支持繁體分詞
支持自定義詞典

具體案例：https://www.jianshu.com/p/883c2171cdb5
安裝：
使用管理員身份打開CMD：輸入pip install jieba下載成功後打開pyCharm->File->Settings->Project Interpreter，如果Package中沒有jieba，點擊右邊的“+”號添加即可。

讀取文本文件，我們在第十章學到過：with open(file) as name：** = name.read()，這裡同樣使用。不過我們需要加上編碼格式encoding=“utf-8”。

with open("D:\\Python\\projects\\python_01\\files\\JourneytotheWest.txt", encoding="gb18030") as file:
contents = file.read()

接著我們使用jieba中的方法lcut對文本進行精確分詞：導入jieba庫

import jieba
# 使用jieba中的方法lcut對文本進行精確分詞
words = jieba.lcut(contents)

再自定義一個存儲詞語及其出現次數的容器，最後遍歷。需要注意的是我們要排除單個字。

# 存儲詞語及其出現的次數
counts = {
}
for word in words:
# 單個字排除
if len(word) == 1:
continue
else:
counts[word] = counts.get(word, 0) + 1

將鍵值對轉換成list列表，且按照降序的排序順序。

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

items.sort(key=lambda x: x[1], reverse=True)這個語句中的sort函數用於對原列表就行排序，如果指定參數，則使用比較函數指定的比較函數。reverse=True是降序的意思，反之False為升序。這裡的難點是key=lambda x: x[1]，這個lambda是一個隱函數，後面的x: x可以自定義兩個一樣的字母，[0]按照第一維排序，[1]按照第二維排序，[2]按照第三維排序。我們這裡排序是根據該詞語出現的次數進行排序，我們輸出的結果格式是（word，count），count就是次數。

打印輸出詞頻為前50的詞語：

for i in range(50):
word, count = items[i]
txt = "{0:<5}{1:>5}".format(word, count)
print(txt)

txt = ("{0:<5}{1:>5}".format(word, count))這個是format方法的格式控制。比如："{0}{1}".format(name,jack)，這裡大括號裡的數字表示的是位置，也就是0對應的name,1對應的jack。同理，題中0對應的是word，1對應的是count。其次，冒號是引導符，後面跟的是格式控制方法。<表示左對齊，>表示右對齊，數字表示寬度。同理，題中<10表示左對齊，並占10個位置，>5表示右對齊，占5個位置。運行後報錯：UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xa1 in position 0——這是Python 編碼中編碼解碼的問題，我這個錯誤就是‘utf-8’不能解碼位置0的那個字節（0xa1），也就是這個字節超出了utf-8的表示范圍了，我們這裡將前面打開文本文件語句中的encoding="utf-8"換成encoding="gb18030"即可。
結果：

最後就是繪圖。我們結合第十七章練習完成。練習中是從一個json文件中獲取，那麼我們可不可以直接使用上一步生成的數據來生成圖表呢？練習中我們從json文件中通過遍歷取的的值作為圖表的參數，這裡我們直接省去了json文件遍歷，直接使用數據作為圖表的參數。

定義儲存詞匯和次數的空數組，並且在將具體的數據append到空數組中：

names, dicts = [], []
for i in range(50):
word, count = items[i]
txt = "{0:<5}{1:>5}".format(word, count)
# print(txt)
names.append(word)
dicts.append(count)

可視化

my_style = LS('#333366', base_style=LCS)
chart = pygal.Bar(style=my_style, x_label_rotation=45, show_legend=False)
chart.title = '《西游記》文章中出現率前50的中文詞組（2字）'
chart.x_labels = names
chart.add('', dicts)

最後在生成圖表（svg）：

chart.render_to_file('111.svg')

全部代碼：

import jieba
import pygal
from pygal.style import LightColorizedStyle as LCS, LightenStyle as LS
with open("D:\\Python\\projects\\python_01\\files\\JourneytotheWest.txt", encoding="gb18030") as file:
contents = file.read()
# 使用jieba中的方法lcut對文本進行精確分詞
words = jieba.lcut(contents)
# 存儲詞語及其出現的次數
counts = {
}
for word in words:
# 單個字排除
if len(word) == 1:
continue
else:
counts[word] = counts.get(word, 0) + 1
# 將鍵值對轉換成list列表
items = list(counts.items())
# reverse=True降序
items.sort(key=lambda x: x[1], reverse=True)
names, dicts = [], []
for i in range(50):
word, count = items[i]
txt = "{0:<5}{1:>5}".format(word, count)
# print(txt)
names.append(word)
dicts.append(count)
# 可視化
my_style = LS('#333366', base_style=LCS)
chart = pygal.Bar(style=my_style, x_label_rotation=45, show_legend=False)
chart.title = '《西游記》文章中出現率前50的中文詞組（2字）'
chart.x_labels = names
chart.add('', dicts)
chart.render_to_file('111.svg')

PS:上文提到的”第十章“”第十七章“均來自圖書《Python編程：從入門到實踐》。
西游記.txt獲取：https://gitee.com/desiy/python_01/blob/master/files/JourneytotheWest.txt