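This post introduces a few simple keyword-extraction methods implemented in Python 3. So far it covers only the simpler approaches; I will keep updating it as I learn about more advanced ones.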
extracted_sentences="隨著企業持續產生的商品銷量,其數據對於自身營銷規劃、市場分析、物流規劃都有重要意義。但是銷量預測的影響因素繁多,傳統的基於統計的計量模型,比如時間序列模型等由於對現實的假設情況過多,導致預測結果較差。因此需要更加優秀的智能AI算法,以提高預測的准確性,從而助力企業降低庫存成本、縮短交貨周期、提高企業抗風險能力。"
import jieba.analyse
print(jieba.analyse.extract_tags(extracted_sentences, topK=20, withWeight=False, allowPOS=()))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.457 seconds.
Prefix dict has been built successfully.
['預測', '模型', '銷量', '降低庫存', '企業', 'AI', '規劃', '提高', '准確性', '助力', '交貨', '算法', '計量', '序列', '較差', '繁多', '過多', '假設', '縮短', '營銷']
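Function parameters:
topK: the number of keywords with the highest TF-IDF weights to return (default: 20)
withWeight: whether to return each keyword's weight along with the keyword (default: False)
allowPOS: keep only words with the specified part-of-speech tags (default: empty, meaning no filtering)
The inverse document frequency (IDF) corpus used for keyword extraction can be replaced with a custom corpus by giving its path: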
Usage: jieba.analyse.set_idf_path(file_name)
# file_name is the path to the custom corpus
Sample custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
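To make the role of the IDF corpus concrete, here is a minimal pure-Python sketch of TF-IDF scoring. It is not jieba's implementation: the helper names `load_idf` and `tfidf_keywords` are hypothetical, the IDF values are made up for illustration, and the median fallback for words missing from the table is an assumption. The only thing taken from the sample corpus linked above is the file layout, one `word idf_value` pair per line.

```python
from collections import Counter
from statistics import median


def load_idf(lines):
    """Parse 'word idf_value' lines into a dict (hypothetical helper)."""
    idf = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, value = line.rsplit(" ", 1)
        idf[word] = float(value)
    return idf


def tfidf_keywords(tokens, idf, top_k=20):
    """Rank already-segmented tokens by term frequency times IDF."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Assumption: words missing from the IDF table get the median IDF.
    fallback = median(idf.values())
    scores = {
        word: (count / total) * idf.get(word, fallback)
        for word, count in counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up IDF values, for illustration only.
idf_lines = ["預測 6.0", "模型 5.0", "的 1.0"]
tokens = ["預測", "的", "模型", "的", "預測", "的"]
ranked = tfidf_keywords(tokens, load_idf(idf_lines), top_k=2)
print(ranked)
```

A frequent but uninformative word such as 的 ends up with a low score because its IDF is small, which is why swapping in a domain-specific IDF corpus can change the ranking.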
The stop-word corpus used for keyword extraction can likewise be replaced with a custom corpus by giving its path:
Usage: jieba.analyse.set_stop_words(file_name)
# file_name is the path to the custom corpus
Sample custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
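The effect of a stop-word list can be sketched without jieba. The snippet below assumes the file layout of the sample corpus above (one stop word per line); `load_stop_words` and `drop_stop_words` are hypothetical helper names, and the three stop words are chosen only for illustration.

```python
def load_stop_words(lines):
    """Collect one stop word per non-empty line into a set (hypothetical helper)."""
    return {line.strip() for line in lines if line.strip()}


def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Illustrative stop words; a real list would be read from a file.
stop_lines = ["的", "其", "但是\n"]
tokens = ["但是", "銷量", "預測", "的", "影響", "因素", "繁多"]
kept = drop_stop_words(tokens, load_stop_words(stop_lines))
print(kept)
```

Filtering happens before scoring, so a stop word can never appear among the extracted keywords regardless of how often it occurs.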
extracted_sentences="隨著企業持續產生的商品銷量,其數據對於自身營銷規劃、市場分析、物流規劃都有重要意義。但是銷量預測的影響因素繁多,傳統的基於統計的計量模型,比如時間序列模型等由於對現實的假設情況過多,導致預測結果較差。因此需要更加優秀的智能AI算法,以提高預測的准確性,從而助力企業降低庫存成本、縮短交貨周期、提高企業抗風險能力。"
import jieba.analyse
print(jieba.analyse.textrank(extracted_sentences, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.451 seconds.
Prefix dict has been built successfully.
['企業', '預測', '模型', '規劃', '提高', '銷量', '比如', '時間', '市場', '分析', '降低庫存', '成本', '縮短', '交貨', '影響', '因素', '情況', '計量', '現實', '數據']
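The parameters are the same as those of extract_tags above, except that allowPOS has a different default value: ('ns', 'n', 'vn', 'v') instead of empty.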
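TextRank slides a fixed-size window over the text (5 words by default, adjustable through the span attribute), treats each word as a node and each co-occurrence of two words within the window as an edge, and builds an undirected weighted graph.
It then computes a score for every node in the graph, in a way similar to PageRank.
For a deeper look at how PageRank is computed and why it works, see my earlier post: cs224w (Machine Learning with Graphs) Winter 2021 course notes 4, Link Analysis: PageRank (Graph as Matrix), on my CSDN blog.
The resulting values (returned when withWeight=True) are the importance scores of the corresponding words.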
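The procedure just described can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than jieba's code: the function name `textrank_scores` is hypothetical, the damping factor 0.85 and the 10 iterations are conventional choices, scores are updated synchronously, and pairs of identical words are skipped to avoid self-loops.

```python
from collections import defaultdict


def textrank_scores(tokens, span=5, damping=0.85, iterations=10):
    """Score words with a PageRank-style iteration on a co-occurrence graph."""
    # Each co-occurrence inside the window adds 1 to an undirected edge weight.
    weights = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + span, len(tokens))):
            if tokens[j] != word:  # assumption: no self-loops
                weights[tuple(sorted((word, tokens[j])))] += 1.0

    graph = defaultdict(dict)
    for (a, b), weight in weights.items():
        graph[a][b] = weight
        graph[b][a] = weight
    if not graph:
        return {}

    scores = {node: 1.0 / len(graph) for node in graph}
    out_sum = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(iterations):
        # Each neighbour passes on its score in proportion to the edge weight.
        scores = {
            node: (1 - damping) + damping * sum(
                weight / out_sum[nbr] * scores[nbr]
                for nbr, weight in nbrs.items()
            )
            for node, nbrs in graph.items()
        }
    return scores


# span=2 links only adjacent words, which keeps this toy graph easy to follow.
tokens = ["企業", "預測", "企業", "模型", "企業", "銷量"]
scores = textrank_scores(tokens, span=2)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 4))
```

In this toy input 企業 is adjacent to every other word, so it collects score from all of them and ranks first, which mirrors why a word that co-occurs with many others rises to the top of the TextRank output.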
extracted_sentences="隨著企業持續產生的商品銷量,其數據對於自身營銷規劃、市場分析、物流規劃都有重要意義。但是銷量預測的影響因素繁多,傳統的基於統計的計量模型,比如時間序列模型等由於對現實的假設情況過多,導致預測結果較差。因此需要更加優秀的智能AI算法,以提高預測的准確性,從而助力企業降低庫存成本、縮短交貨周期、提高企業抗風險能力。"
from LAC import LAC
lac=LAC(mode='rank')
seg_result=lac.run(extracted_sentences)  # takes a Unicode string as input
print(seg_result)
Output:
W0625 20:13:22.369424 33363 init.cc:157] AVX is available, Please re-compile on local machine
W0625 20:13:22.455566 33363 analysis_predictor.cc:518] - GLOG's LOG(INFO) is disabled.
W0625 20:13:22.455617 33363 init.cc:157] AVX is available, Please re-compile on local machine
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [attention_lstm_fuse_pass]
--- Running IR pass [seqconv_eltadd_relu_fuse_pass]
--- Running IR pass [seqpool_cvm_concat_fuse_pass]
--- Running IR pass [fc_lstm_fuse_pass]
--- Running IR pass [mul_lstm_fuse_pass]
--- Running IR pass [fc_gru_fuse_pass]
--- Running IR pass [mul_gru_fuse_pass]
--- Running IR pass [seq_concat_fc_fuse_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [repeated_fc_relu_fuse_pass]
--- Running IR pass [squared_mat_sub_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
W0625 20:13:22.561131 33363 analysis_predictor.cc:518] - GLOG's LOG(INFO) is disabled.
W0625 20:13:22.561169 33363 init.cc:157] AVX is available, Please re-compile on local machine
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [attention_lstm_fuse_pass]
--- Running IR pass [seqconv_eltadd_relu_fuse_pass]
--- Running IR pass [seqpool_cvm_concat_fuse_pass]
--- Running IR pass [fc_lstm_fuse_pass]
--- Running IR pass [mul_lstm_fuse_pass]
--- Running IR pass [fc_gru_fuse_pass]
--- Running IR pass [mul_gru_fuse_pass]
--- Running IR pass [seq_concat_fc_fuse_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [repeated_fc_relu_fuse_pass]
--- Running IR pass [squared_mat_sub_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
[['隨著', '企業', '持續', '產生', '的', '商品', '銷量', ',', '其', '數據', '對於', '自身', '營銷', '規劃', '、', '市場分析', '、', '物流', '規劃', '都', '有', '重要', '意義', '。', '但是', '銷量', '預測', '的', '影響', '因素', '繁多', ',', '傳統', '的', '基於', '統計', '的', '計量', '模型', ',', '比如', '時間', '序列', '模型', '等', '由於', '對', '現實', '的', '假設', '情況', '過多', ',', '導致', '預測', '結果', '較差', '。', '因此', '需要', '更加', '優秀', '的', '智能', 'AI算法', ',', '以', '提高', '預測', '的', '准確性', ',', '從而', '助力', '企業', '降低', '庫存', '成本', '、', '縮短', '交貨', '周期', '、', '提高', '企業', '抗', '風險', '能力', '。'], ['p', 'n', 'vd', 'v', 'u', 'n', 'n', 'w', 'r', 'n', 'p', 'r', 'vn', 'n', 'w', 'n', 'w', 'n', 'n', 'd', 'v', 'a', 'n', 'w', 'c', 'n', 'vn', 'u', 'vn', 'n', 'a', 'w', 'a', 'u', 'p', 'v', 'u', 'vn', 'n', 'w', 'v', 'n', 'n', 'n', 'u', 'p', 'p', 'n', 'u', 'vn', 'n', 'a', 'w', 'v', 'vn', 'n', 'a', 'w', 'c', 'v', 'd', 'a', 'u', 'n', 'nz', 'w', 'p', 'v', 'vn', 'u', 'n', 'w', 'c', 'v', 'n', 'v', 'n', 'n', 'w', 'v', 'vn', 'n', 'w', 'v', 'n', 'v', 'n', 'n', 'w'], [0, 1, 1, 1, 0, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 1, 2, 0, 2, 0, 0, 2, 0, 2, 1, 0, 1, 2, 2, 1, 0, 0, 0, 2, 0, 2, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0]]
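The result is three parallel lists: the segmented words, their part-of-speech tags, and an integer importance level per word. As far as I understand LAC's rank mode, a higher level marks a more important word, so one way to turn this into keywords is to keep the words at or above a chosen level. The sketch below works on the first eight entries of the output above; `words_at_or_above` is a hypothetical helper, not part of LAC.

```python
# First eight entries of each list, copied from the output above.
words = ['隨著', '企業', '持續', '產生', '的', '商品', '銷量', ',']
tags = ['p', 'n', 'vd', 'v', 'u', 'n', 'n', 'w']
ranks = [0, 1, 1, 1, 0, 2, 2, 0]
seg_result = [words, tags, ranks]


def words_at_or_above(result, min_rank):
    """Return unique (word, tag, rank) triples whose rank is >= min_rank."""
    result_words, result_tags, result_ranks = result
    seen = set()
    kept = []
    for word, tag, rank in zip(result_words, result_tags, result_ranks):
        if rank >= min_rank and word not in seen:
            seen.add(word)
            kept.append((word, tag, rank))
    return kept


top = words_at_or_above(seg_result, min_rank=2)
print(top)
```

Punctuation and function words such as 的 sit at level 0 in this output, so even a threshold of 1 already removes them.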