
Common keyword extraction methods in Python 3


The gods were silent - personal CSDN Blog Directory

This article introduces some simple keyword extraction algorithms that can be used from Python 3. For now it covers only a few basic methods. As I learn more, and more cutting-edge, algorithms, the article will continue to be updated.

Contents

1. TF-IDF-based Chinese keyword extraction: implemented with the jieba package
2. TextRank-based Chinese keyword extraction: implemented with the jieba package
3. Chinese word importance (underlying algorithm not stated): implemented with LAC

1. TF-IDF-based Chinese keyword extraction: implemented with the jieba package

import jieba.analyse

# Sample text (the original post used the Chinese version of this paragraph).
extracted_sentences = "As enterprises continuously generate product sales data, that data is of great significance to their own marketing planning, market analysis, and logistics planning. However, sales forecasts are influenced by many factors, and traditional statistics-based econometric models, such as time series models, make too many assumptions about reality, which leads to poor forecasting results. Better intelligent AI algorithms are therefore needed to improve forecast accuracy, helping enterprises reduce inventory costs, shorten delivery cycles, and improve their ability to withstand risk."
# Return the 20 keywords with the highest TF-IDF weights, without weights and without part-of-speech filtering.
print(jieba.analyse.extract_tags(extracted_sentences, topK=20, withWeight=False, allowPOS=()))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.457 seconds.
Prefix dict has been built successfully.
['forecast', 'Model', 'sales', 'Reduce inventory', 'Enterprises', 'AI', 'planning', 'Improve', 'accuracy', 'help', 'delivery', 'Algorithm', 'metering', 'Sequence', 'Poor', 'various', 'Too much', 'hypothesis', 'To shorten the', 'marketing']

Function parameters:

topK: the number of keywords with the highest TF-IDF weights to return (default: 20)
withWeight: whether to return each keyword's weight along with the keyword (default: False)
allowPOS: include only words with the specified parts of speech (default: empty, meaning no filtering)
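
To make the roles of topK and withWeight concrete, here is a minimal pure-Python sketch of TF-IDF ranking. It is not jieba's implementation: the function name toy_extract_tags, the pre-segmented word list, and the IDF values are all made up for illustration, and part-of-speech filtering (allowPOS) is left out.

```python
from collections import Counter


def toy_extract_tags(words, idf, top_k=20, with_weight=False):
    """Rank pre-segmented words by term frequency * IDF (illustrative only)."""
    counts = Counter(words)
    total = sum(counts.values())
    # Assumption: every word has an entry in the hypothetical idf table.
    scores = {word: (count / total) * idf[word] for word, count in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # with_weight=True keeps (word, weight) pairs; otherwise only the words are returned.
    return ranked if with_weight else [word for word, _ in ranked]


words = ["forecast", "model", "sales", "forecast", "the", "model", "forecast"]
idf = {"forecast": 3.0, "model": 2.5, "sales": 2.0, "the": 0.1}  # made-up IDF values

print(toy_extract_tags(words, idf, top_k=2))
print(toy_extract_tags(words, idf, top_k=2, with_weight=True))
```

A frequent word with a low IDF (here "the") ends up at the bottom of the ranking, which is the effect the IDF term is meant to have.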

The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path:
Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path to the custom corpus
Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
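
The example corpus linked above stores one word and its IDF value per line, separated by a space. The sketch below writes a tiny file in that shape and parses it back. The helper load_idf, the three entries, and their values are hypothetical, and using the median IDF as a fallback for words missing from the file is an assumption of this sketch.

```python
import os
import tempfile


def load_idf(path):
    """Parse lines of the form '<word> <idf>' into a dict, plus a median fallback value."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, value = line.rsplit(" ", 1)
            table[word] = float(value)
    # Assumption: words missing from the file get the median IDF of the table.
    median_idf = sorted(table.values())[len(table) // 2]
    return table, median_idf


# Write a tiny hypothetical IDF file (values are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("预测 8.2\n模型 7.5\n销量 9.1\n")
    idf_path = tmp.name

idf_table, median_idf = load_idf(idf_path)
os.remove(idf_path)
print(idf_table, median_idf)
```

A file in this format is what would be passed to jieba.analyse.set_idf_path before calling extract_tags.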

The stop-word corpus used for keyword extraction can also be switched to a custom corpus path:
Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path to the custom corpus
Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
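
A stop-word file like the example above lists one word per line. As a rough illustration of what such a list does before scoring, the sketch below loads a small in-memory list and drops matching words. The helper names and the case-insensitive comparison are assumptions of this sketch, not a description of jieba's internals.

```python
import io


def load_stop_words(lines):
    """Read one stop word per line, ignoring blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def drop_stop_words(words, stop_words):
    """Keep only the words that are not in the stop-word set (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]


stop_file = io.StringIO("the\nof\nis\n")  # stands in for a real stop-word file
stop_words = load_stop_words(stop_file)

words = ["The", "accuracy", "of", "forecast", "is", "poor"]
print(drop_stop_words(words, stop_words))
```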

2. TextRank-based Chinese keyword extraction: implemented with the jieba package

import jieba.analyse

# Sample text (the original post used the Chinese version of this paragraph).
extracted_sentences = "As enterprises continuously generate product sales data, that data is of great significance to their own marketing planning, market analysis, and logistics planning. However, sales forecasts are influenced by many factors, and traditional statistics-based econometric models, such as time series models, make too many assumptions about reality, which leads to poor forecasting results. Better intelligent AI algorithms are therefore needed to improve forecast accuracy, helping enterprises reduce inventory costs, shorten delivery cycles, and improve their ability to withstand risk."
# Return the top 20 TextRank keywords, keeping only place names (ns), nouns (n), verbal nouns (vn), and verbs (v).
print(jieba.analyse.textrank(extracted_sentences, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.451 seconds.
Prefix dict has been built successfully.
['Enterprises', 'forecast', 'Model', 'planning', 'Improve', 'sales', 'such as', 'Time', 'market', 'analysis', 'Reduce inventory', 'cost', 'To shorten the', 'delivery', 'influence', 'factors', 'situation', 'metering', 'reality', 'data']

The input parameters are the same as in section 1, except that allowPOS has a different default: for textrank it defaults to ('ns', 'n', 'vn', 'v') rather than an empty tuple.

TextRank uses a fixed window size (5 by default, adjustable through the span attribute), treats words as nodes and the co-occurrence relationships between words as edges, and builds an undirected weighted graph.
It then computes a score for each node in the graph, using a method similar to PageRank.
For a more in-depth look at how PageRank is computed and why it works, see my earlier blog post: cs224w (Machine Learning with Graphs) 2021 Winter course study notes 4, Link Analysis: PageRank (Graph as Matrix), on my CSDN blog.
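
The graph construction and scoring described above can be sketched in a few lines of pure Python. This is an illustrative toy, not jieba's code: the function name toy_textrank, the damping factor of 0.85, the 10 iterations, and the sample word list are assumptions, and the input is taken to be already segmented and filtered.

```python
from collections import defaultdict


def toy_textrank(words, span=5, damping=0.85, iterations=10, top_k=5):
    """Score pre-segmented words with a PageRank-style iteration (illustrative only)."""
    # Edge weight = number of times two different words co-occur within `span` positions.
    weights = defaultdict(float)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + span, len(words))):
            if words[j] != word:
                weights[tuple(sorted((word, words[j])))] += 1.0

    # Undirected graph: store every edge in both directions.
    neighbours = defaultdict(dict)
    for (a, b), weight in weights.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    if not neighbours:
        return []

    total_weight = {node: sum(nbrs.values()) for node, nbrs in neighbours.items()}
    scores = {node: 1.0 / len(neighbours) for node in neighbours}
    for _ in range(iterations):
        # Each node collects score from its neighbours in proportion to edge weight.
        scores = {
            node: (1 - damping) + damping * sum(
                weight / total_weight[other] * scores[other]
                for other, weight in nbrs.items()
            )
            for node, nbrs in neighbours.items()
        }

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


words = ["sales", "forecast", "model", "forecast", "accuracy", "forecast", "enterprise", "inventory"]
# span=2 links only adjacent words, which keeps the toy graph easy to follow.
print(toy_textrank(words, span=2, top_k=3))
```

In this toy input "forecast" is adjacent to four different words, so it collects the most score and comes out first.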

3. Chinese word importance (underlying algorithm not stated): implemented with LAC

The last list in the output holds the importance score of each corresponding word.

from LAC import LAC

# Sample text (the original post used the Chinese version of this paragraph).
extracted_sentences = "As enterprises continuously generate product sales data, that data is of great significance to their own marketing planning, market analysis, and logistics planning. However, sales forecasts are influenced by many factors, and traditional statistics-based econometric models, such as time series models, make too many assumptions about reality, which leads to poor forecasting results. Better intelligent AI algorithms are therefore needed to improve forecast accuracy, helping enterprises reduce inventory costs, shorten delivery cycles, and improve their ability to withstand risk."
lac = LAC(mode='rank')  # load the word-importance model
seg_result = lac.run(extracted_sentences)  # run() takes a Unicode string as input
print(seg_result)

Output:

W0625 20:13:22.369424 33363 init.cc:157] AVX is available, Please re-compile on local machine
W0625 20:13:22.455566 33363 analysis_predictor.cc:518] - GLOG's LOG(INFO) is disabled.
W0625 20:13:22.455617 33363 init.cc:157] AVX is available, Please re-compile on local machine
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [attention_lstm_fuse_pass]
--- Running IR pass [seqconv_eltadd_relu_fuse_pass]
--- Running IR pass [seqpool_cvm_concat_fuse_pass]
--- Running IR pass [fc_lstm_fuse_pass]
--- Running IR pass [mul_lstm_fuse_pass]
--- Running IR pass [fc_gru_fuse_pass]
--- Running IR pass [mul_gru_fuse_pass]
--- Running IR pass [seq_concat_fc_fuse_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [repeated_fc_relu_fuse_pass]
--- Running IR pass [squared_mat_sub_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
W0625 20:13:22.561131 33363 analysis_predictor.cc:518] - GLOG's LOG(INFO) is disabled.
W0625 20:13:22.561169 33363 init.cc:157] AVX is available, Please re-compile on local machine
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [attention_lstm_fuse_pass]
--- Running IR pass [seqconv_eltadd_relu_fuse_pass]
--- Running IR pass [seqpool_cvm_concat_fuse_pass]
--- Running IR pass [fc_lstm_fuse_pass]
--- Running IR pass [mul_lstm_fuse_pass]
--- Running IR pass [fc_gru_fuse_pass]
--- Running IR pass [mul_gru_fuse_pass]
--- Running IR pass [seq_concat_fc_fuse_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [repeated_fc_relu_fuse_pass]
--- Running IR pass [squared_mat_sub_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
[['With', 'Enterprises', 'continued', 'produce', 'Of', 'goods', 'sales', ',', 'Its', 'data', 'about', 'Oneself', 'marketing', 'planning', '、', 'Market analysis', '、', 'logistics', 'planning', 'all', 'Yes', 'important', 'significance', '.', 'however', 'sales', 'forecast', 'Of', 'influence', 'factors', 'various', ',', 'Tradition', 'Of', 'be based on', 'Statistics', 'Of', 'metering', 'Model', ',', 'such as', 'Time', 'Sequence', 'Model', 'etc.', 'because', 'Yes', 'reality', 'Of', 'hypothesis', 'situation', 'Too much', ',', 'Lead to', 'forecast', 'result', 'Poor', '.', 'therefore', 'need', 'more', 'good', 'Of', 'intelligence', 'AI Algorithm', ',', 'With', 'Improve', 'forecast', 'Of', 'accuracy', ',', 'thus', 'help', 'Enterprises', 'Reduce', 'stock', 'cost', '、', 'To shorten the', 'delivery', 'cycle', '、', 'Improve', 'Enterprises', 'resist', 'risk', 'Ability', '.'], ['p', 'n', 'vd', 'v', 'u', 'n', 'n', 'w', 'r', 'n', 'p', 'r', 'vn', 'n', 'w', 'n', 'w', 'n', 'n', 'd', 'v', 'a', 'n', 'w', 'c', 'n', 'vn', 'u', 'vn', 'n', 'a', 'w', 'a', 'u', 'p', 'v', 'u', 'vn', 'n', 'w', 'v', 'n', 'n', 'n', 'u', 'p', 'p', 'n', 'u', 'vn', 'n', 'a', 'w', 'v', 'vn', 'n', 'a', 'w', 'c', 'v', 'd', 'a', 'u', 'n', 'nz', 'w', 'p', 'v', 'vn', 'u', 'n', 'w', 'c', 'v', 'n', 'v', 'n', 'n', 'w', 'v', 'vn', 'n', 'w', 'v', 'n', 'v', 'n', 'n', 'w'], [0, 1, 1, 1, 0, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 1, 2, 0, 2, 0, 0, 2, 0, 2, 1, 0, 1, 2, 2, 1, 0, 0, 0, 2, 0, 2, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0]]
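
The result printed above is three parallel lists: the words, their part-of-speech tags, and their importance scores. One way to turn that into a keyword list is to keep the words whose score reaches a threshold. The sketch below does this on the first seven entries of the output above. The helper important_words and the threshold of 2 are assumptions chosen for illustration, not part of LAC.

```python
def important_words(rank_result, min_rank=2):
    """Return words whose importance score is at least min_rank, without duplicates."""
    words, tags, ranks = rank_result
    kept = []
    for word, _tag, rank in zip(words, tags, ranks):
        if rank >= min_rank and word not in kept:
            kept.append(word)
    return kept


# First seven entries of the output shown above (words, POS tags, importance scores).
sample = [
    ["With", "Enterprises", "continued", "produce", "Of", "goods", "sales"],
    ["p", "n", "vd", "v", "u", "n", "n"],
    [0, 1, 1, 1, 0, 2, 2],
]

print(important_words(sample))              # ['goods', 'sales']
print(important_words(sample, min_rank=1))  # ['Enterprises', 'continued', 'produce', 'goods', 'sales']
```

Function words and punctuation receive a score of 0 in the output above, so any threshold of 1 or more removes them.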