Jieba is an excellent third-party Chinese word-segmentation library; Chinese text must be segmented before individual words can be extracted from it.
How jieba segments text: it uses a Chinese dictionary to determine the association probability between adjacent Chinese characters, and character sequences with high probability are grouped into words to form the segmentation result. Beyond the built-in dictionary, users can also add custom phrases.
jieba is installed with pip; at the console, enter:
pip install jieba
The jieba library supports three segmentation modes.
Precise mode: cuts the text into exact words with no redundancy; joining the resulting words in order restores the original text exactly.
jieba.lcut(s)  # precise mode
Full mode: scans out every possible word in the text. Because a passage can often be cut in more than one way, or the same characters can be grouped into different words from different angles, full mode digs out all the combinations. The result is therefore redundant, and joining it no longer yields the original text.
jieba.lcut(s, cut_all=True)  # full mode
Search-engine mode: on the basis of precise mode, long words are segmented again, which makes the result suitable for a search engine's indexing and searching of short terms. The result also contains redundancy.
jieba.lcut_for_search(s)  # search-engine mode
Common jieba functions: focus on what type each takes as input (a string? a list?) and what type it returns (a string? a list?).
Adding a user dictionary: register words, confirmed by the user, that should not be split apart.
jieba.load_userdict('user.txt')
Adding a stop-word list: remove words the user does not want included in the statistics.
def stopwordslist():  # build the stop-word list
    with open('stop_words.txt', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords  # return the stop-word list

stopwords = stopwordslist()  # load the stop-word list