
Basics of Chinese Word Segmentation in Python: the jieba Library (Basic Knowledge + Examples)


jieba (Chinese Word Segmentation)

Catalog

  • jieba (Chinese word segmentation)
    • What is the jieba library
      • Installing and importing jieba
    • Using the jieba library
      • 1) Precise mode
      • 2) Full mode
      • 3) Search engine mode
      • 4) Common jieba functions

———————————————————————————————————————————————————————————————

What is the jieba library

jieba is an excellent third-party Chinese word segmentation library. Chinese text has no spaces between words, so individual words must be extracted through segmentation.
How jieba works: using a Chinese vocabulary, it determines the association probability between Chinese characters; character sequences with high probability are joined into words to form the segmentation result. Besides dictionary-based segmentation, users can also add custom phrases.

Installing and importing jieba

The jieba library is installed with pip:

# Install with pip: enter the following at the console
pip install jieba

Using the jieba library

jieba provides three segmentation modes:

1) Precise mode:

Precise mode cuts a passage of text into words exactly, with no overlap: joining the resulting words reproduces the original text, and there are no redundant words.

jieba.lcut(s)  # precise mode

2) Full mode:

Full mode scans out all possible words in a text. A passage can often be cut in several different ways, or into different words from different angles, and in full mode jieba digs out all of these combinations. The result therefore contains redundancy: joining the words no longer reproduces the original text.

jieba.lcut(s, cut_all=True)  # full mode

3) Search engine mode:

Search engine mode builds on precise mode: long words found there are segmented again into short words, which suits the indexing and searching done by search engines. This result also contains redundancy.

jieba.lcut_for_search(s)  # search engine mode

4) Common jieba functions:

When using jieba's functions, pay attention to the input and output types (string? list?): the `cut` family returns generators, while the `lcut` family returns lists.
Adding a user dictionary: register words that the user does not want split apart.

jieba.load_userdict("user.txt")

Adding a stop-word list: remove words that the user does not want to include in the statistics.

def stopwordslist():  # build the stop-word list
    with open('stop_words.txt', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords  # return the stop-word list

stopwords = stopwordslist()  # load the stop-word list
