1. Set up the flashtext environment
2. Add keywords
3. Extract keywords
4. Replace keywords
5. Get all keywords
6. Add keywords in batch
7. Delete keywords in batch
8. Comparison of execution efficiency
For ordinary small-scale data filtering and cleaning, regular expressions are the most commonly used tool. As the data scale increases, however, regular expressions start to struggle.
Finding 15k keywords in a 10k-term corpus takes a regular expression about 0.165 seconds, while FlashText needs only about 0.002 seconds. On this task, FlashText is therefore roughly 82 times faster than regular expressions.
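For reference, the regular-expression approach in such a comparison is typically a single alternation pattern built from the whole keyword list. The sketch below is an assumption about that baseline, since the article does not show its benchmark code; the keyword list and the helper name `build_keyword_pattern` are made up for illustration.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive alternation matching any keyword as a whole word."""
    # Longest first, so a longer keyword wins over a shorter one that is its prefix.
    ordered = sorted(set(keywords), key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

pattern = build_keyword_pattern(["Python", "Scala", "Go"])
matches = pattern.findall("I like Python and Scala.")
print(matches)  # ['Python', 'Scala']
```

Every extra keyword lengthens the alternation, so the engine may have to try more alternatives at each position in the text.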
The performance comparison charts show that, as more and more characters need to be processed, the processing time of regular expressions grows almost linearly, whereas FlashText's stays almost constant.
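The near-constant behaviour follows from how FlashText stores its keywords: in a trie (a character-by-character tree), so the text is scanned once no matter how many keywords are loaded. The sketch below is a simplified illustration of that idea, not FlashText's actual code; `build_trie`, `find_keywords`, and the `_END` marker are hypothetical names, and word boundaries are reduced to a simple "letter, digit, or underscore" check.

```python
_END = "_end_"  # hypothetical marker key holding the display name of a complete keyword

def build_trie(mapping):
    """Build a nested-dict trie from {keyword: display_name}, lowercasing keywords."""
    root = {}
    for keyword, name in mapping.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = name
    return root

def _is_word_char(ch):
    return ch.isalnum() or ch == "_"

def find_keywords(text, trie):
    """Scan text once, returning display names of keywords found at word boundaries."""
    found = []
    lowered = text.lower()
    i, n = 0, len(lowered)
    while i < n:
        # Only start matching at the beginning of a word.
        if i > 0 and _is_word_char(lowered[i - 1]):
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Record a match only if the keyword ends at a word boundary.
            if _END in node and (j == n or not _is_word_char(lowered[j])):
                best = (node[_END], j)
        if best:
            found.append(best[0])
            i = best[1]
        else:
            i += 1
    return found

trie = build_trie({"Python": "Python", "Scala": "Java"})
print(find_keywords("I like Python and Scala.", trie))  # ['Python', 'Java']
```

Each step is one dictionary lookup on the current character, so the cost depends on the length of the text rather than on the number of keywords stored.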
1. Set up the flashtext environment
Install flashtext with pip (other installation methods work too). The command below uses the Tsinghua University PyPI mirror.
pip install flashtext -i https://pypi.tuna.tsinghua.edu.cn/simple
With the flashtext environment ready, let's walk through its main operations, which help us carry out data cleaning more effectively.
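Before moving on, it can be worth confirming that the package is actually importable in the current environment. A minimal check using only the standard library might look like this; `is_installed` is a hypothetical helper name, not part of flashtext.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None

flashtext_ready = is_installed("flashtext")
print("flashtext installed:", flashtext_ready)
```

If this prints `False`, the pip command above most likely ran against a different Python interpreter than the one in use.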
2. Add keywords
Here a single keyword is added to the keyword processor with the add_keyword function. The first parameter is the keyword to add; the second, optional parameter is an alias for it. When the keyword is found, it is reported under its alias; if no alias is given, the keyword's original name is reported.
from flashtext import KeywordProcessor

# Initialize the keyword processor
processor = KeywordProcessor()

# Add a keyword the normal way
processor.add_keyword('Python')

# Add a keyword with an alias
processor.add_keyword('Scala', 'Java')
The required keywords have now been added to the keyword processor in both ways.
3. Extract keywords
The keywords added in the previous step now exist in the keyword processor, so extract_keywords can be used to pull them out of a piece of text.
# Extract keywords from a string
found = processor.extract_keywords('I like Python and Scala.')

# Result
print(found)
# ['Python', 'Java']
The result is as expected: Scala is reported as Java.
4. Replace keywords
Keywords are replaced with the replace_keywords function. A keyword is replaced with its alias, just as Scala was displayed as Java above.
The example below replaces the keyword Scala in a string. Because Scala's alias is Java, Scala in the string should become Java.
replaced = processor.replace_keywords('I like Scala.')

# Result
print(replaced)
# I like Java.
# Scala, if present, is replaced with Java.
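For comparison, the same alias replacement done with the standard `re` module might look like the sketch below. It is an illustrative equivalent under assumed names (`aliases`, `replace_with_aliases`), not a description of what FlashText does internally.

```python
import re

# Hypothetical keyword -> alias table; keys are lowercase for case-insensitive lookup.
aliases = {"scala": "Java"}

def replace_with_aliases(text, aliases):
    """Replace each whole-word keyword (case-insensitive) with its alias."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in aliases) + r")\b",
        re.IGNORECASE,
    )
    # Look up the alias by the lowercased matched text.
    return pattern.sub(lambda m: aliases[m.group(0).lower()], text)

replaced = replace_with_aliases("I like Scala.", aliases)
print(replaced)  # I like Java.
```

The pattern has to be rebuilt whenever the alias table changes, which is part of what makes the regular-expression route harder to maintain for large keyword lists.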
5. Get all keywords
Sometimes you may not remember which keywords have been added to a KeywordProcessor. In that case, use the get_all_keywords function to retrieve all current keywords.
all_keywords = processor.get_all_keywords()

# Result
print(all_keywords)
# {'python': 'Python', 'scala': 'Java'}
6. Add keywords in batch
When many keywords need to be added, they can be added in batches from a list or a dictionary, using the add_keywords_from_list and add_keywords_from_dict functions respectively.
# Initialize a dictionary for batch addition
dict_ = {
    'java': ['java_ee', 'java_se', 'java_me'],
    'python': ['pandas', 'all']
}

# Add keywords in batch from the dictionary
processor.add_keywords_from_dict(dict_)

# Match against the batch-added keywords
result = processor.extract_keywords('looking for java_ee and pandas.')

# Result
print(result)
# ['java', 'python']

# Add keywords in batch from a list
processor.add_keywords_from_list(['scala', 'python', 'scala', 'go'])

# Use get_all_keywords to view all keywords
all_keywords = processor.get_all_keywords()

# Result
print(all_keywords)
# {'python': 'python', 'pandas': 'python', 'scala': 'scala', 'java_ee': 'java', 'java_se': 'java', 'java_me': 'java', 'all': 'python', 'go': 'go'}
All the keywords have been added to the keyword processor, and duplicates are not added a second time.
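The dictionary passed to add_keywords_from_dict maps each displayed name to the list of keywords that resolve to it, which is the reverse of the mapping that get_all_keywords returns. A plain-Python model of that flattening, in which a later addition overwrites an earlier one, could look like this; `flatten_keyword_dict` is a hypothetical helper for illustration and is not part of flashtext.

```python
def flatten_keyword_dict(keyword_dict, existing=None):
    """Turn {display_name: [keywords]} into {lowercased_keyword: display_name}."""
    flat = dict(existing or {})
    for display_name, keywords in keyword_dict.items():
        for keyword in keywords:
            # A keyword added twice simply overwrites its earlier entry.
            flat[keyword.lower()] = display_name
    return flat

flat = flatten_keyword_dict({
    "java": ["java_ee", "java_se", "java_me"],
    "python": ["pandas", "all"],
})

# Adding from a list is the special case where each keyword is its own display name.
for keyword in ["scala", "python", "scala", "go"]:
    flat[keyword.lower()] = keyword

print(flat)
```

Because the result is keyed by keyword, the repeated 'scala' collapses into a single entry, matching the behaviour observed above.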
7. Delete keywords in batch
Keywords can likewise be removed from the keyword processor in batches in two ways, from a list or from a dictionary, using the remove_keywords_from_list and remove_keywords_from_dict functions respectively.
# Remove keywords in batch using a list
processor.remove_keywords_from_list(['python', 'java_ee', 'java_me'])

# Remove keywords in batch using a dictionary
processor.remove_keywords_from_dict({'python': ['pandas', 'all']})

# Use get_all_keywords to view all keywords
all_keywords = processor.get_all_keywords()

# Result
print(all_keywords)
# {'scala': 'scala', 'java_se': 'java', 'go': 'go'}
All the keywords that were meant to be removed are gone.
8. Comparison of execution efficiency
To make the difference easier to see, here are two charts comparing FlashText and regular expressions, one for searching keywords and one for replacing them.
Figure: FlashText vs. regular expressions, search efficiency comparison
Figure: FlashText vs. regular expressions, search-and-replace efficiency comparison
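To get a feel for the regular-expression side of this comparison on your own machine, a small harness like the following can time a search as the keyword list grows. Everything in it (the random vocabulary, the sizes, the `time_regex_search` name) is an assumption chosen for illustration, and the numbers it prints will not match the figures quoted in this article.

```python
import random
import re
import string
import timeit

def random_word(rng, length=6):
    """Return a random lowercase word of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def time_regex_search(keyword_count, text, vocabulary, repeats=3):
    """Return the best-of-N time in seconds to search text for keyword_count keywords."""
    keywords = vocabulary[:keyword_count]
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")
    return min(timeit.repeat(lambda: pattern.findall(text), number=1, repeat=repeats))

rng = random.Random(0)  # fixed seed so runs are comparable
vocabulary = [random_word(rng) for _ in range(1000)]
text = " ".join(rng.choice(vocabulary) for _ in range(2000))

timings = {n: time_regex_search(n, text, vocabulary) for n in (10, 100, 1000)}
for n, seconds in timings.items():
    print(f"{n:>5} keywords: {seconds:.4f} s")
```

Replacing the lambda with a call to a KeywordProcessor's extract_keywords on the same text would time the FlashText side in the same way.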