Source: Python Practical Treasure
When working on NLP (natural language processing) tasks, we often need to identify and extract provinces, cities, and districts from text. We could do this by matching the text against a keyword table entry by entry, but that requires collecting a list of province, city, and district names first, which is tedious.
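To make the keyword-table approach concrete, here is a minimal sketch of it. The table KEYWORDS and the function extract_regions are hypothetical names made up for this illustration, and the table holds only a handful of entries; a usable one would have to list every province, city, and district, which is exactly the tedious part.

```python
# Hypothetical, hand-made keyword table (illustration only).
# A real table would need every province, city and district name.
KEYWORDS = {
    "province": ["Guangdong", "Sichuan"],
    "city": ["Shenzhen", "Guangzhou", "Deyang"],
    "district": ["Futian", "Guanghan"],
}


def extract_regions(text):
    """Return, for each level, the table entries that occur in text."""
    return {
        level: [name for name in names if name in text]
        for level, names in KEYWORDS.items()
    }


result = extract_regions("Shennan Middle Road, Futian District, Shenzhen, Guangdong")
print(result)
```

Note that a lookup like this can only report names that literally appear in the text, and it knows nothing about which city a district belongs to.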
Today I will introduce a module that does this for you: pass it a string and it returns the province, city, and district found in it, and it can even plot them on a map. The module is cpca.
1. Preparation
Before you start, make sure Python and pip are installed on your computer. If they are not, see this article for installation: Super Detailed Python Installation Guide.
(Optional 1) If you use Python for data analysis, you can install Anaconda directly, which ships with Python and pip: Anaconda, a Good Helper for Python Data Analysis and Mining.
(Optional 2) The VSCode editor is also recommended for its many advantages: VSCode, the Best Partner for Python Programming (detailed guide).
Choose one of the following ways to enter the command that installs the dependency:
1. Windows: open Cmd (Start - Run - CMD).
2. macOS: open Terminal (Command + Space, then type Terminal).
3. If you use the VSCode editor or PyCharm, you can use its built-in Terminal directly.
pip install cpca
Note that cpca currently supports only Python 3 and above.
On Windows, the following problem may occur:
Building wheel for pyahocorasick (setup.py) ... error
Download and install Microsoft Visual C++ Build Tools first (the download link is in the original article), then run pip install cpca again and the problem should be resolved.
2. Basic usage
Two lines of code are enough for basic province, city, and district extraction:
# Official account: Python Practical Treasure
# 2022/06/23
import cpca

# Note: cpca matches Chinese place names. The original inputs are Chinese
# strings; they are shown here in English translation.
location_str = [
    "1F, New Town Building, No. 1025 Shennan Middle Road, Bating Street, Futian District, Shenzhen City, Guangdong Province",
    "Tesla's Shanghai Gigafactory is Tesla's first gigafactory outside the United States, located in Shanghai, People's Republic of China.",
    "The Sanxingdui site is located on the bank of the Yazi River in Sanxingdui Town, west of Guanghan City, Sichuan Province, China; it is a Bronze Age cultural site",
]
df = cpca.transform(location_str)
print(df)
The output is as follows:
   province            city           district         address                                                                         adcode
0  Guangdong Province  Shenzhen City  Futian District  Bating Street, Shennan Middle Road No. 1025, New Town Building, 1F              440304
1  Shanghai            None           None             .                                                                               310000
2  Sichuan Province    Deyang City    Guanghan City    Yazi River bank, Sanxingdui Town, west of the city; a Bronze Age cultural site  510681
Note the third entry: cpca not only recognizes the county-level city Guanghan in the sentence, it also automatically maps it to Deyang, the prefecture-level city that administers it, which is very powerful.
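The adcode column gives a hint of why this parent lookup is possible. Chinese administrative division codes are six digits long: the first two identify the province, the first four the prefecture-level city, and all six the district or county. The sketch below only illustrates that layout; parent_codes is a hypothetical helper written for this article, and this is not necessarily how cpca resolves the parent city internally.

```python
def parent_codes(adcode):
    """Split a 6-digit administrative division code into its three levels.

    Assumes the usual layout: digits 1-2 province, digits 3-4
    prefecture-level city, digits 5-6 district or county.
    """
    code = str(adcode)
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit code, got {adcode!r}")
    return {
        "province": code[:2] + "0000",
        "city": code[:4] + "00",
        "district": code,
    }


# 510681 is the adcode returned for Guanghan City in the output above.
print(parent_codes(510681))
```

For 510681 this yields 510000 (Sichuan Province) and 510600 (Deyang City). The split is only meaningful for district-level codes; for a province-level code such as 310000 all three values are the same.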
If you want to know where in the string each province, city, or district name was found, add the parameter pos_sensitive=True:
# Official account: Python Practical Treasure
# 2022/06/23
import cpca

location_str = [
    "1F, New Town Building, No. 1025 Shennan Middle Road, Bating Street, Futian District, Shenzhen City, Guangdong Province",
    "Tesla's Shanghai Gigafactory is Tesla's first gigafactory outside the United States, located in Shanghai, People's Republic of China.",
    "The Sanxingdui site is located on the bank of the Yazi River in Sanxingdui Town, west of Guanghan City, Sichuan Province, China; it is a Bronze Age cultural site",
]
# pos_sensitive=True adds the index of each matched name to the result
df = cpca.transform(location_str, pos_sensitive=True)
print(df)
The output is as follows:
(base) G:\push\20220623>python 1.py
   province            city           district         address                                                                         adcode  province_pos  city_pos  district_pos
0  Guangdong Province  Shenzhen City  Futian District  Bating Street, Shennan Middle Road No. 1025, New Town Building, 1F              440304  0             3         6
1  Shanghai            None           None             .                                                                               310000  38            -1        -1
2  Sichuan Province    Deyang City    Guanghan City    Yazi River bank, Sanxingdui Town, west of the city; a Bronze Age cultural site  510681  9             -1        12
The output marks the position (index) in the string where each province, city, and district keyword was recognized. A name that was inferred rather than found in the text, such as Deyang City here, is marked -1.
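The -1 convention is the same one Python's own str.find uses for "not found", which makes the columns easy to read. The following sketch uses a short Chinese string written for this illustration (it is not one of the inputs above), and positions is a hypothetical helper. It only shows what the index values mean and is not how cpca searches internally.

```python
def positions(text, names):
    """Map each name to its character index in text, or -1 if absent."""
    return {name: text.find(name) for name in names}


# Illustrative string: "located in Guanghan City, Sichuan Province, China"
text = "位于中国四川省广汉市"

# 四川省 = Sichuan Province, 广汉市 = Guanghan City, 德阳市 = Deyang City
print(positions(text, ["四川省", "广汉市", "德阳市"]))
```

Deyang City never appears in the string, so its position is -1, just like the city_pos value in the third row of the output above.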
3. Advanced usage
It can also identify multiple regions in a large block of text in one pass:
# Official account: Python Practical Treasure
# 2022/06/23
import cpca

long_text = (
    "The evaluation of a city always includes personal feelings. If you like a city, it is likely that you like who you were at that time and place. "
    "I have studied and worked in Guangzhou and Hong Kong, bought a house and lived briefly in Shenzhen, and been on several business trips to Beijing. "
    "I would like to focus on Guangzhou, Shenzhen and Hong Kong, and mention Beijing in passing. In general, I feel that Guangzhou is comfortable, "
    "Hong Kong is refined, Shenzhen is young with a good atmosphere, and Beijing is grand and rugged. The answerer chose Guangzhou."
)
# Extract every region mentioned in the text, with positions
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
print(df)
The output is as follows:
(base) G:\push\20220623>python 1.py
    province                                 city            district  address  adcode  province_pos  city_pos  district_pos
0   Guangdong Province                       Guangzhou City  None               440100  -1            44        -1
1   Hong Kong Special Administrative Region  None            None               810000  47            -1        -1
2   Guangdong Province                       Shenzhen City   None               440300  -1            58        -1
3   Beijing                                  None            None               110000  71            -1        -1
4   Guangdong Province                       Guangzhou City  None               440100  -1            86        -1
5   Guangdong Province                       Shenzhen City   None               440300  -1            89        -1
6   Hong Kong Special Administrative Region  None            None               810000  92            -1        -1
7   Beijing                                  None            None               110000  100           -1        -1
8   Guangdong Province                       Guangzhou City  None               440100  -1            110       -1
9   Hong Kong Special Administrative Region  None            None               810000  115           -1        -1
10  Guangdong Province                       Shenzhen City   None               440300  -1            120       -1
11  Beijing                                  None            None               110000  128           -1        -1
12  Guangdong Province                       Guangzhou City  None               440100  -1            143       -1
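Because every mention becomes its own row, a natural next step is to count how often each place appears. The sketch below does this with the standard library only; the adcodes list and the names mapping are copied by hand from the output above, so it runs without cpca installed.

```python
from collections import Counter

# adcode column copied from the output above, in row order.
adcodes = [
    440100, 810000, 440300, 110000, 440100, 440300, 810000,
    110000, 440100, 810000, 440300, 110000, 440100,
]

# Place names taken from the same output rows.
names = {
    440100: "Guangzhou",
    810000: "Hong Kong",
    440300: "Shenzhen",
    110000: "Beijing",
}

counts = Counter(adcodes)
for code, n in counts.most_common():
    print(names[code], n)
```

Guangzhou is mentioned four times and the other three places three times each.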
In addition, the module comes with some simple drawing tools that can plot the output above as a heat map on a map:
# Official account: Python Practical Treasure
# 2022/06/23
import cpca
from cpca import drawer

long_text = (
    "The evaluation of a city always includes personal feelings. If you like a city, it is likely that you like who you were at that time and place. "
    "I have studied and worked in Guangzhou and Hong Kong, bought a house and lived briefly in Shenzhen, and been on several business trips to Beijing. "
    "I would like to focus on Guangzhou, Shenzhen and Hong Kong, and mention Beijing in passing. In general, I feel that Guangzhou is comfortable, "
    "Hong Kong is refined, Shenzhen is young with a good atmosphere, and Beijing is grand and rugged. The answerer chose Guangzhou."
)
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
# Draw the recognized adcodes on a map and save it as df.html
drawer.draw_locations(df[cpca._ADCODE], "df.html")
Running this may report the following error:
(base) G:\push\20220623>python 1.py
Traceback (most recent call last):
File "1.py", line 12, in <module>
drawer.draw_locations(df[cpca._ADCODE], "df.html")
File "G:\Anaconda3\lib\site-packages\cpca\drawer.py", line 41, in draw_locations
import folium
ModuleNotFoundError: No module named 'folium'
The missing module can be installed with pip:
pip install folium
Then rerun the code. It generates df.html in the current directory; double-click the file to open it and view the heat map.
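To avoid hitting a ModuleNotFoundError halfway through a run, one option is to check for the required modules before doing any work. This is a small sketch using only the standard library; missing_modules is a hypothetical helper name, and the pip hint assumes the package name equals the module name, which holds for cpca and folium as installed above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_modules(["cpca", "folium"])
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All dependencies are available.")
```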
How about it, isn't it convenient? From now on, this module should be enough for location identification tasks.
For more details, visit the project's GitHub home page. The project README is written entirely in Chinese, so it is easy to follow if you read Chinese:
https://github.com/DQinYuan/chinese_province_city_area_mapper