Language: Python 3.8
Scraper: Selenium
Proxy: IPIDEA
Notes: the complete code is at the end; beginners, read the suggestions slowly. The steps in this article are: 1. get the data, 2. translate it, 3. clean the data, 4. segment words and weight them, 5. draw the word cloud.
For simplicity I used Selenium to grab the data (Selenium is rookie-friendly, and I'm a rookie myself), together with an IPIDEA proxy; otherwise my IP would have been blocked long before I finished testing and debugging.
Selenium can be installed with pip:
pip install selenium
After installing Selenium you also need a driver, which must match your browser version; only Firefox and Chrome are supported.
I use Chrome here. First click the three dots in the upper-right corner of the Chrome browser:
Choose Help > About Google Chrome to open the chrome://settings/help page, then note the version number:
Next go to the driver download page: http://chromedriver.storage.googleapis.com/index.html
Find the driver closest to your version number and open it:
Then click in and download the build for your platform:
On Windows the win32 build works fine. After downloading, unzip it to a directory and you're done.
The proxy I used is IPIDEA (the official site is linked above); the free tier is enough for this.
The first step is getting the data, fetched through the proxy.
First create a Python file; I named mine test1, but use any name you like.
Then, in your editor (I use VS Code; use whatever you like), add the imports at the top:
from selenium import webdriver
import requests, json, time
Then define the request headers:
# request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'}
With that in place, the first thing to do is obtain a proxy. Let's write a function called ip_:
# proxy acquisition
def ip_():
    url = r"http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=in&port=1"
    r = requests.get(url, headers=headers, timeout=3)
    data = json.loads(r.content)
    ip_ = data['data'][0]
    return ip_
In the code above, url holds the proxy-extraction link http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=in&port=1.
Some readers may find the request fails; the usual reason is that your current IP has not been added to the whitelist.
Setting up the whitelist is simple: just put your IP at the end of the link:
https://api.ipidea.net/index/index/save_white?neek=***&appkey=***************************&white=<your whitelist IP>
You can generate this whitelist link for your own account at https://www.ipidea.net/getapi:
Mine already works, so I've blurred it out in the screenshot.
Back in ip_(), the line r = requests.get(url, headers=headers, timeout=3) fetches the proxy address, json.loads(r.content) parses the response, and the function returns the first IP entry. Obtaining the IP really is that simple.
Next, take the returned proxy and compose the proxy string:
ip = ip_()  # obtain a proxy IP
proxy_str = "http://" + str(ip['ip']) + ':' + str(ip['port'])  # combine into a proxy string
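To see the parsing and combining steps in isolation, here is a minimal sketch that fakes the API response locally. The JSON shape and the sample address are assumptions for illustration only; the real service returns live proxies:

```python
import json

# Hypothetical response body, shaped like the {'data': [{'ip': ..., 'port': ...}]}
# structure that ip_() indexes into.
sample_body = b'{"data": [{"ip": "203.0.113.7", "port": 40001}]}'

data = json.loads(sample_body)   # json.loads accepts bytes, just like r.content
ip = data['data'][0]             # first proxy entry
proxy_str = "http://" + str(ip['ip']) + ':' + str(ip['port'])
print(proxy_str)                 # → http://203.0.113.7:40001
```

The str() calls matter because the API may return the port as a number rather than a string.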
Then use webdriver options to set the proxy for Chrome:
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=%s" % proxy_str)
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
In the code above, options.add_argument adds the proxy to the browser; the last two lines just suppress certificate and SSL errors, and things generally still work if you leave them out.
Then create a variable url to store the page we need to scrape:
url='https://www.quora.com/topic/Chinese-Food?q=Chinese%20food'
Next create the Chrome browser object:
driver = webdriver.Chrome(executable_path=r'C:\webdriver\chromedriver.exe', options=options)
driver.get(url)
input()
In webdriver.Chrome, executable_path points at the driver you downloaded, and options carries the proxy configuration.
Once created, driver can be thought of as the Chrome browser itself; opening a page is just a matter of calling its get method with the URL.
Because the page loads more content automatically whenever the browser scrolls to the bottom, we just loop and keep scrolling down:
for i in range(0, 500):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)
In the loop, driver.execute_script runs a JavaScript command, and window.scrollTo(0, document.body.scrollHeight); is the scroll itself. Give the page a rest after each scroll, otherwise content may not load in time; hence the sleep of 10 s.
Then we grab the content blocks on the page:
To stay on the safe side, I've blurred the content in the screenshot.
At this point, right-click the page and choose Inspect to open the source:
We can see that the content sits under this piece of HTML:
From it we learn that this block's class name is q-box, so we can use driver's find_element_by_class_name method to locate the element and read its text.
Looking through the other content blocks confirms they all use q-box as the class name:
So all we need is this line:
content = driver.find_element_by_class_name('q-box')
which grabs the q-box element containing all of the content.
From that object, .text yields the text, and f.write saves it to a file:
f = open(r'C:\Users\Administrator\Desktop\data\data.txt', mode='w', encoding='UTF-8')
f.write(content.text)
f.close()
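Encoding problems come up again when we read this file back later, so here is a small round-trip sketch (using a temporary path instead of the desktop path, which is an assumption for portability) showing that writing and reading with an explicit encoding='UTF-8' preserves non-ASCII text:

```python
import os
import tempfile

sample = "Chinese food 中餐, soy sauce 酱油"   # mixed ASCII / non-ASCII text
path = os.path.join(tempfile.gettempdir(), "data_demo.txt")  # stand-in for data.txt

# Write with an explicit encoding, exactly as in the scraper above.
f = open(path, mode='w', encoding='UTF-8')
f.write(sample)
f.close()

# Read it back with the same encoding; the text survives intact.
f = open(path, encoding='UTF-8')
restored = f.read()
f.close()

print(restored == sample)   # → True
```

If you write with one encoding and read with another (or with the platform default), this is where garbled characters or UnicodeDecodeError come from.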
The complete code for this part is as follows :
from selenium import webdriver
import requests, json, time

# request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'}

# proxy acquisition
def ip_():
    url = r"http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=in&port=1"
    r = requests.get(url, headers=headers, timeout=3)
    data = json.loads(r.content)
    print(data)
    ip_ = data['data'][0]
    return ip_

ip = ip_()  # obtain a proxy IP
proxy_str = "http://" + str(ip['ip']) + ':' + str(ip['port'])  # combine into a proxy string

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=%s" % proxy_str)
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')

url = 'https://www.quora.com/topic/Chinese-Food?q=Chinese%20food'
driver = webdriver.Chrome(executable_path=r'C:\webdriver\chromedriver.exe', options=options)
driver.get(url)
input()

for i in range(0, 500):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

title = driver.find_element_by_class_name('q-box')
#title = driver.find_element_by_css_selector("[class='dtb-style-1 table-dragColumns']")

f = open(r'C:\Users\Administrator\Desktop\data\data.txt', mode='w', encoding='UTF-8')
f.write(title.text)
f.close()
Next we translate the data. Free translation libraries have daily limits, so the cheapest route is to paste the content into an online translator. There is a lot of data, but it works well enough; no big problem.
Save the translated text to a separate file; I named mine datacn.
Create a py file named cut, and add the imports at the top:
import jieba, re, jieba.analyse  # jieba word segmentation
from wordcloud import WordCloud  # word cloud
import matplotlib.pyplot as plt
After the imports, create a function to read the content of the translated text datacn:
def get_str(path):
    f = open(path, encoding="utf-8")
    data = f.read()
    f.close()
    return data
The code is simple: open the file, read it, done. But encoding errors are easy to hit here, so remember to pass encoding="utf-8". If that still fails, re-save the text file and choose utf-8 as the encoding when saving:
Next, create another function to clean the content:
def word_chinese(text):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    clean = re.sub(pattern, '', text)
    return clean
This function keeps only Chinese characters and discards everything else, such as punctuation marks and English letters, which would otherwise hurt the result.
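The filter can be checked on a small mixed string; everything outside the basic CJK range \u4e00-\u9fa5, including letters, digits, spaces and punctuation, is stripped:

```python
import re

def word_chinese(text):
    # Keep only characters in the basic CJK range; drop everything else.
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    return re.sub(pattern, '', text)

print(word_chinese("I like 酱油 (soy sauce)! 中餐 No.1"))   # → 酱油中餐
```

Note this also removes line breaks, so the cleaned text comes back as one long run of characters, which is fine for jieba's segmentation.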
Then read the data:
path = r"D:\datacn.txt"
text = get_str(path)
text = word_chinese(text)
Here path is where I stored the translated text. Pass it to get_str to read the file into text, then hand text to the cleaning function word_chinese, and the noisy data is cleared out.
But some filler words remain, for example you, I, he, hello, know... How do we drop them? Use jieba's stop-word setting, jieba.analyse.set_stop_words. The code is:
jieba.analyse.set_stop_words(r'D:\StopWords.txt')
Here D:\StopWords.txt is a text file listing the words you don't want. I tuned the list quite a bit for accuracy; it's too long to paste directly, so check the comment section if you want it.
Once set, filtering happens automatically. The next step is word segmentation and word-frequency statistics:
words = jieba.analyse.textrank(text, topK=168, withWeight=True)
The method is jieba.analyse.textrank(), where text is the cleaned text and topK is how many of the most frequent words to keep; topK=168 keeps the top 168. withWeight=True adds each word's weight to the result; for example, with withWeight=True the result looks like this:
Without withWeight, the result looks like this:
Now we have the results. It turns out the things foreigners mention most are "soy sauce" and "like". It seems they really do like it.
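jieba's TextRank returns (word, weight) pairs when withWeight=True. The top-K idea itself can be illustrated with a plain frequency count from the standard library; this is a stand-in for jieba (which needs its own install and a Chinese corpus), and the token list is made up for illustration:

```python
from collections import Counter

# Hypothetical token list, standing in for jieba's segmentation output.
tokens = ["soy sauce", "fried rice", "soy sauce", "tofu",
          "soy sauce", "fried rice", "hot pot"]

# most_common(k) plays the role of topK: the k highest-frequency words,
# each paired with its count (analogous to textrank's word/weight pairs).
top = Counter(tokens).most_common(2)
print(top)   # → [('soy sauce', 3), ('fried rice', 2)]
```

The difference is that TextRank weights words by a graph-ranking score rather than raw counts, so its ordering can differ from a pure frequency count.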
Next let's make the word cloud and then analyze it. WordCloud needs a string, not a list, so join the words into one string (if you kept withWeight=True, take only the word part of each pair first):
wcstr = " ".join(words)
Then create the word cloud object :
wc = WordCloud(background_color="white",
               width=1000,
               height=1000,
               font_path='simhei.ttf')
In the WordCloud configuration, background_color is the background color, width and height are the canvas size, and font_path sets the font. Note that the font must be set for Chinese text, otherwise you will see no characters at all.
Then pass the string to the word cloud object wc's generate function:
wc.generate(wcstr)
Finally, display it with plt:
plt.imshow(wc)
plt.axis("off")
plt.show()
The complete code is as follows :
import jieba, re, jieba.analyse
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get_str(path):
    f = open(path, encoding="utf-8")
    data = f.read()
    f.close()
    return data

def word_chinese(text):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    clean = re.sub(pattern, '', text)
    return clean

path = r"D:\datacn.txt"
text = get_str(path)
text = word_chinese(text)

jieba.analyse.set_stop_words(r'D:\StopWords.txt')
words = jieba.analyse.textrank(text, topK=168)
print(words)

wcstr = " ".join(words)

wc = WordCloud(background_color="white",
               width=1000,
               height=1000,
               font_path='simhei.ttf')
wc.generate(wcstr)

plt.imshow(wc)
plt.axis("off")
plt.show()
The final result is as follows :
Because there is so much data, a line chart or similar statistic would not be practical, so I picked out the dimensions foreigners mention most, ranked by weight.
The full rankings of what foreigners mention most:
Food destinations: Hong Kong, Macau, Guangdong, Wuxi, Guangzhou, Beijing, Minnan;
Foods: fried rice, steamed rice, tofu, soybeans, beef, noodles, hot pot, stir-fry, dumplings, cakes, steamed buns;
Flavors: sweet-and-sour, salty;
Cookware: hot pot, wok, stone pot, stove.
But number one overall is soy sauce, with "like" in second place. It seems everyone really does like our food. Nice!