Language: Python 3.8
Scraper: Selenium
Proxy: IPIDEA
Notes: the complete code is at the end; beginners, read the suggestions slowly. The steps in this article are: 1. get the data, 2. translate it, 3. clean the data, 4. segment words and weight them, 5. draw the word cloud.
For simplicity I used Selenium to grab the data (Selenium is rookie-friendly, and I'm a rookie myself), together with an IPIDEA proxy; otherwise my IP would have been blocked long before I finished testing and debugging.
Selenium can be installed with pip:
pip install selenium
After installing Selenium you also need a driver, which must match your browser version; only Firefox and Chrome are supported.
I use Chrome here. First click the three dots in the upper-right corner of the Chrome browser:
Choose Help > About Google Chrome to open the chrome://settings/help page, then note the version number:
Next go to the driver download page: http://chromedriver.storage.googleapis.com/index.html
Find the driver closest to your version number and open it:
Then click in and download the build for your platform:
On Windows the win32 build works fine. After downloading, unzip it to a directory and you're done.
The proxy I used is IPIDEA (the official site is linked above); the free tier is enough for this.
The first step is getting the data, fetched through the proxy.
First create a Python file; I named mine test1, but use any name you like.
Then, in your editor (I use VS Code; use whatever you like), add the imports at the top:
from selenium import webdriver
import requests, json, time
Then define the request headers:
# request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'}
With that in place, the first thing to do is obtain a proxy. Let's write a function called ip_:
# proxy acquisition
def ip_():
    url = r"http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=in&port=1"
    r = requests.get(url, headers=headers, timeout=3)
    data = json.loads(r.content)
    ip_ = data['data'][0]
    return ip_
In the code above, url holds the proxy-extraction link http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=in&port=1.
Some readers may find the request fails; the usual reason is that your current IP has not been added to the whitelist.
Setting up the whitelist is simple: just put your IP at the end of the link:
https://api.ipidea.net/index/index/save_white?neek=***&appkey=***************************&white=<your whitelist IP>
You can generate this whitelist link for your own account at https://www.ipidea.net/getapi:
Mine already works, so I've blurred it out in the screenshot.
Back in ip_(), the line r = requests.get(url, headers=headers, timeout=3) fetches the proxy address, json.loads(r.content) parses the response, and the function returns the first IP entry. Obtaining the IP really is that simple.
Next, take the returned proxy and compose the proxy string:
ip = ip_()  # obtain a proxy IP
proxy_str = "http://" + str(ip['ip']) + ':' + str(ip['port'])  # combine into a proxy string
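To see the parsing and combining steps in isolation, here is a minimal sketch that fakes the API response locally. The JSON shape and the sample address are assumptions for illustration only; the real service returns live proxies:

```python
import json

# Hypothetical response body, shaped like the {'data': [{'ip': ..., 'port': ...}]}
# structure that ip_() indexes into.
sample_body = b'{"data": [{"ip": "203.0.113.7", "port": 40001}]}'

data = json.loads(sample_body)   # json.loads accepts bytes, just like r.content
ip = data['data'][0]             # first proxy entry
proxy_str = "http://" + str(ip['ip']) + ':' + str(ip['port'])
print(proxy_str)                 # → http://203.0.113.7:40001
```

The str() calls matter because the API may return the port as a number rather than a string.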
Then use webdriver options to set the proxy for Chrome:
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=%s" % proxy_str)
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
In the code above, options.add_argument adds the proxy to the browser; the last two lines just suppress certificate and SSL errors, and things generally still work if you leave them out.
Then create a variable url to store the page we need to scrape:
url='https://www.quora.com/topic/Chinese-Food?q=Chinese%20food'
Next create the Chrome browser object:
driver = webdriver.Chrome(executable_path=r'C:\webdriver\chromedriver.exe', options=options)
driver.get(url)
input()
In webdriver.Chrome, executable_path points at the driver you downloaded, and options carries the proxy configuration.
Once created, driver can be thought of as the Chrome browser itself; opening a page is just a matter of calling its get method with the URL.
Because the page loads more content automatically whenever the browser scrolls to the bottom, we just loop and keep scrolling down:
for i in range(0, 500):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)
In the loop, driver.execute_script runs a JavaScript command, and window.scrollTo(0, document.body.scrollHeight); is the scroll itself. Give the page a rest after each scroll, otherwise content may not load in time; hence the sleep of 10 s.
Then we grab the content blocks on the page:
To stay on the safe side, I've blurred the content in the screenshot.
At this point, right-click the page and choose Inspect to open the source:
We can see that the content sits under this piece of HTML:
From it we learn that this block's class name is q-box, so we can use driver's find_element_by_class_name method to locate the element and read its text.
Looking through the other content blocks confirms they all use q-box as the class name:
So all we need is this line:
content = driver.find_element_by_class_name('q-box')
which grabs the q-box element containing all of the content.
From that object, .text yields the text, and f.write saves it to a file:
f = open(r'C:\Users\Administrator\Desktop\data\data.txt', mode='w', encoding='UTF-8')
f.write(content.text)
f.close()
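Encoding problems come up again when we read this file back later, so here is a small round-trip sketch (using a temporary path instead of the desktop path, which is an assumption for portability) showing that writing and reading with an explicit encoding='UTF-8' preserves non-ASCII text:

```python
import os
import tempfile

sample = "Chinese food 中餐, soy sauce 酱油"   # mixed ASCII / non-ASCII text
path = os.path.join(tempfile.gettempdir(), "data_demo.txt")  # stand-in for data.txt

# Write with an explicit encoding, exactly as in the scraper above.
f = open(path, mode='w', encoding='UTF-8')
f.write(sample)
f.close()

# Read it back with the same encoding; the text survives intact.
f = open(path, encoding='UTF-8')
restored = f.read()
f.close()

print(restored == sample)   # → True
```

If you write with one encoding and read with another (or with the platform default), this is where garbled characters or UnicodeDecodeError come from.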
The complete code for this part is as follows :
from selenium import webdriver
import requests, json, time

# request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0'}

# proxy acquisition
def ip_():
    url = r"http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=in&port=1"
    r = requests.get(url, headers=headers, timeout=3)
    data = json.loads(r.content)
    print(data)
    ip_ = data['data'][0]
    return ip_

ip = ip_()  # obtain a proxy IP
proxy_str = "http://" + str(ip['ip']) + ':' + str(ip['port'])  # combine into a proxy string

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=%s" % proxy_str)
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')

url = 'https://www.quora.com/topic/Chinese-Food?q=Chinese%20food'
driver = webdriver.Chrome(executable_path=r'C:\webdriver\chromedriver.exe', options=options)
driver.get(url)
input()

for i in range(0, 500):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

title = driver.find_element_by_class_name('q-box')
#title = driver.find_element_by_css_selector("[class='dtb-style-1 table-dragColumns']")

f = open(r'C:\Users\Administrator\Desktop\data\data.txt', mode='w', encoding='UTF-8')
f.write(title.text)
f.close()
Next we translate the data. Free translation libraries have daily limits, so the cheapest route is to paste the content into an online translator. There is a lot of data, but it works well enough; no big problem.
Save the translated text to a separate file; I named mine datacn.
Create a py file named cut, and add the imports at the top:
import jieba, re, jieba.analyse  # jieba word segmentation
from wordcloud import WordCloud  # word cloud
import matplotlib.pyplot as plt
After the imports, create a function to read the content of the translated text datacn:
def get_str(path):
    f = open(path, encoding="utf-8")
    data = f.read()
    f.close()
    return data
The code is simple: open the file, read it, done. But encoding errors are easy to hit here, so remember to pass encoding="utf-8". If that still fails, re-save the text file and choose utf-8 as the encoding when saving:
Next, create another function to clean the content:
def word_chinese(text):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    clean = re.sub(pattern, '', text)
    return clean
This function keeps only Chinese characters and discards everything else, such as punctuation marks and English letters, which would otherwise hurt the result.
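The filter can be checked on a small mixed string; everything outside the basic CJK range \u4e00-\u9fa5, including letters, digits, spaces and punctuation, is stripped:

```python
import re

def word_chinese(text):
    # Keep only characters in the basic CJK range; drop everything else.
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    return re.sub(pattern, '', text)

print(word_chinese("I like 酱油 (soy sauce)! 中餐 No.1"))   # → 酱油中餐
```

Note this also removes line breaks, so the cleaned text comes back as one long run of characters, which is fine for jieba's segmentation.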
Then read the data:
path = r"D:\datacn.txt"
text = get_str(path)
text = word_chinese(text)
Here path is where I stored the translated text. Pass it to get_str to read the file into text, then hand text to the cleaning function word_chinese, and the noisy data is cleared out.
But some filler words remain, for example you, I, he, hello, know... How do we drop them? Use jieba's stop-word setting, jieba.analyse.set_stop_words. The code is:
jieba.analyse.set_stop_words(r'D:\StopWords.txt')
Here D:\StopWords.txt is a text file listing the words you don't want. I tuned the list quite a bit for accuracy; it's too long to paste directly, so check the comment section if you want it.
Once set, filtering happens automatically. The next step is word segmentation and word-frequency statistics:
words = jieba.analyse.textrank(text, topK=168, withWeight=True)
The method is jieba.analyse.textrank(), where text is the cleaned text and topK is how many of the most frequent words to keep; topK=168 keeps the top 168. withWeight=True adds each word's weight to the result; for example, with withWeight=True the result looks like this:
Without withWeight, the result looks like this:
Now we have the results. It turns out the things foreigners mention most are "soy sauce" and "like". It seems they really do like it.
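jieba's TextRank returns (word, weight) pairs when withWeight=True. The top-K idea itself can be illustrated with a plain frequency count from the standard library; this is a stand-in for jieba (which needs its own install and a Chinese corpus), and the token list is made up for illustration:

```python
from collections import Counter

# Hypothetical token list, standing in for jieba's segmentation output.
tokens = ["soy sauce", "fried rice", "soy sauce", "tofu",
          "soy sauce", "fried rice", "hot pot"]

# most_common(k) plays the role of topK: the k highest-frequency words,
# each paired with its count (analogous to textrank's word/weight pairs).
top = Counter(tokens).most_common(2)
print(top)   # → [('soy sauce', 3), ('fried rice', 2)]
```

The difference is that TextRank weights words by a graph-ranking score rather than raw counts, so its ordering can differ from a pure frequency count.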
Next let's make the word cloud and then analyze it. WordCloud needs a string, not a list, so join the words into one string (if you kept withWeight=True, take only the word part of each pair first):
wcstr = " ".join(words)
Then create the word cloud object :
wc = WordCloud(background_color="white",
               width=1000,
               height=1000,
               font_path='simhei.ttf')
In the WordCloud configuration, background_color is the background color, width and height are the canvas size, and font_path sets the font. Note that the font must be set for Chinese text, otherwise you will see no characters at all.
Then pass the string to the word cloud object wc's generate function:
wc.generate(wcstr)
Finally, display it with plt:
plt.imshow(wc)
plt.axis("off")
plt.show()
The complete code is as follows :
import jieba, re, jieba.analyse
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get_str(path):
    f = open(path, encoding="utf-8")
    data = f.read()
    f.close()
    return data

def word_chinese(text):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    clean = re.sub(pattern, '', text)
    return clean

path = r"D:\datacn.txt"
text = get_str(path)
text = word_chinese(text)

jieba.analyse.set_stop_words(r'D:\StopWords.txt')
words = jieba.analyse.textrank(text, topK=168)
print(words)

wcstr = " ".join(words)

wc = WordCloud(background_color="white",
               width=1000,
               height=1000,
               font_path='simhei.ttf')
wc.generate(wcstr)

plt.imshow(wc)
plt.axis("off")
plt.show()
The final result is as follows :
Because there is so much data, a line chart or similar statistic would not be practical, so I picked out the dimensions foreigners mention most, ranked by weight.
The full rankings of what foreigners mention most:
Food destinations: Hong Kong, Macau, Guangdong, Wuxi, Guangzhou, Beijing, Minnan;
Foods: fried rice, steamed rice, tofu, soybeans, beef, noodles, hot pot, stir-fry, dumplings, cakes, steamed buns;
Flavors: sweet-and-sour, salty;
Cookware: hot pot, wok, stone pot, stove.
But number one overall is soy sauce, with "like" in second place. It seems everyone really does like our food. Nice!