Hello everyone, I'm Chen Chen.
A few days ago, a certain "Ya" was fined 1.341 billion yuan for tax evasion. As soon as the news broke, it set off a huge storm online and netizens exploded, all lamenting that what they earn isn't even a fraction of that fine.
So I crawled the data under this Weibo topic and did a simple public opinion analysis!
Since it is more convenient to crawl Weibo from the mobile site, this time we will crawl from the mobile version of Weibo.
This is where we normally type in keywords to search Weibo content.
Watching this page in the browser's developer tools, I found that every time a keyword search request is made, an XHR response comes back.
We have now found the page where the data actually lives, so the rest is routine crawler work: send a request to that page and then extract the data.
By inspecting the request headers, it is not hard to construct the request code.
The request code is as follows:
import requests

key = input("Please enter the keyword to crawl: ")
for page in range(1, 10):
    params = (
        ('containerid', f'100103type=1&q={key}'),
        ('page_type', 'searchall'),
        ('page', str(page)),
    )
    response = requests.get('https://m.weibo.cn/api/container/getIndex',
                            headers=headers, params=params)
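The headers object used above is not shown in the snippet. A minimal sketch of what it might contain, defined before the loop; the User-Agent string and Referer below are placeholder examples, not the author's actual values:

# Hypothetical request headers: a browser-like User-Agent plus the mobile-site Referer
headers = {
    'User-Agent': ('Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                   'AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148'),
    'Referer': 'https://m.weibo.cn/',
}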
From the observation above, this response could also be parsed into a dictionary (JSON) for extraction, but after testing I found that pulling the fields out with regular expressions is the simplest and most convenient, so regex extraction is used here. Interested readers can try the dictionary approach (see the sketch after the storage code below). The code is as follows:
import re

r = response.text
title = re.findall('"page_title":"(.*?)"', r)
comments_count = re.findall('"comments_count":(.*?),', r)
attitudes_count = re.findall('"attitudes_count":(.*?),', r)
for i in range(len(title)):
    # eval turns the \u escape sequences in the matched titles into readable text
    print(eval(f"'{title[i]}'"), comments_count[i], attitudes_count[i])
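The storage code below also uses reposts_count and created_at, which are not extracted above; presumably they come from the same regex approach. A sketch (the exact patterns here are my assumption, mirroring the fields in the raw response text):

# Assumed: extracted the same way as the fields above
reposts_count = re.findall('"reposts_count":(.*?),', r)
created_at = re.findall('"created_at":"(.*?)"', r)
# created_at strings appear to look like "Tue Dec 21 21:00:00 +0800 2021", so splitting
# on whitespace yields weekday, month, day, time, timezone and year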
Now that the data has been parsed, we can store it directly. Here I write the data into a CSV file; the code is as follows:
import csv

for i in range(len(title)):
    try:
        with open(f'{key}.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            # columns: title, comments, likes, reposts, year, month, day, weekday, time
            writer.writerow([eval(f"'{title[i]}'"),
                             comments_count[i],
                             attitudes_count[i],
                             reposts_count[i],
                             created_at[i].split()[-1],
                             created_at[i].split()[1],
                             created_at[i].split()[2],
                             created_at[i].split()[0],
                             created_at[i].split()[3]])
    except:
        pass
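For readers who want to try the dictionary approach mentioned above, here is a rough sketch. It assumes the getIndex response follows the usual data/cards/mblog layout; the field names are taken from the raw text we matched with the regex:

# Sketch of dictionary-based extraction (assumed response layout)
data = response.json()                     # parse the XHR response as a dictionary
for card in data.get('data', {}).get('cards', []):
    mblog = card.get('mblog')              # cards without an mblog entry are skipped
    if not mblog:
        continue
    print(mblog.get('text'),
          mblog.get('comments_count'),
          mblog.get('attitudes_count'),
          mblog.get('reposts_count'),
          mblog.get('created_at'))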
After the data is collected, it needs to be cleaned to meet the analysis requirements before any visual analysis.
Use pandas to read the crawled data and preview it.
import pandas as pd

df = pd.read_csv('weiya.csv', encoding='gbk')
print(df.head(10))
We found that the month column stores English abbreviations; we need to convert them into numbers. The code is as follows:
# Convert the English month abbreviations to numbers
c = []
for i in list(df['month']):
    if i == 'Nov':
        c.append(11)
    elif i == 'Dec':
        c.append(12)
    elif i == 'Apr':
        c.append(4)
    elif i == 'Jan':
        c.append(1)
    elif i == 'Oct':
        c.append(10)
    else:
        c.append(7)
df['month'] = c
df.to_csv('weiya.csv', encoding='gbk', index=False)
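The same conversion can be written more compactly with a dictionary and map; a sketch that keeps the original behavior of treating any other abbreviation as July:

# Equivalent, more compact month mapping (sketch)
month_map = {'Jan': 1, 'Apr': 4, 'Jul': 7, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['month'] = df['month'].map(lambda m: month_map.get(m, 7))   # unknown values fall back to 7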
Check the field types and missing values; they already meet the analysis requirements, so no additional processing is needed.
df.info()
Now let's visualize and analyze the data.
We only crawled about 100 pages of data here, which may be why there are fewer posts on the 20th and 21st.
The code is as follows:
from pyecharts.charts import Bar
from pyecharts import options as opts

# Count how many posts were published on each day of December
c = []   # distinct days seen
d = {}   # day -> number of posts
a = 0
for i in list(df['month']):
    if i == 12:
        if list(df['day'])[a] not in c:
            c.append(list(df['day'])[a])
    a += 1
a = 0
for i in list(df['month']):
    if i == 12:
        d[list(df['day'])[a]] = d.get(list(df['day'])[a], 0) + 1
    a += 1
columns = []
data = []
for k, v in d.items():
    columns.append(k)
    data.append(v)
bar = (
    Bar()
    .add_xaxis(columns)
    .add_yaxis("Number of posts", data)
    .set_global_opts(title_opts=opts.TitleOpts(title="Daily number of Weibo posts"))
)
bar.render("word_frequency.html")
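If you prefer pandas, the same daily counts for December can be computed in a couple of lines before plotting; a sketch:

# Equivalent counting with pandas (sketch)
december = df[df['month'] == 12]                       # keep only December posts
daily_counts = december['day'].value_counts().sort_index()
print(daily_counts)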
We found that the post by "doutujun starfish" has the most comments and likes, over 75,000 (7.5w+). Let's look at its comments to see why users like it so much.
It may have so many likes partly because it was posted early and sits near the top, and partly because its content matches what everyone is thinking.
Analyzing the posting times of all the posts, we found that 21:00 has the most. That is roughly when the topic hit the hot search list, so it seems that whether or not a topic makes the hot search list has a big impact on Weibo activity.
The code is as follows:
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

df = pd.read_csv('weiya.csv', encoding='gbk')

# Count how many posts were published in each hour of the day
c = []   # distinct hours seen
d = {}   # hour -> number of posts
for i in list(df['hour']):
    if i not in c:
        c.append(i)
for i in c:
    d[i] = 0
for i in list(df['hour']):
    d[i] += 1
print(d)

columns = []
data = []
for k, v in d.items():
    columns.append(k)
    data.append(v)
bar = (
    Bar()
    .add_xaxis(columns)
    .add_yaxis("Time", data)
    .set_global_opts(title_opts=opts.TitleOpts(title="Time distribution"))
)
bar.render("word_frequency.html")
From the word cloud we can see that "tax evasion" appears a lot, which fits the theme; next come "cancel", "ban" and "jail". It seems people really do hate illegal behavior.
The code is as follows:
import jieba
from imageio import imread
from wordcloud import WordCloud, STOPWORDS

with open("weiya.txt", encoding='utf-8') as f:
    job_title_1 = f.read()
with open('stopwords.txt', 'r', encoding='utf-8') as f:   # stop-word list file
    stop_word = f.read()

# Segment the text with jieba and drop stop words
word = jieba.cut(job_title_1)
words = []
for i in list(word):
    if i not in stop_word:
        words.append(i)
contents_list_job_title = " ".join(words)

STOPWORDS.add("One")   # set.add returns None, so add to the set first, then pass the set in
wc = WordCloud(stopwords=STOPWORDS,
               collocations=False,
               background_color="white",
               font_path=r"K:\SuXinShiLiuKai.ttf",   # the author's local Chinese font; any Chinese .ttf works
               width=400,
               height=300,
               random_state=42,
               mask=imread('xin.jpg', pilmode="RGB"))
wc.generate(contents_list_job_title)
wc.to_file("wordcloud.png")
As public figures, influencers and celebrities should set an example; they cannot enjoy fame and wealth while still engaging in illegal behavior.