
Python crawler: scraping precipitation data for Mianyang City over the last 3 months

Editor: Python

Contents

Preface
I. Basic goal
II. Steps
- 1. Analysis
- 2. Complete code
Result
Summary

Preface

Scrape the last 3 months of precipitation data for Mianyang City and store it in an xlsx file. The data is obtained with crawler techniques (the text names xpath and re; the code below actually uses requests and json), and pandas writes it to the xlsx file.

Tip: Crawlers must not be used for illegal activities. Set a sleep time when crawling and do not over-crawl: bringing a server down can make you legally liable!

I. Basic goal

Example: pandas is a tool built on NumPy, created to solve data analysis tasks.

II. Steps

1. Analysis

Open the China Weather site (weather.com.cn). Only the 40-day option shows precipitation reasonably completely, but it covers just one month, so I need to find the data for the other months.

I found that the month can be selected here. My guess was that choosing a month triggers some function event that returns data, so I opened the developer tools to watch the network traffic.

Clicking another month produces several html network requests. Checking them in the Preview tab reveals the detailed weather data, including date, temperature, precipitation, almanac entries and so on.
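Each month appears to be served from its own html file. As a rough sketch, assuming the URL pattern used in the complete code (a year directory, the city code 101270401, then year and month) and assuming the month is always written with two digits, the monthly URLs could be generated as follows. `build_month_url` is a hypothetical helper introduced for illustration, not part of the site's API.

```python
def build_month_url(year: int, month: int, city_code: str = "101270401") -> str:
    """Build the calendar URL for one month (pattern inferred from the article's URL)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Assumption: the month is zero-padded to two digits, e.g. 202203.
    return (
        "http://d1.weather.com.cn/calendar_new/"
        f"{year}/{city_code}_{year}{month:02d}.html"
    )


# The same months the complete code loops over (February to May 2022).
urls = [build_month_url(2022, m) for m in range(2, 6)]
```

Formatting the month with `:02d` would also cover October to December, which simple string concatenation after a fixed "0" does not.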

2. Complete code

import json
import time

import pandas as pd
import requests

# Result set
result_list = []
# Months 2 to 5 of 2022
for i in range(2, 6):
    # URL to crawl
    url = "http://d1.weather.com.cn/calendar_new/2022/101270401_20220" + str(i) + ".html"
    # Request headers used to get past anti-crawling checks
    headers = {
        "Referer": "http://www.weather.com.cn/",
        "Connection": "keep-alive",
        "Content-Encoding": "gzip",
        "Content-Type": "text/html",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
        "Cookie": "f_city=%E5%8D%97%E5%AE%81%7C101300101%7C; Hm_lvt_080dabacb001ad3dc8b9b9049b36d43b=1654830903,1654838893,1654956338; Hm_lpvt_080dabacb001ad3dc8b9b9049b36d43b=1654957148",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9"
    }
    # Fetch the page
    resp = requests.get(url=url, headers=headers)
    resp.encoding = 'utf-8'
    # Data string: skip the first 11 characters that precede the JSON
    data_str = resp.text[11:]
    # Convert to JSON to get the list of daily records
    data_list = json.loads(data_str)
    # Loop over the records and pick out the fields we need
    for data in data_list:
        result = {
            'City': 'Mianyang',
            'Probability of precipitation': data['hgl'],
            'Date': data['date'],
        }
        result_list.append(result)
    # Pause between requests so the server is not hammered
    time.sleep(1)
print(result_list)
# Write to Excel with pandas
df = pd.DataFrame(result_list)
# The context manager saves the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('rainfall.xlsx') as writer:
    df.to_excel(writer, sheet_name='page_1', float_format='%.5f')
print("Done!")
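The slice `resp.text[11:]` relies on the JSON always starting at character 11. A slightly more defensive sketch is to cut at the first `[` and the last `]` instead. The sample string here is made up for illustration: only the `date` and `hgl` keys come from the code above, while the prefix and the values are assumptions, and `extract_records` is a hypothetical helper.

```python
import json


def extract_records(text: str) -> list:
    """Return the JSON array embedded in a JavaScript-style response body."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON array found in response text")
    return json.loads(text[start:end + 1])


# Made-up sample in the assumed shape of the response; not real data.
sample = 'var data = [{"date": "20220301", "hgl": "10%"}, {"date": "20220302", "hgl": "35%"}]'
records = extract_records(sample)
rows = [{"Date": r["date"], "Probability of precipitation": r["hgl"]} for r in records]
```

With this approach a change in the length of the prefix would not break the parsing, although a change in the overall response format still would.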

Result

The program output is as follows

The saved xlsx file looks like this

Summary

The basic steps of a crawler:
1. Check whether the site has anti-crawling measures and set up the usual countermeasures; User-Agent and Referer checks are the most common.
2. Locate the data with xpath and re, then extract what you need (here the response is JSON, so json.loads is enough).
3. Write the data to an xlsx file with pandas.
4. Remember to set a sleep time between requests.
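Steps 1 and 4 can be combined in one small loop. The following is a minimal sketch rather than the author's code: `crawl_politely` is a hypothetical helper that accepts any fetch function (for example one wrapping `requests.get` with User-Agent and Referer headers) and sleeps between calls. The 1-second default delay is an arbitrary assumption.

```python
import time
from typing import Callable, Dict, Iterable


def crawl_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 1.0) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests to spare the server."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


# Stand-in fetch function so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return "body of " + url


pages = crawl_politely(["http://example.com/a", "http://example.com/b"],
                       fake_fetch, delay_seconds=0.01)
```

In real use the fetch argument might be something like `lambda u: requests.get(u, headers=headers, timeout=10).text`, with `headers` defined as in the complete code.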