A record of building my own crawler, for a project I have been working on recently.
The website to be crawled is relatively simple.
The problem is POST: some of the site's data has to be fetched with POST requests.
For example, to see the "Projects initiated" section you have to click with the mouse. At first I thought it was AJAX, but it is not; the data is fetched by JS.
On closer study, this is how the site actually works.
https://s*****view.php?id=GKUdgjKayCQvY
The specific part of the URL is omitted. The URL itself looks unremarkable, but if you inspect the page in the browser, you will find that clicking "Projects initiated" triggers a JS action.
If there is only one page of projects, you will not see the JS action. But if there are several pages and you have to click through them, you will see that JS is involved.
This action includes a POST. Based on its parameters, the requested URL can be built like this:
https://sd.zhiyuanyun.com/app/api/view.php?m=get_opps&type=2&id=89608371&p=3
Here id is the ID and p is the page number. The other parameters are defaults.
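As a minimal sketch of how such a URL can be assembled from its parameters, the snippet below uses the standard library's urlencode. The names BASE_URL and build_url are made up for illustration; only the endpoint and the parameter names come from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above.
BASE_URL = 'https://sd.zhiyuanyun.com/app/api/view.php'


def build_url(entity_id, page):
    """Build the request URL for one page (hypothetical helper)."""
    # m and type are the default parameters; id and p vary per request.
    query = urlencode({'m': 'get_opps', 'type': '2', 'id': entity_id, 'p': page})
    return f'{BASE_URL}?{query}'


print(build_url(89608371, 3))
```

Called with id 89608371 and page 3, this reproduces the URL shown above.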
Then construct the request. The code below sends the parameters as a query string with requests.get.
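import requests
from scrapy import Selector  # assumed import; the original does not show where Selector comes from

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; the original headers are not shown


def get_proj_number(id):
    print("Fetching the organization's total number of projects")
    # m and type are default parameters; p is the page number
    params = (('m', 'get_opps'), ('type', '2'), ('id', id), ('p', '1'))
    response = requests.get(
        'https://sd.zhiyuanyun.com/app/api/view.php', headers=headers, params=params)
    selector = Selector(text=response.text)
    return selector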
This way, putting the p parameter in a for loop is enough to cover every page.
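The loop over p could be sketched as follows. This is only an illustration under assumptions: page_params and crawl_all_pages are hypothetical names, the number of pages is assumed to be known already, and the actual HTTP call is passed in as a function so that the loop itself stays independent of the network.

```python
def page_params(entity_id, last_page):
    """Yield the query parameters for pages 1..last_page (hypothetical helper)."""
    for page in range(1, last_page + 1):
        # Only p changes from one request to the next.
        yield (('m', 'get_opps'), ('type', '2'), ('id', str(entity_id)), ('p', str(page)))


def crawl_all_pages(entity_id, last_page, fetch):
    """Call fetch(params) once per page and return the results in page order."""
    return [fetch(params) for params in page_params(entity_id, last_page)]
```

With requests, fetch could be something like `lambda params: requests.get(url, headers=headers, params=params).text`, where url and headers are whatever the crawler already uses.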
The data page that finally comes back is a list.
So how do I save this list?
The list contains th and td elements.
I simply collect the td cells into a list and then zip them once.
I settled for a simpler version: a single zip(list).
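One possible way to pair th and td values with zip is sketched below. It assumes the th texts and the td texts have already been extracted into two flat lists and that every row has one td per th; the function name and the sample values are invented for illustration and are not taken from the site.

```python
def rows_to_records(th_texts, td_texts):
    """Turn a flat list of td texts into one dict per table row."""
    width = len(th_texts)
    # Cut the flat td list into rows that are as wide as the header.
    rows = [td_texts[i:i + width] for i in range(0, len(td_texts), width)]
    # zip pairs each header with the cell in the same column.
    return [dict(zip(th_texts, row)) for row in rows]


# Invented sample data, only to show the shape of the result.
th_texts = ['name', 'status']
td_texts = ['Project A', 'open', 'Project B', 'closed']
records = rows_to_records(th_texts, td_texts)
print(records)
```

Each record can then be written out one per line, for example as CSV or JSON.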