Catalog
Concept
1. What is a crawler
2. How a web page is loaded
3. What a URL means
4. Setting up the environment
Installation
Introduction
Basic requests
Basic GET requests
Basic POST requests
Cookies
Timeout configuration
Session objects
SSL certificate verification
Proxies
Practical example
Complete code
A crawler, or web crawler, can be thought of as a spider crawling across the Internet. The Internet is like a big web, and the crawler is a spider crawling around on it; whenever it encounters a resource, it grabs it. What it grabs is up to you to control. For example, while crawling one web page it may find a path in the web, which is actually a hyperlink to another page, and it can follow that link to the next page and fetch its data. In this way, the whole connected web is within the spider's reach, and crawling it all is only a matter of time.
While browsing the web we see many nice pictures, for example on Baidu Images we see a set of pictures together with the Baidu search box. What actually happens is this: the user enters a web address, a DNS server resolves it and locates the server host, the browser sends a request to that server, the server processes it and sends HTML, JS, CSS and other files back to the user's browser, and the browser parses them so that the user sees all kinds of pictures. So the page the user sees is essentially made up of HTML code, and that is what a crawler fetches; by analyzing and filtering that HTML code, it can extract pictures, text and other resources.
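As a tiny illustration of what "analyzing and filtering HTML" can look like, here is a minimal sketch; the HTML fragment is made up for the example and stands in for a downloaded page:

import re

# a made-up fragment of HTML, standing in for a downloaded page
html = '<div><img src="http://example.com/a.jpg"><img src="http://example.com/b.jpg"></div>'
# pull out every image address with a regular expression
images = re.findall(r'<img src="(.*?)"', html)
print(images)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']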
A URL, the uniform resource locator, is what we usually call a web address. It is a concise representation of the location of a resource available on the Internet and of how to access it, and it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information about where the file is and how the browser should handle it.
A URL consists of three parts: ① the protocol (or service mode); ② the IP address of the host that stores the resource (sometimes including a port number); ③ the specific address of the resource on that host, such as the directory and file name.
A crawler must have a target URL before it can fetch any data, so the URL is the basic starting point for a crawler, and understanding it precisely is very helpful when learning to write crawlers.
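To make the three parts concrete, here is a small sketch using Python's standard urllib.parse module to split a URL into its components; the URL itself is just an example chosen for illustration:

from urllib.parse import urlparse

# an example URL, used only for illustration
url = 'http://httpbin.org:80/get/index.html'
parts = urlparse(url)
print(parts.scheme)   # protocol (service mode): 'http'
print(parts.netloc)   # host and optional port: 'httpbin.org:80'
print(parts.path)     # address of the resource on the host: '/get/index.html'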
To study Python you of course need a working environment. At first I used Notepad++, but I found its code hints too weak, so on Windows I use PyCharm and on Linux I use Eclipse for Python.
Install with pip:

$ pip install requests
Or install with easy_install:

$ easy_install requests
Either of these two methods completes the installation.
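To confirm the installation worked, you can import the library and print its version, a quick sanity check:

$ python -c "import requests; print(requests.__version__)"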
First, let's look at a small example to get a feel for it:
import requests

r = requests.get('http://cuiqingcai.com')
print(type(r))
print(r.status_code)
print(r.encoding)
# print(r.text)
print(r.cookies)
The code above requests the given URL and then prints the type of the return value, the status code, the encoding, the cookies and so on. The output is as follows:
<class 'requests.models.Response'>
200
UTF-8
<RequestsCookieJar[]>
Pretty convenient, isn't it? Don't worry, it gets even more convenient later on.
The requests library provides all the basic HTTP request methods. For example:
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
Yes, each of these is a one-liner.
The most basic GET request can be sent directly with the get method:
r = requests.get("http://httpbin.org/get")
If you want to add query parameters, use the params argument:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)
Running results
http://httpbin.org/get?key2=value2&key1=value1
If you request a JSON file, you can parse it with the json() method. For example, write a JSON file named a.json with the following content:
["foo", "bar", { "foo": "bar" }]
Use the following program to request and parse it. Note that requests needs an HTTP URL rather than a local path, so here we assume the file is being served locally, for example with python -m http.server:
import requests

# assumes a.json is served over HTTP, e.g. by running `python -m http.server` in its directory
r = requests.get("http://localhost:8000/a.json")
print(r.text)
print(r.json())
The output is shown below: the first line prints the content directly, the second is the result of parsing with the json() method. Note the difference between the two:
["foo", "bar", { "foo": "bar" }]
['foo', 'bar', {'foo': 'bar'}]
If you want the raw socket response from the server, you can access r.raw. You must set stream=True in the initial request:
>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
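A related, commonly used pattern, shown here as a sketch under the assumption that you want to save a large response to disk without loading it all into memory, is to combine stream=True with iter_content:

import requests

r = requests.get('https://github.com/timeline.json', stream=True)
# write the body to disk in 1 KB chunks instead of holding it all in memory
with open('timeline.json', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)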
This gives you access to the raw socket content of the page. If you want to add headers, pass the headers argument:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
headers = {'content-type': 'application/json'}
r = requests.get("http://httpbin.org/get", params=payload, headers=headers)
print(r.url)
The headers argument adds the given fields to the request headers.
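Crawlers often need to send a browser-like User-Agent header, since some sites treat the default python-requests one differently. A minimal sketch; the User-Agent string below is just an example:

import requests

# a browser-like User-Agent string; any real browser UA works here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get('http://httpbin.org/headers', headers=headers)
print(r.json()['headers']['User-Agent'])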
For a POST request we usually need to send some parameters. The most basic way to pass them is through the data argument:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)
Running results
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1"
  },
  "json": null,
  "url": "http://httpbin.org/post"
}
You can see the parameters were passed successfully, and the server returned the data we sent back to us. Sometimes the information we need to send is not a form; we need to send JSON-format data instead, so we can serialize the form data with json.dumps():
import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
print(r.text)
Running results
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" }
With the methods above we can POST JSON-formatted data. If you want to upload a file, just use the files argument. Create a file named test.txt containing Hello World!:
import requests

url = 'http://httpbin.org/post'
files = {'file': open('test.txt', 'rb')}
r = requests.post(url, files=files)
print(r.text)
The output looks like this:
{
  "args": {},
  "data": "",
  "files": {
    "file": "Hello World!"
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "156",
    "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1"
  },
  "json": null,
  "url": "http://httpbin.org/post"
}
That completes a file upload. requests also supports streaming uploads, which allows you to send large streams or files without reading them into memory first. To use streaming upload, simply provide a file-like object as the request body:
with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

This is a very practical and convenient feature.
If a response contains cookies, we can read them from the cookies attribute:
import requests

url = 'http://example.com'
r = requests.get(url)
print(r.cookies)
print(r.cookies['example_cookie_name'])
The program above is only an example of reading cookies from a response with the cookies attribute. You can also use the cookies argument to send cookies to the server:
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
print(r.text)
Running results
'{"cookies": {"cookies_are": "working"}}'
The cookies were successfully sent to the server.
You can use the timeout argument to configure the maximum time to wait for a response:
requests.get('http://github.com', timeout=0.001)
Note: timeout is not a limit on the total time it takes to download the response body; it only limits how long requests waits for the server to start responding (more precisely, how long it waits without receiving any data). Even if the returned response is large and takes a long time to download, that alone will not trigger the timeout.
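A sketch of how you might handle this in practice: pass a (connect, read) tuple and catch the Timeout exception. The URL and values here are only examples:

import requests

try:
    # 3 seconds to establish the connection, 10 seconds of allowed silence while reading
    r = requests.get('http://httpbin.org/delay/5', timeout=(3, 10))
    print(r.status_code)
except requests.exceptions.Timeout:
    print('the request timed out')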
In the requests above, each call is a brand-new request, as if every one of them were made by a different browser. In other words they do not share a session, even when the same URL is requested. For example:
import requests

requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = requests.get("http://httpbin.org/cookies")
print(r.text)
The result is
{
  "cookies": {}
}
Clearly these requests are not in the same session, so the cookie cannot be retrieved. On some sites we need to keep a persistent session; it is like browsing Taobao in one browser and jumping between tabs, which all share a single long-lived session. The solution is as follows:
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
Here we made two requests: one to set the cookie and one to read it. The output:
{
  "cookies": {
    "sessioncookie": "123456789"
  }
}
This time the cookie is retrieved successfully; that is how a session works. Give it a try. Since the Session object is shared across requests, we can also use it for global configuration:
import requests

s = requests.Session()
s.headers.update({'x-test': 'true'})
r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
print(r.text)
s.headers.update sets a header on the session. We then also pass a headers argument in the individual request, so what happens? Simple: both headers are sent. The output:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1",
    "X-Test": "true",
    "X-Test2": "true"
  }
}
What if the headers passed to the get method also contain x-test?
r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})
It overrides the global configuration:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1",
    "X-Test": "true"
  }
}
And if you don't want one of the globally configured headers for a particular request? Simple: set it to None:
r = s.get('http://httpbin.org/headers', headers={'x-test': None})
Running results
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1"
  }
}
That covers the basic usage of Session objects.
Nowadays HTTPS sites are everywhere. Requests can verify SSL certificates for HTTPS requests, just like a web browser does. To check a host's SSL certificate, use the verify argument. At the time this was written, 12306's certificate was not trusted, so let's test it:
import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=True)
print(r.text)
result
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
That is what happens. Now let's try GitHub:
import requests

r = requests.get('https://github.com', verify=True)
print(r.text)
This one requests normally, so I won't show the output. If we want to skip the 12306 certificate verification from before, set verify to False:
import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=False)
print(r.text)
Now the request goes through. By default verify is True, so when you need to skip verification you have to set it yourself.
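Instead of disabling verification entirely, verify can also be pointed at a CA bundle file. A sketch; the bundle path below is just a placeholder:

import requests

# path to a CA bundle file (placeholder path, adjust for your system)
r = requests.get('https://kyfw.12306.cn/otn/', verify='/path/to/ca-bundle.crt')
print(r.status_code)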
If you need to use a proxy, you can configure individual requests with the proxies argument:
import requests

proxies = {
    "https": "http://41.118.132.69:4433"
}
r = requests.post("http://httpbin.org/post", proxies=proxies)
print(r.text)
You can also configure proxies through the HTTP_PROXY and HTTPS_PROXY environment variables:
export HTTP_PROXY="http://10.10.1.10:3128"
export HTTPS_PROXY="http://10.10.1.10:1080"
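If your proxy requires HTTP Basic authentication, you can embed the credentials in the proxy URL. A sketch with placeholder credentials and addresses:

import requests

# placeholder username, password and proxy addresses
proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:1080",
}
r = requests.get("http://httpbin.org/ip", proxies=proxies)
print(r.text)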
Finally, here is the complete code of a practical example: it crawls movie information from Douban with Selenium and BeautifulSoup, saves it to CSV files, and can load the results into MySQL.

import csv
import pymysql
import time
import pandas as pd
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
# Connect to the database and save the csv file into MySQL
def getConnect():
    # Connect to the MySQL database (note: the charset parameter is utf8, not utf-8)
    conn = pymysql.connect(host='localhost', port=3307, user='root', password='liziyi123456',
                           db='db_douban', charset='utf8')
    # Create a cursor object
    cursor = conn.cursor()
    # Read the csv file
    with open('mm.csv', 'r', encoding='utf-8') as f:
        read = csv.reader(f)
        # Insert row by row, skipping the header line
        for each in list(read)[1:]:
            i = tuple(each)
            # Build the SQL insert statement (movices is the table name)
            sql = "INSERT INTO movices VALUES" + str(i)
            cursor.execute(sql)  # execute the SQL statement
            conn.commit()        # commit the data
    cursor.close()  # close the cursor
    conn.close()    # close the database connection
def getMovice(year):
    # Open the Douban movie listing for the given year and click "load more" repeatedly
    server = Service('chromedriver.exe')
    driver = webdriver.Chrome(service=server)
    driver.implicitly_wait(60)
    driver.get("https://movie.douban.com/tag/#/?sort=S&range=0,10&tags=" + year + ",%E7%94%B5%E5%BD%B1")
    driver.maximize_window()
    actions = ActionChains(driver)
    actions.scroll(0, 0, 0, 600).perform()
    for i in range(50):  # loop 50 times to crawl 1000 movies
        btn = driver.find_element(by=By.CLASS_NAME, value="more")
        time.sleep(3)
        actions.move_to_element(btn).click().perform()
        actions.scroll(0, 0, 0, 600).perform()
    html = driver.page_source
    driver.close()
    return html
def getDetails(url):  # Get the details of a single movie
    option = webdriver.ChromeOptions()
    option.add_argument('headless')
    driver = webdriver.Chrome(options=option)
    driver.get(url=url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div', id='info')
    # spans = div.find_all('span')
    ls = div.text.split('\n')
    # print(ls)
    data = None
    country = None
    type = None
    time = None
    # The labels below are the Chinese field names used on the Douban detail page
    for p in ls:
        if re.match('类型: .*', p):            # genre
            type = p[4:].split(' / ')[0]
        elif re.match('制片国家/地区: .*', p):  # country/region of production
            country = p[9:].split(' / ')[0]
        elif re.match('上映日期: .*', p):       # release date
            data = p[6:].split(' / ')[0]
        elif re.match('片长: .*', p):           # running time
            time = p[4:].split(' / ')[0]
    ls.clear()
    driver.quit()
    name = soup.find('h1').find('span').text
    score = soup.find('strong', class_='ll rating_num').text
    numOfRaters = soup.find('div', class_='rating_sum').find('span').text
    return {'name': name, 'data': data, 'country': country, 'type': type, 'time': time,
            'score': score, 'numOfRaters': numOfRaters}
def getNameAUrl(html, year):
    # Extract each movie's title and detail-page link and save them to a csv file
    allM = []
    soup = BeautifulSoup(html, 'lxml')
    divs = soup.find_all('a', class_='item')
    for div in divs:
        url = div['href']  # get the link to the detail page
        i = url.find('?')
        url = url[0:i]
        name = div.find('img')['alt']  # get the movie title
        allM.append({'name': name, 'url': url})
    pf = pd.DataFrame(allM, columns=['name', 'url'])
    pf.to_csv("movice_" + year + ".csv", encoding='utf-8', index=False)
def getMovices(year):
    # Read the saved links and fetch the details of each movie, appending them to mm.csv
    allM = []
    data = pd.read_csv('movice_' + year + '.csv', sep=',', header=0, names=['name', 'url'])
    i = 0
    for row in data.itertuples():
        allM.append(getDetails(getattr(row, 'url')))
        i += 1
        if i == 2:  # only fetch the first 2 movies (a small limit left in for testing)
            break
        print('Movie ' + str(i) + ' written successfully')
    pf = pd.DataFrame(allM, columns=['name', 'data', 'country', 'type', 'time', 'score', 'numOfRaters'])
    pf.to_csv("mm.csv", encoding='utf-8', index=False, mode='a')
if __name__ == '__main__':
    # Get the movie links
    # htmll = getMovice('2022')
    # getNameAUrl(htmll, '2022')
    # Get the details of each movie and save them to mm.csv
    # a = getDetails("https://movie.douban.com/subject/33459931")
    # print(a)
    getMovices('2022')
    # getConnect()