Catalog
Concept
1. What is a crawler
2. How a web page is loaded
3. What a URL means
4. Setting up the environment
Installation
Introduction
Basic requests
Basic GET requests
Basic POST requests
Cookies
Timeout configuration
Session objects
SSL certificate verification
Proxies
Practical example
Complete code
A crawler, or web crawler, can be thought of as a spider crawling across the Internet. The Internet is like a big web, and the crawler is a spider crawling around on it; whenever it encounters a resource, it grabs it. What it grabs is up to you to control. For example, while crawling one web page it may find a path in the web, which is actually a hyperlink to another page, and it can follow that link to the next page and fetch its data. In this way, the whole connected web is within the spider's reach, and crawling it all is only a matter of time.
While browsing the web we see many nice pictures, for example on Baidu Images we see a set of pictures together with the Baidu search box. What actually happens is this: the user enters a web address, a DNS server resolves it and locates the server host, the browser sends a request to that server, the server processes it and sends HTML, JS, CSS and other files back to the user's browser, and the browser parses them so that the user sees all kinds of pictures. So the page the user sees is essentially made up of HTML code, and that is what a crawler fetches; by analyzing and filtering that HTML code, it can extract pictures, text and other resources.
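As a tiny illustration of what "analyzing and filtering HTML" can look like, here is a minimal sketch; the HTML fragment is made up for the example and stands in for a downloaded page:

import re

# a made-up fragment of HTML, standing in for a downloaded page
html = '<div><img src="http://example.com/a.jpg"><img src="http://example.com/b.jpg"></div>'
# pull out every image address with a regular expression
images = re.findall(r'<img src="(.*?)"', html)
print(images)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']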
A URL, the uniform resource locator, is what we usually call a web address. It is a concise representation of the location of a resource available on the Internet and of how to access it, and it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information about where the file is and how the browser should handle it.
A URL consists of three parts: ① the protocol (or service mode); ② the IP address of the host that stores the resource (sometimes including a port number); ③ the specific address of the resource on that host, such as the directory and file name.
A crawler must have a target URL before it can fetch any data, so the URL is the basic starting point for a crawler, and understanding it precisely is very helpful when learning to write crawlers.
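To make the three parts concrete, here is a small sketch using Python's standard urllib.parse module to split a URL into its components; the URL itself is just an example chosen for illustration:

from urllib.parse import urlparse

# an example URL, used only for illustration
url = 'http://httpbin.org:80/get/index.html'
parts = urlparse(url)
print(parts.scheme)   # protocol (service mode): 'http'
print(parts.netloc)   # host and optional port: 'httpbin.org:80'
print(parts.path)     # address of the resource on the host: '/get/index.html'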
To study Python you of course need a working environment. At first I used Notepad++, but I found its code hints too weak, so on Windows I use PyCharm and on Linux I use Eclipse for Python.
Install with pip:

$ pip install requests
Or install with easy_install:

$ easy_install requests
Either of these two methods completes the installation.
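To confirm the installation worked, you can import the library and print its version, a quick sanity check:

$ python -c "import requests; print(requests.__version__)"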
First, let's look at a small example to get a feel for it:
import requests

r = requests.get('http://cuiqingcai.com')
print(type(r))
print(r.status_code)
print(r.encoding)
# print(r.text)
print(r.cookies)
The code above requests the given URL and then prints the type of the return value, the status code, the encoding, the cookies and so on. The output is as follows:
<class 'requests.models.Response'>
200
UTF-8
<RequestsCookieJar[]>
Pretty convenient, isn't it? Don't worry, it gets even more convenient later on.
The requests library provides all the basic HTTP request methods. For example:
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
Yes, each of these is a one-liner.
The most basic GET request can be sent directly with the get method:
r = requests.get("http://httpbin.org/get")
If you want to add query parameters, use the params argument:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)
Running results
http://httpbin.org/get?key2=value2&key1=value1
If you request a JSON file, you can parse it with the json() method. For example, write a JSON file named a.json with the following content:
["foo", "bar", { "foo": "bar" }]
Use the following program to request and parse it. Note that requests needs an HTTP URL rather than a local path, so here we assume the file is being served locally, for example with python -m http.server:
import requests

# assumes a.json is served over HTTP, e.g. by running `python -m http.server` in its directory
r = requests.get("http://localhost:8000/a.json")
print(r.text)
print(r.json())
The output is shown below: the first line prints the content directly, the second is the result of parsing with the json() method. Note the difference between the two:
["foo", "bar", { "foo": "bar" }]
['foo', 'bar', {'foo': 'bar'}]
If you want the raw socket response from the server, you can access r.raw. You must set stream=True in the initial request:
>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
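A related, commonly used pattern, shown here as a sketch under the assumption that you want to save a large response to disk without loading it all into memory, is to combine stream=True with iter_content:

import requests

r = requests.get('https://github.com/timeline.json', stream=True)
# write the body to disk in 1 KB chunks instead of holding it all in memory
with open('timeline.json', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)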
This gives you access to the raw socket content of the page. If you want to add headers, pass the headers argument:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
headers = {'content-type': 'application/json'}
r = requests.get("http://httpbin.org/get", params=payload, headers=headers)
print(r.url)
The headers argument adds the given fields to the request headers.
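Crawlers often need to send a browser-like User-Agent header, since some sites treat the default python-requests one differently. A minimal sketch; the User-Agent string below is just an example:

import requests

# a browser-like User-Agent string; any real browser UA works here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get('http://httpbin.org/headers', headers=headers)
print(r.json()['headers']['User-Agent'])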
For a POST request we usually need to send some parameters. The most basic way to pass them is through the data argument:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)
Running results
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1"
  },
  "json": null,
  "url": "http://httpbin.org/post"
}
You can see the parameters were passed successfully, and the server returned the data we sent back to us. Sometimes the information we need to send is not a form; we need to send JSON-format data instead, so we can serialize the form data with json.dumps():
import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
print(r.text)
Running results
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
{ "args": {}, "data": "{\"some\": \"data\"}", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "16", "Host": "httpbin.org", "User-Agent": "python-requests/2.9.1" }, "json": { "some": "data" }, "url": "http://httpbin.org/post" }
With the methods above we can POST JSON-formatted data. If you want to upload a file, just use the files argument. Create a file named test.txt containing Hello World!:
import requests

url = 'http://httpbin.org/post'
files = {'file': open('test.txt', 'rb')}
r = requests.post(url, files=files)
print(r.text)
The output looks like this:
{
  "args": {},
  "data": "",
  "files": {
    "file": "Hello World!"
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "156",
    "Content-Type": "multipart/form-data; boundary=7d8eb5ff99a04c11bb3e862ce78d7000",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1"
  },
  "json": null,
  "url": "http://httpbin.org/post"
}
That completes a file upload. requests also supports streaming uploads, which allows you to send large streams or files without reading them into memory first. To use streaming upload, simply provide a file-like object as the request body:
with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

This is a very practical and convenient feature.
If a response contains cookies, we can read them from the cookies attribute:
import requests

url = 'http://example.com'
r = requests.get(url)
print(r.cookies)
print(r.cookies['example_cookie_name'])
The program above is only an example of reading cookies from a response with the cookies attribute. You can also use the cookies argument to send cookies to the server:
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
print(r.text)
Running results
'{"cookies": {"cookies_are": "working"}}'
The cookies were successfully sent to the server.
You can use the timeout argument to configure the maximum time to wait for a response:
requests.get('http://github.com', timeout=0.001)
Note: timeout is not a limit on the total time it takes to download the response body; it only limits how long requests waits for the server to start responding (more precisely, how long it waits without receiving any data). Even if the returned response is large and takes a long time to download, that alone will not trigger the timeout.
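A sketch of how you might handle this in practice: pass a (connect, read) tuple and catch the Timeout exception. The URL and values here are only examples:

import requests

try:
    # 3 seconds to establish the connection, 10 seconds of allowed silence while reading
    r = requests.get('http://httpbin.org/delay/5', timeout=(3, 10))
    print(r.status_code)
except requests.exceptions.Timeout:
    print('the request timed out')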
In the requests above, each call is a brand-new request, as if every one of them were made by a different browser. In other words they do not share a session, even when the same URL is requested. For example:
import requests

requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = requests.get("http://httpbin.org/cookies")
print(r.text)
The result is
{
  "cookies": {}
}
Clearly these requests are not in the same session, so the cookie cannot be retrieved. On some sites we need to keep a persistent session; it is like browsing Taobao in one browser and jumping between tabs, which all share a single long-lived session. The solution is as follows:
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
Here we made two requests: one to set the cookie and one to read it. The output:
{
  "cookies": {
    "sessioncookie": "123456789"
  }
}
This time the cookie is retrieved successfully; that is how a session works. Give it a try. Since the Session object is shared across requests, we can also use it for global configuration:
import requests

s = requests.Session()
s.headers.update({'x-test': 'true'})
r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
print(r.text)
s.headers.update sets a header on the session. We then also pass a headers argument in the individual request, so what happens? Simple: both headers are sent. The output:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1",
    "X-Test": "true",
    "X-Test2": "true"
  }
}
What if the headers passed to the get method also contain x-test?
r = s.get('http://httpbin.org/headers', headers={'x-test': 'true'})
It overrides the global configuration:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1",
    "X-Test": "true"
  }
}
And if you don't want one of the globally configured headers for a particular request? Simple: set it to None:
r = s.get('http://httpbin.org/headers', headers={'x-test': None})
Running results
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.9.1"
  }
}
That covers the basic usage of Session objects.
Nowadays HTTPS sites are everywhere. Requests can verify SSL certificates for HTTPS requests, just like a web browser does. To check a host's SSL certificate, use the verify argument. At the time this was written, 12306's certificate was not trusted, so let's test it:
import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=True)
print(r.text)
result
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
That is what happens. Now let's try GitHub:
import requests

r = requests.get('https://github.com', verify=True)
print(r.text)
This one requests normally, so I won't show the output. If we want to skip the 12306 certificate verification from before, set verify to False:
import requests

r = requests.get('https://kyfw.12306.cn/otn/', verify=False)
print(r.text)
Now the request goes through. By default verify is True, so when you need to skip verification you have to set it yourself.
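Instead of disabling verification entirely, verify can also be pointed at a CA bundle file. A sketch; the bundle path below is just a placeholder:

import requests

# path to a CA bundle file (placeholder path, adjust for your system)
r = requests.get('https://kyfw.12306.cn/otn/', verify='/path/to/ca-bundle.crt')
print(r.status_code)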
If you need to use a proxy, you can configure individual requests with the proxies argument:
import requests

proxies = {
    "https": "http://41.118.132.69:4433"
}
r = requests.post("http://httpbin.org/post", proxies=proxies)
print(r.text)
You can also configure proxies through the HTTP_PROXY and HTTPS_PROXY environment variables:
export HTTP_PROXY="http://10.10.1.10:3128"
export HTTPS_PROXY="http://10.10.1.10:1080"
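If your proxy requires HTTP Basic authentication, you can embed the credentials in the proxy URL. A sketch with placeholder credentials and addresses:

import requests

# placeholder username, password and proxy addresses
proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:1080",
}
r = requests.get("http://httpbin.org/ip", proxies=proxies)
print(r.text)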
Finally, here is the complete code of a practical example: it crawls movie information from Douban with Selenium and BeautifulSoup, saves it to CSV files, and can load the results into MySQL.

import csv
import pymysql
import time
import pandas as pd
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
# Connect to the database and save the csv file into MySQL
def getConnect():
    # Connect to the MySQL database (note: the charset parameter is utf8, not utf-8)
    conn = pymysql.connect(host='localhost', port=3307, user='root', password='liziyi123456',
                           db='db_douban', charset='utf8')
    # Create a cursor object
    cursor = conn.cursor()
    # Read the csv file
    with open('mm.csv', 'r', encoding='utf-8') as f:
        read = csv.reader(f)
        # Insert row by row, skipping the header line
        for each in list(read)[1:]:
            i = tuple(each)
            # Build the SQL insert statement (movices is the table name)
            sql = "INSERT INTO movices VALUES" + str(i)
            cursor.execute(sql)  # execute the SQL statement
            conn.commit()        # commit the data
    cursor.close()  # close the cursor
    conn.close()    # close the database connection
def getMovice(year):
    # Open the Douban movie listing for the given year and click "load more" repeatedly
    server = Service('chromedriver.exe')
    driver = webdriver.Chrome(service=server)
    driver.implicitly_wait(60)
    driver.get("https://movie.douban.com/tag/#/?sort=S&range=0,10&tags=" + year + ",%E7%94%B5%E5%BD%B1")
    driver.maximize_window()
    actions = ActionChains(driver)
    actions.scroll(0, 0, 0, 600).perform()
    for i in range(50):  # loop 50 times to crawl 1000 movies
        btn = driver.find_element(by=By.CLASS_NAME, value="more")
        time.sleep(3)
        actions.move_to_element(btn).click().perform()
        actions.scroll(0, 0, 0, 600).perform()
    html = driver.page_source
    driver.close()
    return html
def getDetails(url):  # Get the details of a single movie
    option = webdriver.ChromeOptions()
    option.add_argument('headless')
    driver = webdriver.Chrome(options=option)
    driver.get(url=url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div', id='info')
    # spans = div.find_all('span')
    ls = div.text.split('\n')
    # print(ls)
    data = None
    country = None
    type = None
    time = None
    # The labels below are the Chinese field names used on the Douban detail page
    for p in ls:
        if re.match('类型: .*', p):            # genre
            type = p[4:].split(' / ')[0]
        elif re.match('制片国家/地区: .*', p):  # country/region of production
            country = p[9:].split(' / ')[0]
        elif re.match('上映日期: .*', p):       # release date
            data = p[6:].split(' / ')[0]
        elif re.match('片长: .*', p):           # running time
            time = p[4:].split(' / ')[0]
    ls.clear()
    driver.quit()
    name = soup.find('h1').find('span').text
    score = soup.find('strong', class_='ll rating_num').text
    numOfRaters = soup.find('div', class_='rating_sum').find('span').text
    return {'name': name, 'data': data, 'country': country, 'type': type, 'time': time,
            'score': score, 'numOfRaters': numOfRaters}
def getNameAUrl(html, year):
    # Extract each movie's title and detail-page link and save them to a csv file
    allM = []
    soup = BeautifulSoup(html, 'lxml')
    divs = soup.find_all('a', class_='item')
    for div in divs:
        url = div['href']  # get the link to the detail page
        i = url.find('?')
        url = url[0:i]
        name = div.find('img')['alt']  # get the movie title
        allM.append({'name': name, 'url': url})
    pf = pd.DataFrame(allM, columns=['name', 'url'])
    pf.to_csv("movice_" + year + ".csv", encoding='utf-8', index=False)
def getMovices(year):
    # Read the saved links and fetch the details of each movie, appending them to mm.csv
    allM = []
    data = pd.read_csv('movice_' + year + '.csv', sep=',', header=0, names=['name', 'url'])
    i = 0
    for row in data.itertuples():
        allM.append(getDetails(getattr(row, 'url')))
        i += 1
        if i == 2:  # only fetch the first 2 movies (a small limit left in for testing)
            break
        print('Movie ' + str(i) + ' written successfully')
    pf = pd.DataFrame(allM, columns=['name', 'data', 'country', 'type', 'time', 'score', 'numOfRaters'])
    pf.to_csv("mm.csv", encoding='utf-8', index=False, mode='a')
if __name__ == '__main__':
    # Get the movie links
    # htmll = getMovice('2022')
    # getNameAUrl(htmll, '2022')
    # Get the details of each movie and save them to mm.csv
    # a = getDetails("https://movie.douban.com/subject/33459931")
    # print(a)
    getMovices('2022')
    # getConnect()