Web crawlers
- Prerequisite knowledge:
- URL
- HTTP protocol
- Web front end: HTML, CSS, JS
- Ajax
- Regular expressions (re), XPath
- XML
Definition of a crawler
- See Baidu for a detailed introduction
- Three steps:
- Download the page
- Extract the relevant information
- Following certain rules, jump to another web page and repeat the two steps above
- Crawler classification
- General-purpose crawler
- Focused (special-purpose) crawler
- Introduction to Python network packages
- 2.x ----
- 3.x ---- urllib, urllib3, httplib2, requests
urllib
- Modules it contains
- urllib.request: opens and reads URLs
- urllib.error: contains the exceptions raised by urllib.request; catch them with try
- urllib.parse: methods for parsing URLs
- urllib.robotparser: parses robots.txt files
- Case study V1
- Solving web page encoding problems
- chardet can automatically detect the encoding of a web page, but it may guess wrong
- It needs to be installed: conda install chardet
- Case study V2
- Return value of urlopen
- geturl: returns the URL of the resource that was retrieved
- info: returns the meta information (headers) of the response
- getcode: returns the HTTP status code of the response
- request.code
- Two ways to access the network
- get: the parameters are passed to the server in the URL; build them as a dict and encode them with urllib.parse
- post: the usual way of passing parameters to the server
- post sends the parameters in the request body instead of the URL; this hides them from the address bar but does not encrypt them
- To use post, pass the information through the data parameter
- Using post means the HTTP request headers may need to change:
- Content-Type: application/x-www-form-urlencoded
- Content-Length: length of the data
- In other words, once the request method changes, make sure the other request headers still match
- urllib.parse.urlencode converts the dict above into a string in the format the protocol expects
- Case study V4
- Case study V4
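The GET and POST notes above can be sketched as follows. This is only a minimal illustration: the endpoint and the parameter names are made up, and nothing is actually sent over the network.

```python
from urllib import parse, request

# Hypothetical endpoint and parameters, used only for illustration.
base_url = "http://www.example.com/search"
params = {"wd": "crawler", "page": 1}

# GET: encode the dict and append it to the URL as a query string.
query = parse.urlencode(params)
get_url = base_url + "?" + query

# POST: the encoded string must be converted to bytes and passed as data.
body = query.encode("utf-8")
post_req = request.Request(base_url, data=body)
post_req.add_header("Content-Type", "application/x-www-form-urlencoded")

print(get_url)                # http://www.example.com/search?wd=crawler&page=1
print(post_req.get_method())  # POST, because data was supplied
```

Passing get_url or post_req to request.urlopen would then perform the actual request.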
- When more request information has to be set, plain urlopen is not convenient enough
- Use the request.Request class instead
- Case study V6
- urllib.error
- URLError is raised when:
- There is no network connection
- The connection to the server failed
- The specified server cannot be found
- It is a subclass of OSError
- Case study V7
- HTTPError: a subclass of URLError
- Case study V8
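Because HTTPError is a subclass of URLError, it has to be caught first. A minimal sketch, assuming a helper named fetch (not part of the original case studies); the demo call uses a file:// URL that does not exist so that it fails locally without needing a network.

```python
from urllib import error, request


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=5) as rsp:
            return rsp.read()
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTPError:", e.code, e.reason)
    except error.URLError as e:
        # No network, unknown host, refused connection, and so on.
        print("URLError:", e.reason)
    return None


# A missing local file raises URLError, which fetch() turns into None.
result = fetch("file:///no/such/file/for-this-demo")
print(result)
```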
- UserAgent
- UserAgent: the user agent, abbreviated UA, is part of the headers; the server uses the UA to determine the identity of the visitor
- The UA can be set with
- headers
- add_header
- Case study V9
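Both ways of setting the UA can be sketched like this. The URL and the UA string are placeholders chosen for the example, and the request is only built, not sent.

```python
from urllib import request

url = "http://www.example.com"          # hypothetical target
ua = "Mozilla/5.0 (X11; Linux x86_64)"  # sample UA string, not from the notes

# Way 1: pass a headers dict when the Request is created.
req1 = request.Request(url, headers={"User-Agent": ua})

# Way 2: add the header to an existing Request.
req2 = request.Request(url)
req2.add_header("User-Agent", ua)

# Request stores header names capitalized, hence "User-agent" here.
print(req1.get_header("User-agent"))
print(req2.get_header("User-agent"))
```

Calling request.urlopen(req1) would send the request with this UA.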
- ProxyHandler (proxy server)
- Using proxy IPs is a common crawler technique
- Where to get proxy server addresses:
- www.xicidaili.com
- www.goubanjian.com
- A proxy hides the real visitor; proxies also do not allow frequent access to one fixed website, so many proxies are needed
- Basic steps for using a proxy:
- Set the proxy address
- Create a ProxyHandler
- Create an opener
- Install the opener
- Case study V10
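The four proxy steps map onto urllib roughly as below. The proxy address is a placeholder, not a working proxy, so the final request is left as a comment.

```python
from urllib import request

# Step 1: set the proxy address (placeholder value, replace with a real proxy).
proxy = {"http": "127.0.0.1:8888"}

# Step 2: create a ProxyHandler.
proxy_handler = request.ProxyHandler(proxy)

# Step 3: create an opener that uses the handler.
opener = request.build_opener(proxy_handler)

# Step 4: install the opener so that request.urlopen uses it from now on.
request.install_opener(opener)

# With a working proxy, this call would now go through it:
# rsp = request.urlopen("http://www.example.com")
```

If the opener should not be installed globally, opener.open(url) can be called directly instead.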
cookie & session
- Because the HTTP protocol is stateless, cookies and sessions were adopted as a supplementary mechanism to make up for it
- A cookie is the half of the information issued to the user; a session is the other half, stored on the server, used to record information
- Differences between cookie and session:
- They are stored in different places
- Cookies are not secure
- A session is kept on the server for a while and then expires
- A single cookie holds no more than 4 KB, and many browsers limit a site to at most 20 cookies
- Where sessions are stored
- They are stored on the server
- In general, sessions are stored in a database
- Logging in with cookies
- Simulated login to Renren
- V11
- Logging in with a cookie
- Copy the cookie directly and put it into the request headers by hand
- V12
- The http package contains a cookie module (http.cookiejar) that handles cookies automatically
- CookieJar
- Manages and stores cookies, and adds cookies to outgoing HTTP requests
- Cookies are stored in memory; they disappear once the CookieJar instance is garbage-collected
- FileCookieJar
- Manages cookies using a file
- filename is the file in which the cookies are saved
- MozillaCookieJar
- Creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt
- LWPCookieJar
- Creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 format
- Their relationship: CookieJar --> FileCookieJar --> MozillaCookieJar & LWPCookieJar
- Using CookieJar to access Renren
- Case study 13
- Logging in automatically with cookies
- Open the login page and log in automatically with the account and password
- Automatically extract the cookie that comes back
- Use the extracted cookie to access the private page
- A handler is an instance of a Handler class
- Commonly used:
- Create a cookie instance
- cookie = cookiejar.CookieJar()
- Create the cookie handler
- cookie_handler = request.HTTPCookieProcessor(cookie)
- Create the HTTP request handler
- http_handler = request.HTTPHandler()
- Create the HTTPS request handler
- https_handler = request.HTTPSHandler()
- Create the opener
- opener = request.build_opener(http_handler, https_handler, cookie_handler)
- After the handlers are created, open URLs with the opener; the corresponding handlers are then used automatically
- The cookie can be printed like any other variable
- Case study V14
- Cookie attributes
- name: the name
- value: the value
- domain: the domain that can access this cookie
- path: the page path that can access this cookie
- expires: the expiry time
- size: the size
- http field
- Saving cookies: FileCookieJar
- Reading cookies
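Saving and reading cookies with a FileCookieJar subclass can be sketched as below. To keep the example offline, the cookie is built by hand with a made-up name, value, and domain; in a real crawl the server sends it and HTTPCookieProcessor stores it in the jar. The helper make_cookie and the file name are assumptions for this sketch.

```python
import os
import tempfile
from http import cookiejar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server provides it)."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save: ignore_discard keeps session cookies that would otherwise be skipped.
jar = cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie("sessionid", "abc123", "www.example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Read: load the file into a fresh jar and inspect the cookie attributes.
loaded = cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
for c in loaded:
    print(c.name, c.value, c.domain, c.path, c.expires)
```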
SSL
- An SSL certificate is a server digital certificate that complies with the SSL (Secure Sockets Layer) protocol
- A CA (Certificate Authority) is the digital certificate authority
- How to handle SSL certificates that are not trusted
- Case study V17
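One possible way to deal with an untrusted certificate in urllib is to pass an SSL context that skips verification. This is a sketch, not necessarily what case study V17 does; the helper name fetch_unverified is an assumption, and skipping verification removes the protection the certificate provides, so it only suits sites you already trust.

```python
import ssl
from urllib import request

# Build a context that skips certificate checks. check_hostname must be
# switched off before verify_mode can be set to CERT_NONE.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


def fetch_unverified(url):
    """Open an https URL without verifying the server certificate."""
    with request.urlopen(url, context=ctx, timeout=10) as rsp:
        return rsp.read()
```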
JS encryption
http://tool.oschina.net
- Some anti-crawler strategies use JS to encrypt the transmitted data, usually as an MD5 value
- What is sent is ciphertext, but the encryption function or process must run in the browser, which means the JS code is exposed to the user
- By reading the encryption algorithm you can simulate the encryption process and so get past it
- Case study V18
- Compare V18 and V19
- Remember: the JS must be saved locally first, then searched for the encryption algorithm
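Once the signing routine has been found in the site's JS, it can usually be reproduced with hashlib. The sketch below assumes a made-up scheme (MD5 over secret + keyword + salt); the secret, the field names, and the order of concatenation are all hypothetical and must be read from the real JS.

```python
import hashlib
import time


def make_sign(keyword, salt, secret="hypothetical-secret"):
    """Reproduce an assumed JS routine: md5(secret + keyword + salt)."""
    raw = secret + keyword + salt
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Many sites use a millisecond timestamp as the salt; this is an assumption.
salt = str(int(time.time() * 1000))
form = {"q": "hello", "salt": salt, "sign": make_sign("hello", salt)}
print(form)
```

The resulting dict would then be URL-encoded and posted like any other form data.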
AJAX
- In essence it is a piece of JS code that lets the web page make asynchronous requests
- There is always a URL and a request method
- The data is generally in JSON format
- Case study 20
- GET generally sends the data as URL parameters
- POST uses a form, which also makes encryption easier
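Since an AJAX endpoint usually answers with JSON, the response body can be turned into Python objects with the json module. The payload below is invented for illustration; a real one would come from rsp.read().decode().

```python
import json

# Made-up response body standing in for what an AJAX endpoint might return.
body = '{"total": 2, "subjects": [{"title": "A", "rate": "8.5"}, {"title": "B", "rate": "9.0"}]}'

data = json.loads(body)  # JSON object -> dict, JSON array -> list
titles = [item["title"] for item in data["subjects"]]

print(data["total"])
print(titles)
```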
Requests module: HTTP for Humans
- Offers all the features of urllib
- Built on urllib3 underneath
- Open source
- Has a Chinese-language documentation site
- Install: pip install requests
- get request:
- requests.get(url)
- requests.request("get", url)
- Can take headers and params arguments
- Case study 21
- The return value of get
- post
- rsp = requests.post(url, data=data)
- Case study 23
- data and headers are required to be of type dict
- proxy
- proxy = {
"http": "address",
"https": "address"
}
rsp = requests.request("get", "http::…", proxies=proxy)
- User authentication
- Proxy authentication
- A proxy may require HTTP Basic Auth, which can be handled like this
- The format is username:password@proxy-address:port
- proxy = {"http": "china:[email protected]:8888"}
- res = requests.get("http://www.baidu.com", proxies=proxy)
- Web client authentication
- If authentication is required, add auth=(username, password)
- auth = ("username", "password")
- res = requests.get("http://www.baidu.com", auth=auth)
- cookie
- requests handles cookie information automatically
- rsp = requests.get(url)
- If the server sends back cookie information, it is available through the response's cookies attribute, which returns a cookie jar instance
- cookieJar = rsp.cookies
- The cookie jar can be converted to a dict
- cookiedict = requests.utils.dict_from_cookiejar(cookieJar)
- session
- Not the same thing as the session on the server
- It simulates one conversation, from the client browser connecting to the server until the client disconnects
- It keeps certain parameters across requests, for example the cookies shared by all requests sent from the same session instance
- Once a session instance is created, it can keep the cookie values
- ss = requests.session()
- headers = {"User-Agent": "XXXXXXx"}
- data = {"name": "XXXXXXx"}
- From here on the created session manages the requests and is responsible for sending them
- ss.post("http://www.baidu.com", data=data, headers=headers)
- rsp = ss.get("XXXXXX")
- HTTPS: verifying the SSL certificate
- The verify parameter indicates whether the SSL certificate should be verified; the default is True
- If certificate verification is not needed, set it to False
- rsp = requests.get("https:", verify=False)
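How a requests session carries headers and cookies across requests can be seen without touching the network by preparing a request instead of sending it. This is a sketch that assumes the requests package is installed; the URL, UA string, and cookie value are placeholders.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "demo-crawler/0.1"})  # sample UA
session.cookies.set("sessionid", "abc123")                  # made-up cookie

# Build the request, then let the session merge in its headers and cookies.
req = requests.Request("GET", "http://www.example.com/profile", params={"page": 2})
prepared = session.prepare_request(req)

print(prepared.url)                    # params are encoded into the URL
print(prepared.headers["User-Agent"])  # taken from the session
print(prepared.headers.get("Cookie"))  # taken from the session's cookie jar
```

Calling session.send(prepared), or simply session.get(url, params=...), would perform the real request with the same headers and cookies.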
Processing crawler data
- Structured data: structure first, data second
- JSON files
- JSON Path
- Convert to the corresponding Python types and operate on them (json module)
- XML
- Convert to Python types (xmltodict)
- XPath
- CSS selectors
- Regular expressions
- Unstructured data: data first, structure second
- Text
- Phone numbers
- Email addresses
- Regular expressions are usually used to process this kind of data
- HTML files
- Regular expressions
- XPath
- CSS selectors
Regular expressions
- A set of rules for searching, replacing, and otherwise processing string text
- Case study 24: basic rules of regular expressions
- Case study: basic use of match
- Common methods:
- match: matches from the start of the string, only once
- search: searches from any position, returns one match
- findall: finds all matches and returns a list
- finditer: finds all matches and returns an iterator
- split: splits the string and returns a list
- sub: replaces
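The six methods can be compared side by side on one string. The sample text and pattern are chosen for this illustration only.

```python
import re

text = "one1two22three333"
pattern = re.compile(r"\d+")

print(pattern.match(text))           # None: the string does not start with a digit
print(pattern.search(text).group())  # 1
print(pattern.findall(text))         # ['1', '22', '333']
print([m.span() for m in pattern.finditer(text)])  # [(3, 4), (7, 9), (14, 17)]
print(pattern.split(text))           # ['one', 'two', 'three', '']
print(pattern.sub("#", text))        # one#two#three#
```

The trailing empty string from split appears because the text ends with a match.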
- Matching Chinese
- The main Unicode range to match is [\u4e00-\u9fa5]
- Case study V27
- Greedy vs. non-greedy
- Greedy mode: provided the whole expression still matches, match as much as possible
- Non-greedy mode: match as little as possible
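Both points can be checked quickly; the sample strings are invented for the example.

```python
import re

# Matching Chinese: runs of characters in the range \u4e00-\u9fa5.
text = "hello 你好 world 世界"
chinese = re.findall(r"[\u4e00-\u9fa5]+", text)
print(chinese)  # ['你好', '世界']

# Greedy vs. non-greedy on the same input.
html = "<div>first</div><div>second</div>"
greedy = re.findall(r"<div>.*</div>", html)   # .* grabs as much as possible
lazy = re.findall(r"<div>.*?</div>", html)    # .*? stops at the first </div>
print(greedy)  # ['<div>first</div><div>second</div>']
print(lazy)    # ['<div>first</div>', '<div>second</div>']
```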
XML
- XML (eXtensible Markup Language)
- http://www.w3cschool
- Case study V28
- Concepts: parent node, child node, ancestor node, sibling node, descendant node
XPath
- XPath (XML Path Language)
- w3school
- Common path expressions
lxml library
- Case study 29
- Parsing HTML
- Reading HTML from a file
- Using etree together with xpath
- Case study V31
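Path expressions can be tried without installing anything by using the standard library's xml.etree.ElementTree, which is swapped in here for lxml and supports only a subset of XPath. With lxml the same idea is written with etree and the xpath method. The bookstore document is a made-up sample.

```python
import xml.etree.ElementTree as ET

doc = """<bookstore>
  <book category="web"><title lang="en">Learning XML</title><price>39.95</price></book>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
</bookstore>"""

root = ET.fromstring(doc)

# Child path: every title directly under a book under the root.
titles = [t.text for t in root.findall("./book/title")]

# Attribute predicate: books whose category attribute equals "web".
web_books = root.findall(".//book[@category='web']")

# First match only: find() returns a single element or None.
first_price = root.find("./book/price").text

print(titles)
print(len(web_books), first_price)
```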
CSS selectors: BeautifulSoup4
Comparison of several tools
- Regular expressions: fast, not easy to use, no installation needed
- BeautifulSoup: slow, easy to use
- lxml: fairly fast
- A BeautifulSoup example
- Case study V32
BeautifulSoup
- Four kinds of objects
- Tag
- NavigableString
- BeautifulSoup
- Comment
- Tag
- Corresponds to an HTML tag
- Accessed through soup.tag_name
- Two important attributes of a tag: name and attrs
- Case study V33
- NavigableString
- Corresponds to the text content of a tag
- BeautifulSoup
- Represents the content of a whole document
- Comment
- A special kind of NavigableString
- When it is output, the content does not include the comment markers
- Traversing objects
- contents: the list of a tag's child nodes
- children: the child nodes, returned as an iterator
- descendants: all descendant nodes
- string
- Case study 34
- Searching the document
- find_all(name, attrs, recursive, text, **kwargs)
- name: search by tag name; it can be
- a string
- a regular expression
- a list
- keyword arguments: filter by attribute
- text: matches the text value of a tag
- CSS selectors
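The objects, traversal attributes, and search methods above can be tried on a small document. This sketch assumes the bs4 package is installed and uses the built-in html.parser; the HTML is a made-up sample.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment, Tag

html = """<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello <b>crawler</b><!-- hidden note --></p>
<a href="http://www.example.com/a" id="link1">A</a>
<a href="http://www.example.com/b" id="link2">B</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Tag: soup.tag_name returns the first matching tag; name and attrs describe it.
print(soup.a.name, soup.a.attrs)
# NavigableString: the text inside a tag.
print(soup.title.string)

# Traversal: children also yields whitespace strings, so keep only Tag objects.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# find_all with a string, a regular expression, a list, and a keyword argument.
links = soup.find_all("a")
b_tags = soup.find_all(re.compile("^b"))  # tag names starting with "b"
mixed = soup.find_all(["title", "b"])
link2 = soup.find_all(id="link2")

# Comment objects hold the comment text without the <!-- --> markers.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# CSS selector: p elements with class "intro".
intro = soup.select("p.intro")
print(child_tags, len(links), comments, intro)
```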