Simply put, a crawler is an automated program that extracts information from web pages and saves it.
What a crawler does:
Fetch the web page: The crawler first has to obtain the page, that is, the page source code that the later steps will analyze. In Python this is done with libraries such as urllib and requests.
Parse the page and extract the target information: After obtaining the page source, the crawler parses it and extracts the target information.
Save the data: The extracted target information is saved for later use.
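The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production crawler: the function names (`fetch`, `parse_links`, `save_links`), the choice of links as the "target information", and the sample HTML are all assumptions made for the example. The demo at the bottom runs on an inline HTML string so that it works without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Step 1: download the page source (requires network access)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 2: collect (href, text) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html: str):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save_links(links, out):
    """Step 3: write the extracted rows as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["href", "text"])
    writer.writerows(links)


# Demo on an inline page (hypothetical content) instead of a live URL.
SAMPLE_HTML = '<html><body><a href="/a">First</a> <a href="/b">Second</a></body></html>'
links = parse_links(SAMPLE_HTML)
buffer = io.StringIO()
save_links(links, buffer)
print(buffer.getvalue())
```

To crawl a real page, replace `SAMPLE_HTML` with `fetch(url)` and pass an opened file instead of the in-memory buffer.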
When we visit a website, we may need to enter a login name and password. After we close the site and later open it again, we often do not need to enter the login information (login name, password, and so on) a second time. This is the result of Session and Cookie working together.
Let's first introduce some prerequisite concepts:
- Static web pages and dynamic web pages:
What is a static web page? In website design, a page written in pure HTML is usually called a "static web page". Another definition: static web pages are defined relative to dynamic web pages, and are pages with no backend database, no server-side programs, and no interactivity.
Advantages and disadvantages of static web pages:
Advantages: fast to load and simple to write.
Disadvantages: poor maintainability; the displayed content cannot change flexibly based on the URL.
- What is a dynamic web page? It is a web programming technique that contrasts with static web pages. The main difference is that it allows data to be exchanged between the user and the server backend.
- Advantages and disadvantages of dynamic web pages:
Advantages: more flexible and more capable. (It can parse changes in the URL parameters dynamically and present different content accordingly.)
Disadvantages: ① slower to access; ② less readily indexed by search engines.
Note: Whether a page is "dynamic" or "static" does not depend on whether the displayed content moves (carousels, scrolling text, and so on), but on whether the page can exchange data with a backend database.
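The difference can be illustrated with a small sketch. Everything here is hypothetical: the `USERS` dictionary stands in for a backend database, and the `id` query parameter and function names are chosen only for the example. The static handler returns the same HTML for every URL, while the dynamic handler parses the URL parameters and looks up data before building the page.

```python
from urllib.parse import parse_qs, urlparse

# Stand-in for a backend database (assumption for illustration).
USERS = {"1": "Alice", "2": "Bob"}

STATIC_PAGE = "<h1>Welcome</h1>"


def render_static(url: str) -> str:
    """A static page: the URL parameters have no effect on the content."""
    return STATIC_PAGE


def render_dynamic(url: str) -> str:
    """A dynamic page: content depends on URL parameters and backend data."""
    params = parse_qs(urlparse(url).query)
    user_id = params.get("id", [""])[0]
    name = USERS.get(user_id, "guest")
    return f"<h1>Welcome, {name}</h1>"


print(render_static("http://example.com/user?id=1"))
print(render_dynamic("http://example.com/user?id=1"))
print(render_dynamic("http://example.com/user?id=2"))
```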
Stateless HTTP:
HTTP being stateless means that the HTTP protocol keeps no memory of previous transactions. In other words, the server does not know the state of the client.
For example: we log in to a website, so our state is "logged in". Because HTTP is stateless, when we request the website again the server does not know whether we are logged in, so we would have to include our login information in every request, which causes the same information to be sent repeatedly.
Therefore, techniques for maintaining HTTP connection state appeared, namely Session and Cookie.
A Session originally refers to a series of actions and messages from beginning to end. For example, in a phone call, the whole process from picking up the phone and dialing to hanging up can be regarded as one Session.
A Session object stores the properties and configuration information required for a user's Session. In effect, the Session object saves the state of the current session.
The Session is stored on the server. When a user sends a request to the server and no corresponding Session object exists, the server creates a new one.
Cookie, sometimes used in the plural, Cookies. It is a "small text file": data (often encrypted) that some websites store on the user's local device in order to identify the user and track the Session. The client computer stores this information temporarily or permanently.
When the user first requests the server, the server returns a response containing a Set-Cookie header field to the client; this field is used to mark the user. The client saves the Cookie and, the next time it sends a request to this server, puts the saved Cookie in the request header and sends it along.
When the server responds to a client's first request, it creates a corresponding Session. The client's Cookie stores the ID of that Session. By parsing the Cookie sent by the client, the server can locate the corresponding Session and thus obtain the client's state.
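The exchange described above can be simulated in memory, without a real network connection. The following is only a sketch under stated assumptions: the cookie name `session_id`, the `handle_request` function, and the `visits` counter are hypothetical and chosen for illustration. Request and response headers are modeled as plain dictionaries.

```python
import uuid
from http.cookies import SimpleCookie

# Server side: Session ID -> state kept for that client.
SESSIONS = {}


def handle_request(request_headers):
    """Locate the Session from the Cookie header, or create a new one."""
    cookie = SimpleCookie(request_headers.get("Cookie", ""))
    morsel = cookie.get("session_id")
    session_id = morsel.value if morsel is not None else None

    response_headers = {}
    if session_id not in SESSIONS:
        # First request (or unknown ID): create a Session and mark the client.
        session_id = uuid.uuid4().hex
        SESSIONS[session_id] = {"visits": 0}
        response_headers["Set-Cookie"] = f"session_id={session_id}; Path=/"

    session = SESSIONS[session_id]
    session["visits"] += 1
    return response_headers, session


# Client side: the first request carries no Cookie.
first_headers, _ = handle_request({})

# The client saves the Cookie from Set-Cookie ...
saved = SimpleCookie(first_headers["Set-Cookie"])
sid = saved["session_id"].value

# ... and sends it back in the Cookie request header next time.
second_headers, session = handle_request({"Cookie": f"session_id={sid}"})
print(session["visits"])
```

On the second request the server issues no new Set-Cookie, because the Session ID in the Cookie header already identifies an existing Session.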
Cookie attribute structure:
Taking the Google Chrome browser as an example, open a web page (for example, Zhihu) and press F12 to enter developer mode. In the Application panel, the Cookies subitem under Storage on the left contains the details of each Cookie.
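- Name: the name of the Cookie. It cannot be changed after creation.
- Value: the value of the Cookie.
- Domain: the domain names that can access the Cookie. For example, if it is set to .zhihu.com, every domain name ending in zhihu.com can access it.
- Path: the path under which the Cookie is used.
- Max-Age: how long the Cookie remains valid, in seconds. Some server-side APIs (for example, Java's Cookie.setMaxAge) treat a negative value as "discard when the browser closes". In the Set-Cookie header itself, a zero or negative Max-Age expires the Cookie immediately, and a Cookie with neither Max-Age nor Expires lasts only until the browser is closed.
- Size: the size of the Cookie.
- HTTP: omitted here.
- Secure: omitted here.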
The lifetime of a Cookie is determined by its Max-Age/Expires field.
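As a rough illustration of how these attributes appear in a Set-Cookie header, the sketch below builds one with the standard library's `http.cookies` module. The cookie name, value, and domain are made-up examples, and the `secure` and `httponly` flags are included only to show where such flags go in the header.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session_id"] = "abc123"      # Name and Value (hypothetical)

morsel = cookie["session_id"]
morsel["domain"] = ".example.com"    # Domain: hosts allowed to receive it
morsel["path"] = "/"                 # Path: URL path the Cookie applies to
morsel["max-age"] = 3600             # Max-Age: lifetime in seconds
morsel["secure"] = True              # flag attribute, no value in the header
morsel["httponly"] = True            # flag attribute, no value in the header

# The text that would follow "Set-Cookie: " in the response.
header_value = morsel.OutputString()
print(header_value)
```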
When the user closes the browser, the corresponding Session on the server does not disappear immediately. Only after the lifetime configured on the server has elapsed is the Session deleted by the server, to free storage space.
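One possible way a server could implement this expiry is to record the last access time of each Session and periodically purge the stale ones. The sketch below assumes exactly that; the `SessionStore` class, its method names, and the 30-minute default are hypothetical, and real frameworks may work differently. The `now` parameter exists only so the demo can use fixed timestamps.

```python
import time

SESSION_TTL = 30 * 60  # assumed lifetime in seconds


class SessionStore:
    def __init__(self, ttl=SESSION_TTL):
        self.ttl = ttl
        self._sessions = {}  # Session ID -> (last access time, data)

    def touch(self, session_id, now=None):
        """Create the Session if needed and refresh its last access time."""
        now = time.time() if now is None else now
        _, data = self._sessions.get(session_id, (now, {}))
        self._sessions[session_id] = (now, data)
        return data

    def purge_expired(self, now=None):
        """Delete Sessions idle for longer than the TTL; return their IDs."""
        now = time.time() if now is None else now
        expired = [sid for sid, (last, _) in self._sessions.items()
                   if now - last > self.ttl]
        for sid in expired:
            del self._sessions[sid]
        return expired

    def __contains__(self, session_id):
        return session_id in self._sessions


# Demo with fixed timestamps and a 60-second lifetime.
store = SessionStore(ttl=60)
store.touch("a", now=0)     # last seen at t=0
store.touch("b", now=50)    # last seen at t=50
removed = store.purge_expired(now=100)
print(removed)
```

Closing the browser plays no part here: Session "a" is removed only because it has been idle longer than the configured lifetime.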