1. When the user enters a URL and presses Enter, the browser sends an HTTP request to the HTTP server. HTTP requests are mainly divided into two methods, GET and POST.
2. When we type the URL http://www.baidu.com into the browser, it sends a Request to fetch the HTML file of http://www.baidu.com, and the server sends the Response file object back to the browser.
3. The browser analyzes the HTML in the Response and finds that it references many other files, such as image files, CSS files, and JS files. The browser then automatically sends further Requests to fetch those images, CSS files, and JS files.
4. Once all the files have been downloaded successfully, the web page is rendered completely according to the HTML structure, as the sketch below imitates.
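This flow can be imitated in a few lines; a minimal sketch, assuming the `requests` library, that fetches the HTML and lists the extra resources a browser would then request (the regex is a deliberately naive illustration; a real crawler would use an HTML parser):

```python
import re
import requests

# Steps 1-2: send the Request and receive the Response HTML.
response = requests.get("http://www.baidu.com")
html = response.text

# Step 3: find referenced images, CSS and JS files that would each
# trigger a further Request. Naive pattern, for illustration only.
resources = re.findall(r'(?:src|href)="([^"]+\.(?:png|jpg|gif|css|js))"', html)
for url in resources:
    print("would also request:", url)
```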
# URL explained:
URL is the abbreviation of Uniform Resource Locator.
A URL consists of the following parts (a parsing sketch follows the list):
scheme://host:port/path/?query-string=xxx#anchor
●scheme: the protocol used for access, usually http or https, or sometimes ftp, etc.
●host: the host name or domain name, such as www.baidu.com.
●port: the port number. When you visit a website, the browser uses port 80 by default.
●path: the lookup path. For example, in www.jianshu.com/trending/now, the part trending/now is the path.
●query-string: the query string. For example, in www.baidu.com/s?wd=python, the part wd=python is the query string.
●anchor: the anchor. The back end does not need to care about it; the front end uses it to jump to a position within the page.
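As a quick check of these parts, Python's standard urllib.parse can split a URL apart; a minimal sketch (the URL itself is just an illustrative example):

```python
from urllib.parse import urlparse, parse_qs

# Split an example URL into the parts described above.
url = "http://www.baidu.com:80/s?wd=python#results"
parts = urlparse(url)

print(parts.scheme)           # http
print(parts.hostname)         # www.baidu.com
print(parts.port)             # 80
print(parts.path)             # /s
print(parse_qs(parts.query))  # {'wd': ['python']}
print(parts.fragment)         # results  (the anchor)
```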
When a URL is requested in the browser, the browser first encodes this URL. Except for English letters, digits, and a few symbols, everything else is encoded as a percent sign followed by the hexadecimal code value.
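This percent-encoding is available in Python's standard library; a minimal sketch:

```python
from urllib.parse import quote, unquote

# English letters, digits and a few symbols pass through unchanged;
# everything else becomes a percent sign plus hexadecimal escapes.
encoded = quote("wd=python 爬虫")
print(encoded)           # wd%3Dpython%20%E7%88%AC%E8%99%AB
print(unquote(encoded))  # wd=python 爬虫
```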
# Common request header parameters:
In the HTTP protocol, when a request is sent to the server, the data is placed in one of three parts: the first is to put the data in the URL, the second is to put the data in the body (in a POST request), and the third is to put the data in the headers (a sketch covering all three follows the list below). Here are some request header parameters that are often used in web crawlers:
Referer: indicates which URL the current request comes from. This can also be used as an anti-crawler technique: if a request does not come from the specified page, the server refuses to return the expected response.
Cookie: the HTTP protocol is stateless, meaning that when the same person sends two requests, the server has no way of knowing whether they came from the same person. Cookies are therefore used as a mark. Generally, to crawl a website that can only be accessed after logging in, you need to send the cookie information.
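A minimal sketch, assuming the `requests` library; the URL, cookie value and form fields are illustrative placeholders, not a real site. It shows all three placements at once, with Referer and Cookie carried in the headers:

```python
import requests

# Three places to carry data in one request; the URL, cookie value and
# form fields below are placeholders, not a real site or API.
response = requests.post(
    "http://example.com/search",
    params={"wd": "python"},              # 1. data in the URL (?wd=python)
    data={"page": "1"},                   # 2. data in the body (POST only)
    headers={                             # 3. data in the headers
        "Referer": "http://example.com/", # which page the request comes from
        "Cookie": "sessionid=xxxx",       # the login mark described above
    },
)
print(response.status_code)
```

In practice, a crawler usually performs the login request through a requests.Session object, which then stores and resends the Cookie header automatically.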
The HTTP protocol defines eight request methods. Here are the two most common ones, the GET request and the POST request. A GET request is generally used when you only fetch data from the server without affecting its resources; a POST request is used when you send data to the server (for example, logging in) or upload files, i.e. when the request changes server resources.
These two methods are the ones commonly used in website development, and in general they are used according to the principle above. But some websites and servers, in order to implement anti-crawler mechanisms, often do not play by the rules: a request that should use the GET method may have to be sent as a POST request instead. It depends on the situation.
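A minimal sketch of that fallback, assuming the `requests` library and a purely hypothetical endpoint that rejects GET with 405 Method Not Allowed:

```python
import requests

# Hypothetical endpoint that only answers POST even though the
# operation looks like a plain fetch; the URL is a placeholder.
url = "http://example.com/api/list"

response = requests.get(url, params={"page": "1"})
if response.status_code == 405:  # GET rejected: Method Not Allowed
    # Retry the same request as POST, moving the data into the body.
    response = requests.post(url, data={"page": "1"})
print(response.status_code)
```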
# Common response status codes:
200: The request succeeded; the server returned the data normally.
301: Permanent redirect. For example, visiting www.jingdong.com will redirect to www.jd.com.
302: Temporary redirect. For example, when visiting a page that requires login while not yet logged in, you will be redirected to the login page.
400: Bad request. The server could not understand the request; in other words, the request URL is malformed.
403: The server refused access; the permissions are insufficient.
500: Internal server error. It is probably a bug on the server side.
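A minimal sketch, assuming the `requests` library, of how a crawler can observe these codes; since requests follows redirects by default, the 301 mentioned above shows up in response.history:

```python
import requests

# Follow the redirect described above and inspect each status code.
response = requests.get("http://www.jingdong.com")

for hop in response.history:               # each redirect hop, e.g. 301
    print(hop.status_code, hop.url)
print(response.status_code, response.url)  # final page, ideally 200

if response.status_code == 200:
    print("server returned the data normally")
```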