Hello everyone, good to see you again. I'm your friend 全棧君.
We usually use a browser to access web pages. In essence, a client connects over the network to a server. Before the visit we have none of the page's content, so that content must be transferred from the server. A crawler automates this process in code: it fetches data from the server side and parses out the content we want to crawl.
So to get content with a crawler, we first need to analyze the target web page: understand how the data is arranged and how it is transmitted, and from that work out an effective way to crawl it.
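The client/server exchange described above can be sketched with Python's standard library. This is only a minimal illustration: the `build_request` and `fetch_html` names and the User-Agent value are assumptions for the example, and nothing is sent until `fetch_html` is actually called.

```python
from urllib.request import Request, urlopen

# Assumption: a generic desktop User-Agent string chosen for illustration.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url: str) -> Request:
    """Describe the GET request a browser would send for this URL."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://example.com/")` would return the same text that the browser receives before it renders the page; whether a given site answers a scripted request this way depends on the site.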
Take one of my earlier articles on CSDN as an example: https://blog.csdn.net/qq_26292987/article/details/107608315
If we want to get the article content on this page without copying and pasting it ourselves, a crawler is a very effective tool. There are several ways to analyze the page:
(1) Analyze the page source: Right-click on the page and you will see the option "View page source" (I am using the Microsoft Edge browser here; other browsers may differ). A new page opens with content like that shown in the figure:
How do we quickly find what we need in all this code? [1] The simple solution is "Ctrl+F": search for a keyword and see whether it gets a result. Here the goal is to find the article content, so searching directly for "BeautifulSoap用法" brings up the following view
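The same keyword search can be done in code once the page source is held in a string. The sketch below is an illustration only: `find_keyword` is a hypothetical helper, and `sample` is a tiny made-up stand-in for the real page source. Counting the hits also shows whether a keyword is specific enough to pin down one place in the source.

```python
def find_keyword(html: str, keyword: str, width: int = 30) -> list[str]:
    """Return a snippet of surrounding text for every occurrence of keyword."""
    snippets = []
    start = html.find(keyword)
    while start != -1:
        left = max(0, start - width)
        right = min(len(html), start + len(keyword) + width)
        snippets.append(html[left:right])
        # Continue searching after the current match.
        start = html.find(keyword, start + len(keyword))
    return snippets


# Assumption: a made-up page, not the actual CSDN source.
sample = (
    "<html><head><title>My post</title></head>"
    "<body><h1>My post</h1><p>BeautifulSoap用法 starts here.</p></body></html>"
)

title_hits = find_keyword(sample, "My post")
body_hits = find_keyword(sample, "BeautifulSoap用法")
print(len(title_hits), len(body_hits))  # prints: 2 1
```

A keyword with a single hit leads straight to the content; one with several hits needs more inspection.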
So why search for "BeautifulSoap用法" rather than the title or something else? A quick look at the source shows that the title appears several times, which makes it hard to locate the article. The same goes for the first paragraph (the problem with the first paragraph is mainly down to how I edited it).
[2] Advanced method
For this method you first need some understanding of how a site's source code is organized, that is, you need a little knowledge of HTML: https://www.runoob.com/html/html-tutorial.html
After a little study (ten thousand years later), we know that HTML is a markup language with a rigorous set of tags that determine the function of each part. More importantly:
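HTML tags are keywords surrounded by angle brackets, for example <html>.
HTML tags usually come in pairs, for example <b> and </b>.
The first tag in a pair is the start tag and the second is the end tag; start and end tags are also called opening and closing tags.
<!DOCTYPE html> declares the document as HTML5.
The <html> element is the root element of an HTML page.
The <head> element contains the document's metadata; for example, <meta charset="utf-8"> sets the page encoding to utf-8.
The <title> element describes the document's title.
The <body> element contains the visible page content.
The <h1> element defines a top-level heading.
The <p> element defines a paragraph.
The <a> element defines a link.
The <img> element defines an image.
The <div> element defines a block-level container and <span> an inline one.
The <script> element defines a script (functions to run).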
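To see how these tags give a page its structure, here is a small sketch that uses Python's built-in `html.parser` to collect the text found inside each kind of tag. `SAMPLE` is a made-up document and `TextByTag` is a hypothetical helper written for this example; it is not how any particular crawler library works internally.

```python
from html.parser import HTMLParser

# Tags that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"meta", "img", "br", "hr", "link", "input"}

# Assumption: a tiny stand-in document, not the real CSDN page.
SAMPLE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Demo page</title></head>
<body>
<h1>Big heading</h1>
<p>First paragraph with a <a href="https://example.com">link</a>.</p>
<script>console.log("not visible text");</script>
</body>
</html>"""


class TextByTag(HTMLParser):
    """Collect the text found directly inside each kind of tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags we are currently inside
        self.text = {}       # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching start tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if self.open_tags and data.strip():
            self.text.setdefault(self.open_tags[-1], []).append(data.strip())


parser = TextByTag()
parser.feed(SAMPLE)
parser.close()
print(parser.text["title"])  # ['Demo page']
print(parser.text["h1"])     # ['Big heading']
print(parser.text["a"])      # ['link']
```

The visible text sits under tags such as `h1`, `p`, and `a`, while the `script` entry holds code rather than readable content.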
With this knowledge, we know that what we need must be somewhere inside body. But where exactly? (To be honest, "Ctrl+F" is still more convenient; this knowledge matters more for the next step, extracting the needed content from the page.) Here I personally recommend Sublime Text as a temporary reader. Once its "HTML/CSS/JS Prettify" plugin is configured, a single click turns code with no indentation or alignment into this:
After that, code folding makes it much easier to get to what we need.
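The idea behind such a prettifier can be sketched in a few lines: track how deeply nested the current tag is and indent by that depth. This is a simplified illustration with hypothetical names (`Outline`, `outline`); it drops attributes and is not how the "HTML/CSS/JS Prettify" plugin is implemented.

```python
from html.parser import HTMLParser

# Tags that never have an end tag, so they do not open a new nesting level.
VOID_TAGS = {"meta", "img", "br", "hr", "link", "input"}


class Outline(HTMLParser):
    """Rebuild the document as one indented line per tag or text fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)
            self.lines.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.lines.append("  " * self.depth + data.strip())


def outline(html: str) -> str:
    parser = Outline()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


# Assumption: a made-up one-line page standing in for unindented source.
flat = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
print(outline(flat))
```

With the nesting made visible, it is easy to see that the `<p>` text lives inside `<body>`, which is what folding in an editor lets you do by hand.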
(2) Analyze the page elements: Press "F12" on the page and something magical happens: an interesting change appears on the right side of the page (at the bottom in some browsers):
The menu bar at the top has options such as "Elements, Console, Sources, Network, Performance, Memory". For now we only need to pay attention to two of them, "Elements" and "Network". The Elements panel shows all the elements on the page, their content, and how they are arranged. If you modify the content there, the page changes accordingly (don't worry, the content stored on the server has not changed). Move the mouse over an entry and the corresponding content on the left is highlighted, which makes it much easier to find where an element lives.
(3) Analyze the network transmission: If you still remember the earlier part, all content on the page is transmitted from the server to our browser. In other words, every element on the page is the result of a transfer from the server. The Network panel shows this process:
Faced with such a mess, I fell apart inside (what is all this?). No problem: as the all-knowing internet tells us, when something comes up, don't panic, take out your phone and snap a photo first (just kidding).
First, look at the type. The figure clearly shows that a large share of the transfers are images such as png and gif. So what are these images? Usually we can click the name on the left and see in the preview pane that they are mainly icons, advertisements, and other page elements that need pictures. Next is script, which we have already seen: it is script code used by the page, so these ".js" files are most likely the scripts the site runs. After that comes text. As the type suggests, this is a text file, and similar to it are the transfers of type document. The content of many of these files is directly related to what we need.
Then we can analyze the file names. True, the file names look like a mess. But if they really were random, with no pattern at all, how would the site's operations staff quickly find and fix a problem in that mess? So trust that anything designed follows rules and there is always a way in; by comparing several pages we can always find what we need.
Finally, we can look at the size. In many cases, the transfer that carries the most important information on the page cannot be very small, so transfers of only a few B can be ignored during analysis. Judging the importance of a transfer by its size is a very convenient and necessary approach.
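The three checks above (type, name, size) can be sketched as a simple filter. Everything in the sample data is made up for illustration: the `Transfer` records, the names, the sizes, and the 100-byte threshold are assumptions, not values read from the real page.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str  # value of the "Name" column
    kind: str  # value of the "Type" column, e.g. "png", "script", "document"
    size: int  # transferred size in bytes


# Assumption: invented rows that mimic what the Network panel lists.
TRANSFERS = [
    Transfer("logo.png", "png", 4_200),
    Transfer("ad_banner.gif", "gif", 35_000),
    Transfer("common.js", "script", 88_000),
    Transfer("107608315", "document", 61_000),
    Transfer("getComments", "text", 9_500),
    Transfer("ping", "text", 35),
]

# Images and scripts rarely carry the article text itself.
SKIPPED_KINDS = {"png", "gif", "jpeg", "script"}
MIN_SIZE = 100  # bytes; hypothetical cut-off for "only a few B"


def candidates(transfers, min_size=MIN_SIZE):
    """Keep text-like transfers that are not tiny, largest first."""
    kept = [
        t for t in transfers
        if t.kind not in SKIPPED_KINDS and t.size >= min_size
    ]
    return sorted(kept, key=lambda t: t.size, reverse=True)


for t in candidates(TRANSFERS):
    print(t.name, t.kind, t.size)
```

With this sample data only the `document` and the larger `text` transfer remain, which are the ones worth opening and comparing across pages.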
So, for beginners learning to write crawlers, making good use of these three methods is enough to analyze many web pages, find where the valuable information is hidden, and crawl it to your own computer. However, if the crawler is the sharpest spear for grabbing information from the web, then websites, in order to keep their information from being crawled at will (information is money, after all) and also to make the site more convenient to design and use, apply many techniques to optimize themselves, which brings many difficulties to our crawling. Common ones include asynchronous data loading, dynamic page loading, login verification, and limits on the number of accesses per IP. Some of these can be solved through page analysis; most require us to make different adjustments for different sites. Next, let's do a small project first and get a first taste of the joy of fetching data with a crawler!
Publisher: 全棧程序員棧長. Please credit the source when reposting: https://javaforall.cn/125155.html Original link: https://javaforall.cn