
Get started with Python web crawlers in 10 minutes [a must for beginners]

Editor: Python

Comments are welcome. Let's learn from each other; I follow back and am online all day.
I have summarized my recent lessons on Python crawlers into a 10-minute introduction. The article follows.

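Contents

1. A brief look at the Python crawler workflow
- 1.1 Fetching the web page
- 1.2 Parsing the web page (extracting data)
- 1.3 Storing the data
2. Techniques for implementing the three steps
- 2.1 Techniques for fetching web pages
- 2.2 Techniques for parsing web pages
- 2.3 Techniques for storing data
3. Writing simple web crawler examples
- 3.1 Tools used
- 3.2 Example source code 1
  - 3.2.1 Crawl the Baidu HTML page and save it
  - 3.2.2 Result
- 3.3 Example source code 2
  - 3.3.1 Crawl the Baidu logo image and save it
  - 3.3.2 Result
4. Python crawler summary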

1. A brief look at the Python crawler workflow

The workflow of a web crawler is simple and can be divided into three main parts:

1.1 Fetching the web page

Fetching the web page means sending a request to a URL; the server at that URL returns the page's data. It is like typing a URL into your browser and pressing Enter, after which you see the whole page.

1.2 Parsing the web page (extracting data)

Parsing the web page means extracting the data you need from the page's full content. For example, if you want the price of a product shown on a page, the price is the data to extract.
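As a minimal sketch of this extraction step, the snippet below pulls prices out of a small HTML fragment with a regular expression. The markup, the class names, and the variable names are hypothetical; a real page will need its own pattern.

```python
import re

# Hypothetical page fragment; real pages will use different markup.
html = """
<div class="product">
  <span class="name">Keyboard</span>
  <span class="price">$199.00</span>
</div>
<div class="product">
  <span class="name">Mouse</span>
  <span class="price">$59.50</span>
</div>
"""

# Capture the number that follows the dollar sign inside each price span.
price_pattern = re.compile(r'<span class="price">\$([\d.]+)</span>')
prices = [float(p) for p in price_pattern.findall(html)]

print(prices)
```

Regular expressions work for small, regular fragments like this one; for deeply nested or irregular markup a proper HTML parser is usually the more robust choice.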

1.3 Storing the data

Storing the data means saving what you extracted. It can be written to a CSV file or stored in a database.
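The article does not name a specific database, so the sketch below assumes SQLite through Python's built-in sqlite3 module as one possible choice. The table name, the column names, and the rows are made up for illustration.

```python
import sqlite3

# Hypothetical rows, standing in for data extracted from a page.
rows = [("Keyboard", 199.0), ("Mouse", 59.5)]

# ":memory:" keeps the database in RAM; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products (name, price) VALUES (?, ?)", rows)
conn.commit()

# Read the rows back, cheapest first, to confirm they were stored.
stored = conn.execute("SELECT name, price FROM products ORDER BY price").fetchall()
print(stored)
conn.close()
```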

2. Techniques for implementing the three steps

2.1 Techniques for fetching web pages

Basic tools for fetching web pages: requests, urllib, and selenium.
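Of these three, urllib ships with Python, so it can be tried without installing anything. The sketch below is one way to fetch a page with it. The helper name fetch_text is hypothetical, and a data: URL stands in for a real site so the example also runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Send a GET request to url and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding)


# A data: URL embeds the "page" in the URL itself, so this runs offline.
# Replace it with an http(s) address to fetch a real site.
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello, crawler</h1>")
page = fetch_text(demo_url)
print(page)
```

The requests library used later in this article wraps the same request and response cycle in a shorter API.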

2.2 Techniques for parsing web pages

Basic tools for parsing web pages: re (regular expressions), BeautifulSoup, and lxml.
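BeautifulSoup and lxml are third-party packages, so the sketch below swaps in the standard library's html.parser module to show the general idea of walking tags instead of matching text. It is a stand-in, not the BeautifulSoup or lxml API, and the class name LinkCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical markup used only for illustration.
html = '<p><a href="/news">News</a> and <a href="/maps">Maps</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```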

2.3 Techniques for storing data

Basic ways to store data: writing to a txt file and writing to a CSV file.
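Both file formats can be written with the standard library alone. The sketch below saves the same hypothetical rows once as tab-separated text and once as CSV; the file names and the temporary directory are assumptions made to keep the example self-contained.

```python
import csv
import os
import tempfile

# Hypothetical rows, standing in for data a crawler has extracted.
rows = [["name", "price"], ["Keyboard", "199.00"], ["Mouse", "59.50"]]

# A temporary directory keeps the sketch from cluttering the working directory.
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "products.txt")
csv_path = os.path.join(out_dir, "products.csv")

# Plain text: one tab-separated record per line.
with open(txt_path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")

# CSV: newline="" lets the csv module control line endings itself.
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the CSV back to confirm that it round-trips.
with open(csv_path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))
print(read_back)
```

The csv module takes care of quoting fields that contain commas or quotes, which hand-written text output would not.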

3. Writing simple web crawler examples

3.1 Tools used

PyCharm Community Edition 2022.1.4
Python 3.10
requests
After installing Python, open cmd and install requests with this command:

pip install requests

3.2 Example source code 1

3.2.1 Crawl the Baidu HTML page and save it


import requests

url = "http://www.baidu.com"
response = requests.get(url)
response.encoding = "utf-8"  # decode the response body as UTF-8
print("Type of response: " + str(type(response)))
print("Status code: " + str(response.status_code))
print("Headers: " + str(response.headers))
print("Response body:")
print(response.text)
# Save the page. Mode "w" creates the file if it does not exist; text mode
# is enough here because the page is saved as text, not binary data ("wb").
with open("baidu.html", "w", encoding="utf-8") as file:
    file.write(response.text)
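The response.encoding line in the example above matters because response.text is produced by decoding the raw bytes with that encoding. The standalone sketch below uses no network and an illustrative string to show what happens when UTF-8 bytes are decoded with the wrong charset, here ISO-8859-1, which requests falls back to for text responses whose headers declare no charset.

```python
# Illustrative text, standing in for a page that contains Chinese characters.
original = "百度一下"

# The bytes a server would send if the page is encoded as UTF-8.
raw_bytes = original.encode("utf-8")

# Decoding with the wrong charset yields garbled text instead of an error.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the page was written in restores the text.
correct = raw_bytes.decode("utf-8")

print(garbled == original, correct == original)
```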

3.2.2 Result

There is one problem: the saved page opens without the Baidu logo.
That is fine; we can crawl the logo as well. Inspecting the page source reveals the address of the Baidu logo.

Copy the URL of the Baidu logo and use it to fetch the image.

3.3 Example source code 2

3.3.1 Crawl the Baidu logo image and save it


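import requests  # import the HTTP library first; its functions are unavailable otherwise

response = requests.get("https://www.baidu.com/img/bd_logo1.png")  # GET the image
# Open a local file in binary mode ("wb"), since an image is raw bytes.
with open("bd_logo1.png", "wb") as file:
    file.write(response.content)  # write the image bytes; the file closes automatically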
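The example above hard-codes the local file name and saves whatever the server returns. As an optional refinement, the sketch below derives the file name from the URL and checks that the downloaded bytes start with the PNG file signature before saving. The helper names filename_from_url and looks_like_png are hypothetical, and the byte strings are stand-ins for a real response.content.

```python
import posixpath
from urllib.parse import urlparse

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def filename_from_url(url):
    """Return the last path segment of url, e.g. 'bd_logo1.png'."""
    return posixpath.basename(urlparse(url).path)


def looks_like_png(data):
    """Return True when data starts with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


name = filename_from_url("https://www.baidu.com/img/bd_logo1.png")
print(name)
print(looks_like_png(PNG_SIGNATURE + b"rest of image"))
print(looks_like_png(b"<html>error page</html>"))
```

In the crawler itself, looks_like_png(response.content) could guard the write, so that an HTML error page is not saved under a .png name.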