Python crawler data parsing (BeautifulSoup)

Editor: Python

BeautifulSoup is another data-parsing tool commonly used in Python crawlers. Using it involves two main steps.

1. Instantiate a BeautifulSoup object and load the page source into it.

2. Call the BeautifulSoup object's attributes and methods to locate tags and extract data.
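The two steps can be sketched end to end as follows. This is a minimal illustration, assuming a small inline HTML string (hypothetical, not the page used later in this article) in place of real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for downloaded page source
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <a class="nav-login" href="/login">Sign in</a>
  </body>
</html>
"""

# Step 1: instantiate a BeautifulSoup object and load the page source into it
soup = BeautifulSoup(html, "html.parser")

# Step 2: locate a tag, then extract its text and an attribute
link = soup.find("a")
link_text = link.get_text(strip=True)
link_href = link["href"]
print(link_text, link_href)
```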

How do you instantiate a BeautifulSoup object?

First install the bs4 library (pip install beautifulsoup4) and import the BeautifulSoup class from it. Then load either the source of a local HTML file or the source of a live web page into a BeautifulSoup object.

from bs4 import BeautifulSoup
# Load a local HTML file into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")

html.parser is the HTML parser from Python's standard library. BeautifulSoup can also parse XML if lxml is installed (pass "xml" as the parser name), and different document types call for different parsers. It does not parse JSON; use Python's json module for that.

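import requests
from bs4 import BeautifulSoup
# Placeholder values (not from the original article): substitute the page you want to crawl
url = 'https://example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}
# Load the web page source into the object
html = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")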
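When loading a live page, it can help to fail early on network or HTTP errors rather than parse an error page. The sketch below is one way to wrap the request; the helper name fetch_soup and the 10-second timeout are assumptions made for illustration, not part of the original article.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and return it parsed with html.parser.

    fetch_soup is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing them
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

It would be called as fetch_soup('https://example.com/', headers={'User-Agent': 'Mozilla/5.0'}), where the URL is a placeholder.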

Once the object is instantiated, the next step is to call its attributes and methods.

tagName

tagName stands for the name of an HTML tag, for example title, div, a, or p. soup.tagName returns the first tag with that name in the document.

from bs4 import BeautifulSoup
# Load a local HTML file into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.title)
# >> <title>
# Douban Movie Top 250
# </title>
# To get only the content inside the tag, add .string, .text, or .get_text()
# .string returns the tag's text only if the tag has a single child; otherwise it is None
# .text / .get_text() return all the text inside the tag, nested tags included
print(soup.title.string)
# >> Douban Movie Top 250
# To get the tag's attributes, add .attrs
# (the file has already been read once, so reuse soup instead of parsing fp again)
print(soup.a.attrs)
# Returned as a dictionary, so individual attributes are easy to read
# >> {'href': 'https://accounts.douban.com/passport/login?source=movie', 'class': ['nav-login'], 'rel': ['nofollow']}
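The difference between .string and .get_text() is easiest to see on a tag that mixes text with a nested tag. The snippet below is a small sketch using a made-up paragraph (hypothetical markup, not from the Douban page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph containing text plus a nested <b> tag
html = '<p class="intro lead" id="first">Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                  # first <p> in the document
direct = p.string           # None, because <p> has two children
bold = p.b.string           # 'world', because <b> has a single string child
full = p.get_text()         # 'Hello world', text of <p> and everything inside it
attrs = p.attrs             # dict of attributes; class comes back as a list
missing = p.get("href")     # None, Tag.get() avoids a KeyError for absent attributes
print(direct, bold, full, attrs, missing)
```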

find()

find() returns the first element in the document that matches the given tag name or attributes.

from bs4 import BeautifulSoup
# Load a local HTML file into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.find('a')) # passing a tag name to find() is equivalent to soup.tagName
# >> <a class="nav-login" href="https://accounts.douban.com/passport/login?source=movie" rel="nofollow"> Sign in / register </a>
# The tag name can be followed by attribute filters such as class_ or id
print(soup.find('a',class_="nav-login"))
# >> <a class="nav-login" href="https://accounts.douban.com/passport/login?source=movie" rel="nofollow"> Sign in / register </a>
# You can also filter by attribute alone, without a tag name
print(soup.find(class_="title"))
# >> <span class="title"> Shawshank redemption </span>
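One detail worth keeping in mind is that find() returns None when nothing matches, so chaining .get_text() onto the result can raise AttributeError. The sketch below shows a guarded lookup on hypothetical markup; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real Douban page
html = """
<div id="nav"><a class="nav-login" href="/login">Sign in</a></div>
<span class="title">First</span>
<span class="title">Second</span>
"""
soup = BeautifulSoup(html, "html.parser")

first_title = soup.find("span", class_="title")      # first match only
by_id = soup.find(id="nav")                          # filter by id, no tag name
by_attrs = soup.find("a", attrs={"href": "/login"})  # arbitrary attributes via attrs
missing = soup.find("table")                         # no <table> present, so None

# Guard against None before reading text
title_text = first_title.get_text() if first_title is not None else ""
missing_text = missing.get_text() if missing is not None else ""
print(title_text, by_id.name, by_attrs["class"], missing)
```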

find_all()

find_all() is used the same way as find(), but it returns every matching tag in a list.

from bs4 import BeautifulSoup
# Load a local HTML file into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.find_all('span',class_="title"))
# >> [<span class="title"> Shawshank redemption </span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title"> Farewell my concubine </span>, <span class="title"> Forrest gump </span>......]
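Because find_all() returns a list of tags rather than plain text, a common follow-up is to loop over the result and pull out text or attributes. The example below is a sketch on hypothetical markup loosely modelled on a movie list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a movie list; not the real page
html = """
<div class="info">
  <a href="/m/1"><span class="title"> Movie A </span><span class="other"> / Alt A</span></a>
</div>
<div class="info">
  <a href="/m/2"><span class="title"> Movie B </span></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <span class="title">, in document order
title_spans = soup.find_all("span", class_="title")
# strip=True trims the whitespace around each piece of text
titles = [span.get_text(strip=True) for span in title_spans]
# Attributes are read from each tag with dictionary-style access
links = [a["href"] for a in soup.find_all("a")]
# limit caps the number of results returned
first_only = soup.find_all("span", class_="title", limit=1)
# No match gives an empty list, not None
no_match = soup.find_all("table")
print(titles, links)
```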

select()

select() finds elements with CSS selectors: by tag name, class name, attribute, id, or nested (child and descendant) tags. It returns a list.

from bs4 import BeautifulSoup
# Load a local HTML file into the object
with open('./douban.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.select('title')) # tag name
# >> [<title>
# Douban Movie Top 250
# </title>]
print(soup.select('.title')) # class name: prefix with .
# >> [<span class="title"> Shawshank redemption </span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title"> Farewell my concubine </span>, <span class="title"> Forrest gump </span>......]
print(soup.select('#inp-query')) # id: prefix with #
# >> [<input id="inp-query" maxlength="60" name="search_text" placeholder=" Search for movies, TV series, variety shows, filmmakers " size="22" value=""/>]
print(soup.select('span[class="title"]')) # attribute: write class here, without the trailing _
# >> [<span class="title"> Shawshank redemption </span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title"> Farewell my concubine </span>, <span class="title"> Forrest gump </span>......]
print(soup.select('.info > div > a > span')) # child tags: each '>' steps down exactly one level
# >> [<span class="title"> Shawshank redemption </span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title"> Farewell my concubine </span>, <span class="title"> Forrest gump </span>......]
print(soup.select('.info span')) # descendant tags: a space spans any number of levels
# >> [<span class="title"> Shawshank redemption </span>, <span class="title"> / The Shawshank Redemption</span>, <span class="title"> Farewell my concubine </span>, <span class="title"> Forrest gump </span>......]
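The child combinator ('>') and the descendant combinator (a space) can return different results on the same markup. The sketch below illustrates this on a hypothetical fragment and also uses select_one(), which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; class names echo the examples above but the markup is made up
html = """
<div class="info">
  <div class="hd">
    <a href="/m/1"><span class="title">Movie A</span></a>
  </div>
  <p><span class="quote">Quote A</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '>' requires each step to be a direct child
child_path = soup.select(".info > div > a > span")
# A space matches spans nested at any depth under .info
descendants = soup.select(".info span")
# No span is a direct child of .info, so this list is empty
direct_spans = soup.select(".info > span")
# select_one() returns a single tag, or None when nothing matches
first_title = soup.select_one(".title")
nothing = soup.select_one("#inp-query")

child_texts = [s.get_text() for s in child_path]
descendant_texts = [s.get_text() for s in descendants]
print(child_texts, descendant_texts, direct_spans, first_title, nothing)
```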

These are the most commonly used BeautifulSoup methods. BeautifulSoup offers many others; if you are interested, see its documentation.