| 2022-06-28 13:29
Learn this Python lesson and easily extract information from web pages.
Browsing the web probably takes up a large part of your day, but it's all manual browsing, and that's tiresome, isn't it? You have to open a browser, visit a website, click buttons, move the mouse around... it's quite time-consuming. Wouldn't it be better if you could interact with the Internet through code?
With the help of Python's requests module, you can use Python to get data from the Internet:
import requests
DATA = "https://opensource.com/article/22/5/document-source-code-doxygen-linux"
PAGE = requests.get(DATA)
print(PAGE.text)
In the code example above, you first import the requests module. Next, you create two variables. One is called DATA, and it holds the URL you want to download. In later code, you'll be able to provide a different URL each time you run the application, but for now it's easiest to "hard-code" a test URL for demonstration purposes.
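As a sketch of that later step, here is one way the hard-coded URL could eventually be replaced by a command-line argument. The get_url helper and DEFAULT fallback are invented for illustration and are not part of the original example:

```python
import sys

# The fallback is the article's test URL; the helper itself is a
# hypothetical sketch, not part of the article's code.
DEFAULT = "https://opensource.com/article/22/5/document-source-code-doxygen-linux"

def get_url(argv):
    """Return the first command-line argument, or fall back to DEFAULT."""
    return argv[1] if len(argv) > 1 else DEFAULT

URL = get_url(sys.argv)
```

Run the script with no arguments to use the test URL, or pass any other URL as the first argument.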
The other variable is PAGE. The code reads the URL stored in DATA, passes it as a parameter to the requests.get function, and stores the function's return value in the PAGE variable. The requests module and its .get function "read" an Internet address (a URL), access the Internet, and download whatever is at that address.
Of course, there are many steps involved in that. Fortunately, you don't have to figure them out for yourself; that's exactly why Python modules exist. Finally, you tell Python to print everything in the .text field of the data that requests.get stored in the PAGE variable.
If you run the example code above, you'll get the example URL's entire contents dumped indiscriminately into your terminal. That's because the only thing your code does with the data requests collected is print it. Parsing the text is more interesting.
Python can "read" text with its most basic functions, but parsing text lets you search for patterns, specific words, HTML tags, and so on. You could parse the text requests returns yourself, but it's much easier to use a specialized module. For HTML and XML text, there's the Beautiful Soup library.
The following code accomplishes the same thing, except it uses Beautiful Soup to parse the downloaded text. Because Beautiful Soup recognizes HTML elements, you can use some of its built-in features to make the output a little easier on the eye. For example, at the end of the program, instead of printing the raw text directly, you can run it through Beautiful Soup's .prettify function:
from bs4 import BeautifulSoup
import requests

PAGE = requests.get("https://opensource.com/article/22/5/document-source-code-doxygen-linux")
SOUP = BeautifulSoup(PAGE.text, 'html.parser')

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    # do a thing here
    print(SOUP.prettify())
With this code, every opening HTML tag is output on its own line with appropriate indentation, which helps illustrate the tags' hierarchy. In fact, Beautiful Soup can understand HTML tags in far more useful ways than just printing them out.
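To see what .prettify does without downloading anything, you can feed Beautiful Soup a small inline HTML snippet; the snippet below is invented purely for demonstration:

```python
from bs4 import BeautifulSoup

# An invented snippet, standing in for a downloaded page.
HTML = "<html><body><p>One</p><p>Two</p></body></html>"
SOUP = BeautifulSoup(HTML, 'html.parser')

# .prettify puts each tag on its own line, indented by nesting depth.
print(SOUP.prettify())
```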
You can choose to print a specific tag instead of printing the whole page. For example, try changing the print selector from print(SOUP.prettify()) to:
print(SOUP.p)
This prints just one <p> tag. Specifically, it prints only the first <p> tag it encounters. To print all the <p> tags, you need a loop.
Using Beautiful Soup's find_all function, you can create a for loop that traverses the entire page contained in the SOUP variable. Besides the <p> tag, you'll probably be interested in other tags too, so it's best to build this as a custom function, designated by Python's def keyword (meaning "define").
def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG)
You're free to change the temporary variable TAG to any name you like, such as ITEM or i or whatever you prefer. Each time the loop runs, TAG contains the search results of the find_all function. In this code, it searches for <p> tags.
A function doesn't execute automatically unless you explicitly call it. You can call the function at the end of your code:
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    # do a thing here
    loopit()
Run the code to see all the <p> tags and their contents.
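Since you may want to search for tags other than <p>, a natural refinement is to pass the tag name into the function as a parameter. This is a sketch, with invented sample HTML standing in for a downloaded page:

```python
from bs4 import BeautifulSoup

# Invented sample HTML, standing in for a downloaded page.
HTML = "<main><p>First</p><p>Second</p><img src='pic.png'/></main>"
SOUP = BeautifulSoup(HTML, 'html.parser')

def loopit(tagname):
    # find_all accepts a tag name, so the same loop works for any tag.
    for TAG in SOUP.find_all(tagname):
        print(TAG)

loopit('p')    # prints both <p> tags
loopit('img')  # prints the <img> tag
```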
You can exclude the tags from the printout by specifying just the "string" (programming terminology for "words"):
def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG.string)
Of course, once you have the text of a page, you can parse it further with the standard Python string libraries. For example, you can get a word count using len and split:
def loopit():
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            print(len(TAG.string.split()))
This prints the number of strings in each paragraph element, omitting paragraphs that don't contain any strings. To get a grand total, you need a variable and some basic math:
def loopit():
    NUM = 0
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            NUM = NUM + len(TAG.string.split())
    print("Grand total is ", NUM)
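As a usage note, the same grand total can be written more compactly with Python's built-in sum and a generator expression. The inline HTML below is an invented stand-in for a downloaded page; the mixed-content paragraph has a .string of None and is skipped, just as in the loop above:

```python
from bs4 import BeautifulSoup

# Stand-in HTML: two plain paragraphs plus one with mixed content,
# whose .string is None and is therefore skipped.
HTML = "<p>one two</p><p>three four five</p><p>mixed <b>content</b></p>"
SOUP = BeautifulSoup(HTML, 'html.parser')

total = sum(
    len(TAG.string.split())
    for TAG in SOUP.find_all('p')
    if TAG.string is not None
)
print("Grand total is ", total)
```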
You can extract much more information with Beautiful Soup and Python. Here are some ideas for improving your application:
- Count the number of images (<img> tags).
- Count the number of images (<img> tags) within a certain structure (for example, only images inside the <main> div, or only images after a </p> tag).
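The first two ideas can be sketched in a few lines. The sample HTML here is invented, and SOUP.main is Beautiful Soup's shorthand for the first <main> tag:

```python
from bs4 import BeautifulSoup

# Invented page: one image outside <main>, two inside it.
HTML = (
    "<body><img src='banner.png'/>"
    "<main><img src='one.png'/><img src='two.png'/></main></body>"
)
SOUP = BeautifulSoup(HTML, 'html.parser')

print(len(SOUP.find_all('img')))       # all images on the page
print(len(SOUP.main.find_all('img')))  # only images inside <main>
```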