
Python crawler: crawl movie leaderboard data and write to CSV file

Editor: Python

1. Import the required modules

To complete this task, we need requests to fetch the pages of the target website, re to extract the required data (for example the movie name, score, and number of raters) with regular expressions, prettytable to format the output, and the csv module to write the data to a CSV file. The imports are as follows:

import requests
import re
import prettytable as pd
import csv

2. Clarify the rules of link paging

A single requests call fetches only one page of the leaderboard, and each page lists 25 movies. To get the full Top 250 we therefore need 10 pages. To fetch every page automatically in a loop and write it to the file, we first need to find the paging rule. Clicking through the page links makes the rule easy to see: the URLs differ only in the start parameter, which grows by 25 per page (start=0 for the first page, start=25 for the second, and so on up to start=225).

Therefore, a for loop can request each page in turn:

for i in range(10):
    # Page i starts at offset 25 * i
    url = 'https://movie.douban.com/top250?start='+str(25*i)+'&filter='
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Mobile Safari/537.36'
    }
    response = requests.get(url, headers=headers)
3. Write regular expressions to intercept the required information

We need the movie title, release year, rating, and number of raters. On the leaderboard page, press F12 or right-click and choose Inspect to open the developer tools, select the element-picker icon, and then click the piece of information you want to crawl. This shows where the information we need sits within the large amount of surrounding markup.

Once we have located the information, we can write a regular expression that follows the nesting of the surrounding tags to extract it. The regular expression code is as follows:

# Named groups capture the title, year, score, and rater count; re.S lets . match newlines
p = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?<br>(?P<year>.*?)&nbsp;'
               r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
               r'.*?<span>(?P<num>.*?)</span>', re.S)
for it in p.finditer(response.text):
    print(it.group('name'))
    print(it.group('year'))
    print(it.group('score'))
    print(it.group('num'))
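To see what each named group captures without touching the network, the same pattern can be run against a small HTML fragment. This is only a sketch: SAMPLE_HTML is made-up, simplified markup that imitates the structure the pattern expects, and the titles and numbers in it are placeholders rather than real leaderboard data.

```python
import re

# Hypothetical, simplified markup; not copied from the real page
SAMPLE_HTML = """
<ol>
<li>
  <span class="title">Example Movie A</span>
  <p>Director: Someone<br>
     1994&nbsp;/&nbsp;Country&nbsp;/&nbsp;Genre</p>
  <span class="rating_num" property="v:average">9.0</span>
  <span property="v:best" content="10.0"></span>
  <span>12345 ratings</span>
</li>
<li>
  <span class="title">Example Movie B</span>
  <p>Director: Someone Else<br>
     2001&nbsp;/&nbsp;Country&nbsp;/&nbsp;Genre</p>
  <span class="rating_num" property="v:average">8.5</span>
  <span property="v:best" content="10.0"></span>
  <span>678 ratings</span>
</li>
</ol>
"""

p = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?<br>(?P<year>.*?)&nbsp;'
               r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
               r'.*?<span>(?P<num>.*?)</span>', re.S)

# The year group includes the newline and indentation after <br>, so strip it
rows = [(m.group('name'), m.group('year').strip(), m.group('score'), m.group('num'))
        for m in p.finditer(SAMPLE_HTML)]
for row in rows:
    print(row)
```

Note that the last group relies on a bare <span> tag: the <span property="v:best" ...> element is skipped because the pattern requires the literal text <span> with no attributes.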

Note that regular expressions are case-sensitive by default. To ignore case, pass re.I. The second parameter of compile() takes optional flag modifiers that control the matching mode. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags. The modifiers and their descriptions are shown in the following table:

Modifier    Description
re.I        Makes matching case-insensitive
re.L        Performs locale-aware matching
re.M        Multi-line matching; affects ^ and $
re.S        Makes . match every character, including line breaks
re.U        Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
re.X        Allows a more flexible pattern format so that regular expressions are easier to write
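The effect of combining flags can be sketched with a short example. The sample string and patterns below are invented for illustration and have nothing to do with the leaderboard page.

```python
import re

text = "Title: ALPHA\ntitle: beta"

# re.I only: case is ignored, but ^ still matches only at the start of the string
only_i = re.findall(r'^title: (\w+)', text, re.I)

# re.I | re.M: case is ignored and ^ also matches at the start of every line
i_and_m = re.findall(r'^title: (\w+)', text, re.I | re.M)

# Without re.S the dot does not match the line break between the two lines
without_s = re.search(r'ALPHA.title', text)

# With re.S the dot matches the line break as well
with_s = re.search(r'ALPHA.title', text, re.S)

print(only_i)             # ['ALPHA']
print(i_and_m)            # ['ALPHA', 'beta']
print(without_s is None)  # True
print(with_s is not None) # True
```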

4. Use prettytable to format the output

This part is icing on the cake: it does not affect the crawl itself and is included to broaden your knowledge, so skip it if you are not interested. It takes the data extracted by the regular expression and uses prettytable to format the output so that it is easier to read. The code is as follows:

# Format the output
table = pd.PrettyTable()
# Set the header
table.field_names = ['Movie name', 'Year', 'Score', 'Number of raters']
for it in p.finditer(response.text):
    # Add one table row per movie
    table.add_row([it.group('name'), it.group('year').strip(), it.group('score'), it.group('num')])
print(table)

Running the code prints the extracted data as a formatted table.

5. Save the data to a CSV file

This part of the code is relatively simple, so we will not go into detail. The main step is opening the file. The code is as follows:

# Open the file in append mode; newline='' prevents blank lines between rows on Windows
with open('data.csv', mode='a', newline='', encoding='utf-8') as f:
    csv_write = csv.writer(f)
    for it in p.finditer(response.text):
        # Convert the match object it to a dictionary
        dic = it.groupdict()
        # Strip whitespace from the value stored under the 'year' key
        dic['year'] = dic['year'].strip()
        # Write the dictionary values to data.csv
        csv_write.writerow(dic.values())
print('Write complete')
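Because the file is opened in append mode, a header row needs some care: writing it on every call would repeat it once per page. One possible approach, sketched here with csv.DictWriter, is to write the header only when the file is new or empty. The helper append_rows, the FIELDS list, and the placeholder rows are assumptions for illustration, and the demo writes to a temporary directory rather than data.csv.

```python
import csv
import os
import tempfile

FIELDS = ['name', 'year', 'score', 'num']  # same keys as the regex group names


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode='a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo with placeholder rows, simulating two pages appended one after the other
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
append_rows(demo_path, [{'name': 'Example Movie A', 'year': '1994', 'score': '9.0', 'num': '12345 ratings'}])
append_rows(demo_path, [{'name': 'Example Movie B', 'year': '2001', 'score': '8.5', 'num': '678 ratings'}])

with open(demo_path, newline='', encoding='utf-8') as f:
    rows_read = list(csv.reader(f))
print(rows_read)
```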

Note that if the CSV file does not display properly in PyCharm, you can install a CSV plug-in so that PyCharm shows the data correctly. The steps are as follows:
Select the file, then open Settings.

In the Settings window, choose Plugins, type CSV in the search box, select the appropriate CSV reader plug-in from the results, click Install, and then restart PyCharm.

6. Show complete code

import requests
import re
import prettytable as pd
import csv

for i in range(10):
    # URL of one page of the leaderboard
    url = 'https://movie.douban.com/top250?start='+str(25*i)+'&filter='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Mobile Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    result = response.text
    p = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?<br>(?P<year>.*?)&nbsp;'
                   r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
                   r'.*?<span>(?P<num>.*?)</span>', re.S)
    # Format the output
    table = pd.PrettyTable()
    # Set the header
    table.field_names = ['Movie name', 'Year', 'Score', 'Number of raters']
    for it in p.finditer(result):
        # Add one table row per movie
        table.add_row([it.group('name'), it.group('year').strip(), it.group('score'), it.group('num')])
    print(table)
    # Open the file in append mode; newline='' prevents blank lines between rows on Windows
    with open('data.csv', mode='a', newline='', encoding='utf-8') as f:
        csv_write = csv.writer(f)
        for it in p.finditer(result):
            # Convert the match object it to a dictionary
            dic = it.groupdict()
            dic['year'] = dic['year'].strip()
            csv_write.writerow(dic.values())
    print('Write complete')
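One way the script could be restructured, offered only as a sketch, is to separate fetching from parsing so that the parsing and writing logic can be exercised without network access. Here the pattern is compiled once, and crawl receives a fetch_page callable and an already open file object. The names parse_movies, crawl, and fake_fetch are hypothetical, and fake_fetch returns made-up markup; in real use, fetch_page would wrap the requests.get call from the script.

```python
import csv
import io
import re

# Compile the pattern once instead of once per page
PATTERN = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?<br>(?P<year>.*?)&nbsp;'
                     r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
                     r'.*?<span>(?P<num>.*?)</span>', re.S)


def parse_movies(html):
    """Return one dict per movie found in the page HTML."""
    rows = []
    for m in PATTERN.finditer(html):
        row = m.groupdict()
        row['year'] = row['year'].strip()
        rows.append(row)
    return rows


def crawl(fetch_page, out_file, pages=10):
    """Fetch each page via fetch_page(start) and write its rows to out_file."""
    writer = csv.writer(out_file)
    total = 0
    for i in range(pages):
        html = fetch_page(25 * i)
        for row in parse_movies(html):
            writer.writerow(row.values())
            total += 1
    return total


def fake_fetch(start):
    """Stand-in for a real HTTP request; returns placeholder markup."""
    return ('<li><span class="title">Movie %d</span><p>Director<br>\n 1994&nbsp;/&nbsp;X</p>'
            '<span class="rating_num" property="v:average">9.0</span>'
            '<span>100 ratings</span></li>' % start)


out = io.StringIO()
total = crawl(fake_fetch, out, pages=3)
lines = out.getvalue().splitlines()
print(total)
print(lines)
```

With this split, the CSV file would be opened once around the crawl call rather than once per page.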