
Screening ByteDance campus-recruitment internship postings with Python and selenium

Editor: Python

Lately I keep wondering: what kinds of positions can I actually get with Python?

So I set my sights on the ByteDance campus recruitment website https://jobs.bytedance.com/campus/position? and planned to crawl the Python-related positions with the requests module.

But as we all know, fetching the page source with a GET request is not hard. The hard part is finding the POST request that returns the target data.

To skip that annoying step, I drive the browser directly, which is where selenium comes in.

Unlike requests, selenium only needs a single GET request. First, edit the parameters of the target url:

After selecting these options on the campus recruitment website, the URL in the browser's address bar changes (ignore the position category for now), and several key-value pairs appear after the "?":

https://jobs.bytedance.com/campus/position?keywords=become a regular worker&location=CT_128%2CCT_45&type=3&current=1&limit=10

keywords: the text typed into the search box
location: the work city, as a short code
category: the position type, as a very long code
type: 3 stands for intern positions
current: the number of the current page (not set in the code)
limit: the number of positions shown per page; to get every matching position onto the first page, it is simply set to 10000
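
To double-check which key-value pairs a copied URL really carries, the standard library can split it apart. This is only a minimal sketch: sample_url has the same shape as the URL above, but its keyword (Python) is an assumption made up for the demo, not something copied from the site.

```python
from urllib.parse import urlsplit, parse_qsl

# Illustrative URL in the same shape as the one above; the keyword is an assumption
sample_url = ('https://jobs.bytedance.com/campus/position'
              '?keywords=Python&location=CT_128%2CCT_45&type=3&current=1&limit=10')

# parse_qsl decodes percent-escapes, so %2C comes back as a plain comma
pairs = dict(parse_qsl(urlsplit(sample_url).query))
for key, value in pairs.items():
    print(f'{key}: {value}')
```

Every value comes back as a string, and the two city codes in location are separated by a comma once the %2C escape is decoded.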

Parse the key-value pairs in the new URL, update keywords, location, category, and type in the params dictionary in the code, and a for loop then splices everything into the final url:

import pandas as pd
from selenium.webdriver import Edge
from tqdm import tqdm

url = 'https://jobs.bytedance.com/campus/position?'
params = {
    'keywords': 'become a regular worker',  # Search keyword
    'location': 'CT_45%2CCT_128',  # Work cities: Guangzhou, Shenzhen
    # Position type: research and development
    'category': '6704215862603155720%2C6704215862557018372%2C6704215956018694411%2C6704215886108035339%2C6704215957146962184%2C6704215897130666254%2C6704215958816295181%2C6704215888985327886%2C6704215963966900491%2C6704216109274368264%2C6704217321877014787%2C6704219452277262596%2C6704216635923761412%2C6704219534724696331%2C6704216296701036811%2C6938376045242353957',
    'type': 3,  # Recruitment type: intern
    'limit': 10000  # Number of positions shown per page
}
# Append every key=value pair to the query string (values are already percent-encoded)
for key in params:
    url += f'{key}={params[key]}&'
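
The loop glues the pairs together by hand, which is why the commas in the values are already written as %2C. As an alternative sketch (not what the code above does), urllib.parse.urlencode can take care of the escaping. Here build_url, BASE, and the shortened parameter set are hypothetical names and values chosen for illustration.

```python
from urllib.parse import urlencode

BASE = 'https://jobs.bytedance.com/campus/position'

def build_url(params):
    """Join BASE and params into one URL; urlencode escapes the values itself."""
    return f'{BASE}?{urlencode(params)}'

# Pass raw commas here: urlencode turns them into %2C on its own
example = build_url({
    'keywords': 'Python',        # illustrative keyword
    'location': 'CT_45,CT_128',  # city codes taken from the params above
    'type': 3,
    'limit': 10000,
})
print(example)
```

With this approach the values should be written unescaped. Feeding it a string that already contains %2C would escape the percent sign a second time.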

Another difference from requests: selenium needs no POST request, but it does have to wait for the page to load. So a while loop keeps retrying until the page has loaded, and xpath is then used to locate the nodes:

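def xpath(root, value, verbose=False):
    ''' selenium node element positioning
        root: root node
        value: xpath expression
        verbose: output debugging information'''
    while True:
        try:
            result = root.find_elements('xpath', value)
        except Exception:
            # The lookup itself failed (e.g. the page is still changing); retry
            result = []
        if result:
            # Return a single element directly, otherwise the whole list
            return result if len(result) != 1 else result[0]
        # find_elements returns an empty list when nothing matches yet
        if verbose: print('\rNo corresponding element found...', end='')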
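
One weakness of the while loop is that it never gives up if the xpath is simply wrong. Below is a rough sketch of the same polling idea with a deadline. The name wait_for_elements and its timeout and interval parameters are my own assumptions, and it only relies on root having a find_elements method. Selenium also ships its own WebDriverWait helper in selenium.webdriver.support.ui for this purpose.

```python
import time

def wait_for_elements(root, value, timeout=10.0, interval=0.2):
    """Poll root.find_elements until it returns matches or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = root.find_elements('xpath', value)
        if result:
            return result  # always a list, even for a single match
        if time.monotonic() >= deadline:
            raise TimeoutError(f'no element matched {value!r} within {timeout}s')
        time.sleep(interval)  # pause between attempts instead of spinning
```

Always returning a list also means the caller never has to special-case a page that happens to contain exactly one match.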

Inspect the page to find where the position links sit, copy the xpath and tweak it a little, and you get the links to all positions on this page.

Similarly, open a position link and find where the "Job requirements" section sits (the title can be located the same way). With that, the main function can be written:

def byte_dance(keywords=(), stopwords=()):
    ''' keywords: sequence of keywords that must appear in the requirements
        stopwords: sequence of stop words that must not appear'''
    web = Edge()
    web.get(url)
    # Find the link to each position
    links = xpath(web, '//*[@id="bd"]/section/section/main/div/div/div[2]/div[3]/div[1]/div[2]/a', verbose=True)
    # xpath() returns a bare element when there is exactly one match
    if not isinstance(links, list): links = [links]
    links = list(map(lambda link: link.get_attribute('href'), links))
    # Screen the campus recruitment postings
    desired = []
    for link in tqdm(links):
        web.get(link)
        box = xpath(web, '//*[@id="bd"]/section/section/main/div/div/div[1]')
        # Read the position name and the job requirements
        title = xpath(box, 'div[1]/span').text
        require = xpath(box, 'div[6]').text
        # Collect keywords that are missing and stop words that do appear
        fail = list(filter(lambda kwd: kwd not in require, keywords)) + \
               list(filter(lambda swd: swd in require, stopwords))
        if not fail: desired.append({'link': link, 'title': title, 'require': require})
    web.quit()
    return pd.DataFrame(desired)
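
The screening rule inside the loop can be tried out without opening a browser. Here is a small sketch of the same rule as a standalone function. The name matches and the sample requirement strings are made up for illustration.

```python
def matches(text, keywords=(), stopwords=()):
    """True when every keyword occurs in text and no stop word does."""
    return (all(kwd in text for kwd in keywords)
            and not any(swd in text for swd in stopwords))

# Made-up requirement texts, only to exercise the rule
print(matches('Familiar with Python and Linux', keywords=['Python']))
print(matches('Familiar with Python and Linux', keywords=['Python'], stopwords=['Linux']))
print(matches('Proficient in Java', keywords=['Python']))
```

The three calls print True, False, and False. Note that the check is a plain case-sensitive substring test, so 'python' in lowercase would not match 'Python'.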

Finally, use the ExcelWriter from pandas to write the filtered data to Excel:

def excel_dump(dataframe, file, sheet_name='tzj', float_format='%.4f'):
    # The context manager saves and closes the workbook on exit
    # (writer.save() no longer exists in pandas 2.0 and later)
    with pd.ExcelWriter(file) as writer:
        dataframe.to_excel(writer, sheet_name=sheet_name, float_format=float_format)

desired = byte_dance(keywords=['Python'], stopwords=[])
print(desired)
excel_dump(desired, "I'm sorry.xlsx")

All the positions whose "Job requirements" mention "Python" are:

link
Test development intern (can become a regular) - International live broadcast (Shenzhen / Beijing) - Join ByteDance
Tiktok graphic image algorithm intern (can become a regular) - Join ByteDance
Background development intern - Storage (there is a chance to become a regular) - Join ByteDance
Background development intern - Infrastructure - there is a chance to become a regular - Join ByteDance
Back-end development intern - Tiktok / Volcano / International video (there is a chance to become a regular) - Join ByteDance
Background development intern - Advertising system (can become a regular) - Join ByteDance
Recommendation algorithm intern - Tiktok (there is a chance to become a regular) - Join ByteDance
Algorithm intern - Risk control - there is a chance to become a regular - Join ByteDance
Test development intern - Advertising system (can become a regular) - Join ByteDance