A static web page is stored on the server as a ready-made .html or .htm document, and the server sends that document to the client unchanged.
A dynamic web page relies on client-side and server-side scripts to produce the document that is finally displayed.
Client-side scripts:
Mainly JavaScript, which runs in the browser and lets the client respond to events, including data arriving from the server side.
Server-side scripts:
Many scripting languages run on the server side, including PHP, ASP, ASP.NET, JSP, ColdFusion, and Perl. They let the server respond to events such as web page submissions.
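To make the difference concrete, here is a minimal sketch that uses only Python's standard library. The two HTML strings and the names `ClassCounter` and `count_class` are made up for illustration. It counts the elements of a given class that are present in the raw source, which is all a plain HTTP fetch would see. In the dynamic string the same elements exist only as text inside a script until a browser runs it.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    """Return how many elements with css_class appear as real tags in html."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# Hypothetical static page: the elements are already in the document.
STATIC_PAGE = '<body><div class="quote">A</div><div class="quote">B</div></body>'

# Hypothetical dynamic page: the elements are only written by a script.
DYNAMIC_PAGE = """<body><script>
var data = ["A", "B"];
for (var i in data) {
    document.write("<div class='quote'>" + data[i] + "</div>");
}
</script></body>"""

print(count_class(STATIC_PAGE, "quote"))
print(count_class(DYNAMIC_PAGE, "quote"))
```

The parser treats everything inside the script element as plain text, so the second count reflects what a scraper without a JavaScript engine would find.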
Selenium is a web automation testing tool. It controls browsers through their drivers, and it can also drive headless browsers (browsers with no graphical user interface) such as PhantomJS.
Install Selenium:
pip install selenium
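It can help to check which Selenium version pip installed, because Selenium 4 dropped the PhantomJS driver and later 4.x releases dropped the find_element_by_* helpers. The following is a small sketch using the standard library; the function names `installed_version` and `major_version` are my own.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Return the leading integer of a dotted version string."""
    return int(version.split(".")[0])


selenium_version = installed_version("selenium")
if selenium_version is None:
    print("selenium is not installed; run: pip install selenium")
else:
    print("selenium", selenium_version, "major version", major_version(selenium_version))
```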
Selenium also needs a browser driver to run. Download the driver for your browser (I downloaded the Chrome driver):
Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Note that the chromedriver version must match the version of the Chrome browser installed on your computer.
Then put the driver in a directory that is listed in the system Path variable.
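A quick way to confirm both points is sketched below. `find_driver` and `same_major` are hypothetical helper names, the version strings in the usage note are invented examples, and treating "same first version component" as the matching rule is my assumption rather than something the driver documentation is quoted on here.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


def same_major(browser_version, driver_version):
    """Compare only the first dotted component of two version strings."""
    return browser_version.split(".")[0] == driver_version.split(".")[0]


path = find_driver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at", path)
```

For example, `same_major("114.0.5735.90", "114.0.5735.198")` compares "114" with "114", while a 114 browser and a 115 driver would be reported as a mismatch.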
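PhantomJS is a headless browser that is scripted with JavaScript. It is no longer actively developed, and Selenium 4 no longer includes a PhantomJS driver, so it can only be used with Selenium 3 or earlier.
Download PhantomJS: https://phantomjs.org/download.html
After downloading, copy the .exe file from the bin directory into the Windows/System32 directory.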
Target URL: http://quotes.toscrape.com/js/
This is a clean-looking web page. My goal is to scrape the quotes on its first few pages.
Next, look at its source code:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" >Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                    <a href="/login">Login</a>
                </p>
            </div>
        </div>
        <script src="/static/jquery.js"></script>
        <script>
        var data = [
            { "tags": [ "change", "deep-thoughts", "thinking", "world" ], "author": { "name": "Albert Einstein", "goodreads_link": "/author/show/9810.Albert_Einstein", "slug": "Albert-Einstein" }, "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d" },
            { "tags": [ "abilities", "choices" ], "author": { "name": "J.K. Rowling", "goodreads_link": "/author/show/1077326.J_K_Rowling", "slug": "J-K-Rowling" }, "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d" },
            { "tags": [ "inspirational", "life", "live", "miracle", "miracles" ], "author": { "name": "Albert Einstein", "goodreads_link": "/author/show/9810.Albert_Einstein", "slug": "Albert-Einstein" }, "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d" },
            { "tags": [ "aliteracy", "books", "classic", "humor" ], "author": { "name": "Jane Austen", "goodreads_link": "/author/show/1265.Jane_Austen", "slug": "Jane-Austen" }, "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d" },
            { "tags": [ "be-yourself", "inspirational" ], "author": { "name": "Marilyn Monroe", "goodreads_link": "/author/show/82952.Marilyn_Monroe", "slug": "Marilyn-Monroe" }, "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d" },
            { "tags": [ "adulthood", "success", "value" ], "author": { "name": "Albert Einstein", "goodreads_link": "/author/show/9810.Albert_Einstein", "slug": "Albert-Einstein" }, "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d" },
            { "tags": [ "life", "love" ], "author": { "name": "Andr\u00e9 Gide", "goodreads_link": "/author/show/7617.Andr_Gide", "slug": "Andre-Gide" }, "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d" },
            { "tags": [ "edison", "failure", "inspirational", "paraphrased" ], "author": { "name": "Thomas A. Edison", "goodreads_link": "/author/show/3091287.Thomas_A_Edison", "slug": "Thomas-A-Edison" }, "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d" },
            { "tags": [ "misattributed-eleanor-roosevelt" ], "author": { "name": "Eleanor Roosevelt", "goodreads_link": "/author/show/44566.Eleanor_Roosevelt", "slug": "Eleanor-Roosevelt" }, "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d" },
            { "tags": [ "humor", "obvious", "simile" ], "author": { "name": "Steve Martin", "goodreads_link": "/author/show/7103.Steve_Martin", "slug": "Steve-Martin" }, "text": "\u201cA day without sunshine is like, you know, night.\u201d" }
        ];
        for (var i in data) {
            var d = data[i];
            var tags = $.map(d['tags'], function(t) { return "<a class='tag'>" + t + "</a>"; }).join(" ");
            document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
        }
        </script>
        <nav>
            <ul class="pager">
                <li class="next">
                    <a href="/js/page/2/">Next <span aria-hidden="true">→</span></a>
                </li>
            </ul>
        </nav>
    </div>
    <footer class="footer">
        <div class="container">
            <p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
            </p>
            <p class="copyright">
                Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
            </p>
        </div>
    </footer>
</body>
</html>
The quotes on this page are rendered by front-end JavaScript; the quote data exists only as a JavaScript array inside the HTML file.
In the HTML, this JavaScript loop writes the quotes into the page:
for (var i in data) {
    var d = data[i];
    var tags = $.map(d['tags'], function(t) { return "<a class='tag'>" + t + "</a>"; }).join(" ");
    document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}
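Because the data sits in the page source as a JSON-compatible array, it can in principle be read without a browser at all. The sketch below is an alternative to the Selenium approach of this article, not a replacement for it. The function name `extract_inline_data` and the shortened `SAMPLE` string are my own, and the regular expression assumes the array is assigned to `var data` and that the first `];` in the script closes it.

```python
import json
import re


def extract_inline_data(page_source):
    """Return the list assigned to `var data` in an inline script, or [] if absent."""
    match = re.search(r"var\s+data\s*=\s*(\[.*?\]);", page_source, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))


# Shortened, hypothetical excerpt in the same shape as the page source above.
SAMPLE = r"""<script>
var data = [
    { "tags": [ "humor", "obvious", "simile" ], "author": { "name": "Steve Martin" }, "text": "\u201cA day without sunshine is like, you know, night.\u201d" }
];
for (var i in data) { var d = data[i]; }
</script>"""

for item in extract_inline_data(SAMPLE):
    print(item["author"]["name"], item["tags"])
```

This only works for pages that embed their data this way; a page that fetches its data with a later request still needs a real browser or a look at that request.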
The markup for the next-page link is:
<nav>
    <ul class="pager">
        <li class="next">
            <a href="/js/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
</nav>
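The href in that link is relative to the site root, so it has to be combined with the current address before it can be visited. Here is one way to do that with the standard library only. `NextLinkParser` and `next_page_url` are illustrative names, and the sketch assumes the link always sits inside an li element with the class next.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> found inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.in_next_item = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self.in_next_item = True
        elif tag == "a" and self.in_next_item and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next_item = False


def next_page_url(page_source, current_url):
    """Return the absolute URL of the next page, or None when there is no next link."""
    parser = NextLinkParser()
    parser.feed(page_source)
    if parser.href is None:
        return None
    return urljoin(current_url, parser.href)


PAGER = '<nav><ul class="pager"><li class="next"><a href="/js/page/2/">Next</a></li></ul></nav>'
print(next_page_url(PAGER, "http://quotes.toscrape.com/js/"))
```

Returning None when the link is missing gives a crawl loop a natural stopping condition on the last page.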
# Import the required modules
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup

chrome_options = Options()
# Run Chrome without a visible window
chrome_options.add_argument('--headless')
# These two flags are usually needed when running inside Docker
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Start Chrome with the options above
driver = webdriver.Chrome(options=chrome_options)
# PhantomJS alternative (Selenium 3 only; Selenium 4 removed this driver)
# driver = webdriver.PhantomJS()
Get the page source:
driver.get('http://quotes.toscrape.com/js/')
content = driver.page_source
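On pages that render asynchronously, page_source read straight after get may not contain the generated elements yet. Selenium ships its own explicit-wait helper, `WebDriverWait`; the sketch below is a hand-rolled polling loop standing in for it, so that the idea is visible in a few lines. `wait_for_elements` is a hypothetical name, and the function expects any object with a Selenium 4 style `find_elements(by, value)` method.

```python
import time


def wait_for_elements(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll until at least one element matches, then return the matches.

    Returns an empty list if nothing appears before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        found = driver.find_elements("css selector", css_selector)
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)
```

With a real driver this could be called as `wait_for_elements(driver, "div.quote")` before reading `driver.page_source`.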
Pagination code:
host = 'http://quotes.toscrape.com'
biaoyus = []  # one list of quote elements per page
next_url = 'http://quotes.toscrape.com/js/'
for i in range(4):
    # Load the page with the driver
    driver.get(next_url)
    content = driver.page_source
    # Parse the rendered page and collect the quote elements
    eles = soup(content, 'html.parser')
    biaoyus.append(eles.find_all("div", {"class": "quote"}))
    print(len(biaoyus))
    # Build the URL of the next page
    next_url = host + eles.find('li', {'class': 'next'}).find('a')['href']
    print(next_url)
Complete code:
# Import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup as soup

# Start Chrome
driver = webdriver.Chrome()
# PhantomJS alternative (Selenium 3 only; Selenium 4 removed this driver)
# driver = webdriver.PhantomJS()

# Site root, used to turn relative links into absolute URLs
host = 'http://quotes.toscrape.com'
biaoyus = []  # one list of quote elements per page
next_url = 'http://quotes.toscrape.com/js/'
for i in range(4):
    # Load the page with the driver
    driver.get(next_url)
    content = driver.page_source
    # Parse the rendered page and collect the quote elements
    eles = soup(content, 'html.parser')
    biaoyus.append(eles.find_all("div", {"class": "quote"}))
    print(len(biaoyus))
    # Build the URL of the next page
    next_url = host + eles.find('li', {'class': 'next'}).find('a')['href']
    print(next_url)

# Close the browser once all pages have been read
driver.quit()

# Print the text, author, and tags of every collected quote
for biaoyu in biaoyus:
    for quote in biaoyu:
        print(quote.find(class_='text').getText())
        print(quote.find(class_='author').getText())
        print(quote.find(class_='tags').getText())
        print('\n')
Next, I'll crawl JD.com for the first 200 books returned by a search for the keyword "python".
URL: https://search.jd.com/Search?keyword=python&enc=utf-8&wq=python&pvid=3e6f853b03a64d86b17638dc2de70fdf
Viewing the page source shows the structure of a book entry: each book is displayed on the page as an li list item.
This page fills in its book list as the user scrolls. It shows only some of the books at first, and the rest appear only after the user scrolls down in the browser. The markup for the scroll-loading area is:
<span class="clr"></span>
<div id="J_scroll_loading" class="notice-loading-more"><span>Loading, please wait~~</span></div>
<div class="page clearfix"><div id="J_bottomPage" class="p-wrap"></div></div>
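A fixed number of scrolls may be too few or too many, so one option is to keep scrolling until the page stops getting taller. This is a rough sketch of that idea, not something taken from the JD.com page itself. `scroll_to_bottom`, `pause`, and `max_rounds` are names I chose, and the function only relies on the driver having an `execute_script` method, as Selenium drivers do.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=10):
    """Scroll down repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give lazily loaded items time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return rounds
```

The `max_rounds` cap is there so that a page with endless scrolling cannot keep the loop running forever.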
Information on 200 books cannot be read from a single page, so I use Selenium's simulated click to move to the next page and crawl several pages.
from selenium.webdriver.common.by import By

# Locate the next-page button by its class name
next_button = driver.find_element(By.CLASS_NAME, 'pn-next')
# Simulate a click
next_button.click()
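On the last results page there may be no next-page button, and looking up a single missing element raises an error. A defensive variant is sketched below under that assumption. `click_next` is a hypothetical helper; it uses the plural `find_elements` of Selenium 4, which returns an empty list when nothing matches.

```python
def click_next(driver, class_name="pn-next"):
    """Click the first element with the given class; return False if none exists."""
    # "class name" is the string value of selenium's By.CLASS_NAME
    buttons = driver.find_elements("class name", class_name)
    if not buttons:
        return False
    buttons[0].click()
    return True
```

A crawl loop can then stop as soon as `click_next(driver)` returns False.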
# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import time
import json

# Start Chrome
driver = webdriver.Chrome()

# Open the search results with the driver
start_url = 'https://search.jd.com/Search?keyword=python'
driver.get(start_url)

booksstore = []
# Save the data as one JSON object per line
with open("books.txt", "a", encoding='utf-8') as fi:
    for j in range(4):
        # Scroll to the bottom so the lazily loaded books appear
        for i in range(2):
            driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            # Wait for the page to load
            time.sleep(4)
        content = driver.page_source
        # Parse the page and find the book entries
        eles = soup(content, 'html.parser')
        books = eles.find_all('li', {'class': 'gl-item'})
        print(len(books))
        for book in books:
            name = book.find('div', {'class': 'p-name'}).find('a').find('em').getText()
            price = book.find('div', {'class': 'p-price'}).find('i').getText()
            commit = 'https:' + book.find('div', {'class': 'p-commit'}).find('a')['href']
            shop = book.find('div', {'class': 'p-shopnum'}).find_all('a')
            print(name)
            print(price)
            print(commit)
            record = {'Book name': name, 'Book price': price, 'Purchase address': commit}
            # Some entries have no shop link
            if len(shop) != 0:
                shopaddress = shop[0]['href']
                shopname = shop[0]['title']
                print("http:" + shopaddress)
                print(shopname)
                record['Store address'] = "http:" + shopaddress
                record['Shop name'] = shopname
            booksstore.append(record)
            fi.write(json.dumps(record, ensure_ascii=False))
            fi.write("\n")
        # Go to the next page
        next_button = driver.find_element(By.CLASS_NAME, 'pn-next')
        print(next_button.text)
        next_button.click()
        time.sleep(4)

print(len(booksstore))
print(booksstore)
driver.quit()
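The script writes one JSON object per line to books.txt, so the saved data can be loaded back one line at a time. Here is a short sketch of that; `parse_json_lines` is my own helper name, and the two sample lines are invented just to show the call.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of lines holding one JSON object each, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Invented sample lines in the same one-object-per-line layout.
sample_lines = [
    '{"Book name": "A", "Book price": "59.00"}\n',
    '\n',
    '{"Book name": "B", "Book price": "79.50"}\n',
]
print(len(parse_json_lines(sample_lines)))
```

An open file can be passed directly, for example `with open("books.txt", encoding="utf-8") as fi: books = parse_json_lines(fi)`.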
Crawl results: