utilize requests Visit the website for html, use re Regular expressions match and process characters
# -*- coding: utf-8 -*-
# The above line tells the compiler the encoding format to use . In this way, even if there is Chinese, there will be no problem
import re
import requests
response = requests.get('https://www.quora.com/Is-online-education-overrated') # Page to crawl
f = open("words.txt", "a") # Create in read-write mode / Open file
data = response.text # Express the website source code in words , The coding format can be changed
title = ' '.join(re.findall('<title>(.*?)</title>',data)) # Webpage title
result_list = re.findall('"text": "(.*?)."',data)+re.findall(r'''\\\\\\"text\\\\\\": \\\\\\"(.*?).\\\\\\",''',data)
# The regular expression here is more complex , Mainly looking for “text” Content of element . According to the of the web page html The rules are different , What to look for tag Also different
f.write('\n') # Write one line for another
print(title) # Output title
for result in result_list: # Do some follow-up
result = result.replace(r'\u2019', r"'") # Manually switch to special characters
result = result.replace('\\\\\\', "") # Remove the double backslash
result = result.replace(r'/', "") # Remove single slash
result = result.replace(r'\n', "") # Remove the newline
result = result.replace('\\', "") # Remove the single backslash
check = result.split() # Change the format to list, Each element is an answer
for ele in check: # Traversal list elements , Delete other irrelevant characters
if '"modifiers": {"image": ' in ele or len(ele) >= 13:
check.remove(ele)
result = ' '.join(check) # Turn it into str
f.write(result+", "+title+"\n") # Write it in a file
print(result) # Output the main contents written
f.close() # Good habit of saving files
To demonstrate regular matching and crawlers , The code adds a lot of irrelevant code , And the match is not perfect . Most of them are quite accurate , such as
Culture of Qualit
"On quality Terry Anderson emphasized that ""learning- knowledge- assessment- and educational experiences will result in high levels of learning by all He also believes that the ""integration of the new tools and affordances of the educational Semantic Web and emerging social software solutions will further enhance and make more accessible and affordable quality online learning experiences"
Since I have titled this observation as
ODeL Xperitu
[from Latin experitu = experienced tested proven] let me say that learning must progress to maturity; to function well as social innovators promoting excellence through Capacity Building and Development. Yes this is the Quality Assurance (QA) principle that defines and determ
"Michael Moore even says that this is a fact of distance education wherein ""teaching is hardly ever an individual act but a process joining together the expertise of a number of specialists."
However , There are still some strange characters that have not been deleted
#NAME?
#NAME?
#NAME?
#NAME?
#NAME?
For any Query or Enquiry Please Call u2013 Hiren Harwani - 9712186969 (you can join us in Whatsapp also
See if the partners can try to improve this program by themselves !
in addition , If you are crawling Chinese web pages , Pay attention to changing the coding format to utf-8 Oh
In fact, there are many other third-party libraries online , such as beautiful soap 4, It is also worth exploring .