1. Installation
pip install pyquery
2. Import
from pyquery import PyQuery as pq
3. Brief introduction
pyquery is a jQuery-like library for parsing HTML in Python. Its usage is similar to bs4.
4. Usage
4.1 Initialization:
from pyquery import PyQuery as pq

# html is a string containing HTML markup
doc = pq(html)                           # parse an HTML string
doc = pq(url="http://news.baidu.com/")   # fetch and parse a web page
doc = pq(filename="./a.html")            # parse a local HTML file
4.2 Basic CSS selectors
from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
print(doc("#wrap .s_from link"))
Running results:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
In the selector, # matches an id, . matches a class, and a bare name such as link matches a tag name. A space between two selectors means the second one is searched for inside the first.
4.3 Find child elements
from pyquery import PyQuery as pq

html = '''
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

# Find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("The type is: %s" % type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)
Running results:
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
The type is: <class 'pyquery.pyquery.PyQuery'>
<ul class="s_from">
    asdasd
    <link href="http://asda.com">asdadasdad12312</link>
    <link href="http://asda1.com">asdadasdad12312</link>
    <link href="http://asda2.com">asdadasdad12312</link>
</ul>
<ul class="s_from">
    asdasd
    <link href="http://asda.com">asdadasdad12312</link>
    <link href="http://asda1.com">asdadasdad12312</link>
    <link href="http://asda2.com">asdadasdad12312</link>
</ul>
The output shows that the result is a PyQuery object, and that both find and children return the inner tags. find searches all descendants, while children returns only the direct children.
4.4 Find the parent element
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
items = doc(".s_from")
print(items)

# Find the parent element
parent_href = items.parent()
print(parent_href)
Running results:
<ul class="s_from">
    asdasd
    <link href="http://asda.com">asdadasdad12312</link>
    <link href="http://asda1.com">asdadasdad12312</link>
    <link href="http://asda2.com">asdadasdad12312</link>
</ul>
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
parent returns the enclosing outer tag. The similar parents method returns all of the outer (ancestor) nodes.
4.5 Find sibling elements
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
items = doc("link.active1.a123")
print(items)

# Find sibling elements
siblings_href = items.siblings()
print(siblings_href)
Running results:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
The output shows that siblings returns the other tags at the same level.
Conclusion: child, parent, and sibling lookups all return PyQuery objects, so you can run further selections on their results.
4.6 Iterate over the search results
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()   # items() yields each match as a PyQuery object
for it in its:
    print(it)
Running results:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
4.7 Get attribute information
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))   # method-call form
    print(it.attr.href)      # attribute-access form, same value
Running results:
http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com
4.8 Get text
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())
Running results:
asdadasdad12312
asdadasdad12312
asdadasdad12312
4.9 Get HTML content
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())   # inner markup of each tag
Running results:
<a>asdadasdad12312</a>
asdadasdad12312
asdadasdad12312
5. Common DOM operations
5.1 addClass removeClass
addClass and removeClass add a class to, or remove a class from, the selected tags.
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print("Add: %s" % it.addClass('active1'))
    print("Remove: %s" % it.removeClass('active1'))
Running results:
Add: <link class="active1 a123" href="http://asda.com"><a>asdadasdad12312</a></link>
Remove: <link class="a123" href="http://asda.com"><a>asdadasdad12312</a></link>
Add: <link class="active2 active1" href="http://asda1.com">asdadasdad12312</link>
Remove: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
Add: <link class="movie1 active1" href="http://asda2.com">asdadasdad12312</link>
Remove: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Note that a class which is already present on a tag is not added a second time.
5.2 attr css
attr gets or modifies an attribute; css adds a property to the style attribute.
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link").items()
for it in its:
    print("Modify: %s" % it.attr('class', 'active'))
    print("Add: %s" % it.css('font-size', '14px'))
Running results:
Modify: <link class="active" href="http://asda.com"><a>asdadasdad12312</a></link>
Add: <link class="active" href="http://asda.com" style="font-size: 14px"><a>asdadasdad12312</a></link>
Modify: <link class="active" href="http://asda1.com">asdadasdad12312</link>
Add: <link class="active" href="http://asda1.com" style="font-size: 14px">asdadasdad12312</link>
Modify: <link class="active" href="http://asda2.com">asdadasdad12312</link>
Add: <link class="active" href="http://asda2.com" style="font-size: 14px">asdadasdad12312</link>
attr and css modify the selected elements in place.
5.3 remove
remove deletes the matching tags.
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("div")
print('Text before removal:\n%s' % its.text())
it = its.remove('ul')
print('Text after removal:\n%s' % it.text())
Running results:
Text before removal:
hello nihao asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
Text after removal:
hello nihao
For other DOM methods, see the API reference:
http://pyquery.readthedocs.io/en/latest/api.html
6. Pseudo-class selectors
from pyquery import PyQuery as pq

html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>helloasdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''

doc = pq(html)
its = doc("link:first-child")
print('First tag: %s' % its)
its = doc("link:last-child")
print('Last tag: %s' % its)
its = doc("link:nth-child(2)")
print('Second tag: %s' % its)
its = doc("link:gt(0)")   # indices start at 0
print("Tags after index 0: %s" % its)
its = doc("link:nth-child(2n-1)")
print("Odd-numbered tags: %s" % its)
its = doc("link:contains('hello')")
print("Tags whose text contains hello: %s" % its)
Running results:
First tag: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
Last tag: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Second tag: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
Tags after index 0: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Odd-numbered tags: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Tags whose text contains hello: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>