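XPath
<p>XPath syntax</p>
<p>XPath uses path expressions to select nodes or node-sets in an XML document. Nodes are selected by following a path, which is made up of steps.</p>
<p>Selecting nodes: XPath provides six basic expressions for selecting nodes, and they can be combined freely.</p>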
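<p>1. nodename (a node name such as div, a, or book): selects the child elements of the current node that have this name;</p>
<p>2. / : selects from the root node;</p>
<p>3. // : selects matching nodes at any depth, regardless of their position, starting from the document root when it begins the expression and from the current node when it follows a step;</p>
<p>4. . : selects the current node;</p>
<p>5. .. : selects the parent of the current node;</p>
<p>6. @ : selects attributes.</p>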
<p>Example:</p>
<p>from lxml import etree</p>
<p>doc = """
<?xml version="1.0" encoding="ISO-8859-1"?></p>
<html>
<body>
<bookstore id="test" class="ttt">
<book id= "1" class = "2">
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book id = "2222222222222">11111111111111111111
<title lang="abc">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
<a></a>
</body>
</html>
<p>"""</p>
<p>html = etree.HTML(doc)
print(html.xpath("body")) </p>
<p># result: [<Element body at 0x24ea98109c8>]</p>
<p># / searches only from the root; // searches the whole document for every matching tag</p>
<p>print(html.xpath("/bookstore")) # absolute path from the document root; the root element here is html, so nothing matches</p>
<p># result: []</p>
<p>print(html.xpath("//bookstore")) # finds every match anywhere in the document</p>
<p># result: [<Element bookstore at 0x241ba69f948>]</p>
<p>print(html.xpath("//bookstore[@class='ttt']//book")) # the inner // also matches book at any depth, but only below the selected bookstore</p>
<p># result: [<Element book at 0x13ddc771948>, <Element book at 0x13ddc771988>]</p>
<p>print(html.xpath("//book"))</p>
<p># result: [<Element book at 0x13ddc771948>, <Element book at 0x13ddc771988>]</p>
<p>print(html.xpath("//*")) # * is the wildcard</p>
<p># result: [<Element html at 0x13ddc771848>, <Element body at 0x13ddc771948>, <Element bookstore at 0x13ddc771988>, <Element book at 0x13ddc7719c8>, <Element title at 0x13ddc771a08>, <Element price at 0x13ddc771a88>, <Element book at 0x13ddc771ac8>, <Element title at 0x13ddc771b08>, <Element price at 0x13ddc771b48>, <Element a at 0x13ddc771a48>]</p>
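<p>The example above does not exercise ., .. or @. The following is a minimal sketch of those three. It assumes a trimmed copy of the same sample and, unlike the original, parses it with etree.XML so that the tree does not depend on HTML error recovery; the variable names are illustrative only.</p>

```python
from lxml import etree

# Trimmed copy of the bookstore sample. Parsed as XML here (an assumption of
# this sketch) so the resulting tree is exactly what the markup says.
doc = """
<html>
  <body>
    <bookstore id="test" class="ttt">
      <book id="1" class="2">
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
      </book>
      <book id="2222222222222">
        <title lang="abc">Learning XML</title>
        <price>39.95</price>
      </book>
    </bookstore>
  </body>
</html>
"""
root = etree.XML(doc)

# @ selects attributes: the lang value of every title, in document order.
langs = root.xpath("//title/@lang")
print(langs)  # ['eng', 'abc']

# . is the current node: a relative path evaluated on a single element.
first_book = root.xpath("//book")[0]
first_title = first_book.xpath("./title/text()")
print(first_title)  # ['Harry Potter']

# .. is the parent: step from a title back up to its book and read the id.
parent_ids = root.xpath("//title[@lang='abc']/../@id")
print(parent_ids)  # ['2222222222222']
```

<p>Attribute and text() results come back as a list of strings, while element paths return a list of Element objects on which xpath() can be called again.</p>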
<p>To make queries more precise, XPath also provides predicates: conditions that restrict the selection, written inside square brackets.</p>
<p># select by index</p>
<p>print(html.xpath("//bookstore/book[1]/title/text()")) # the first book (XPath positions start at 1)</p>
<p># result: ['Harry Potter']</p>
<p>print(html.xpath("//bookstore/book[last()-1]/title/text()")) # last() is the last one; last()-1 is the second to last</p>
<p># result: ['Harry Potter']</p>
<p>print(html.xpath("//bookstore/book[position()>1]/title/text()")) # position greater than 1</p>
<p># result: ['Learning XML']</p>
<p># filter by attribute</p>
<p># matches any element that has a lang attribute</p>
<p>print(html.xpath("//*[@lang]"))</p>
<p># result: [<Element title at 0x26867410948>, <Element title at 0x26867410908>]</p>
<p># matches any element that has at least one attribute; @ denotes an attribute and * is the wildcard</p>
<p>print(html.xpath("//*[@*]"))</p>
<p># result: [<Element bookstore at 0x224c1d41a08>, <Element book at 0x224c1d41a48>, <Element title at 0x224c1d419c8>, <Element book at 0x224c1d41a88>, <Element title at 0x224c1d41ac8>]</p>
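<p>Predicates can also test values, not just position or the presence of an attribute. The sketch below shows a few common forms. It assumes a trimmed copy of the same sample parsed with etree.XML, and the variable names are made up for illustration.</p>

```python
from lxml import etree

# Trimmed copy of the bookstore sample, parsed as XML (an assumption of this
# sketch, so the tree does not depend on HTML error recovery).
doc = """
<html>
  <body>
    <bookstore id="test" class="ttt">
      <book id="1" class="2">
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
      </book>
      <book id="2222222222222">
        <title lang="abc">Learning XML</title>
        <price>39.95</price>
      </book>
    </bookstore>
  </body>
</html>
"""
root = etree.XML(doc)

# Attribute equal to a given value.
eng_titles = root.xpath("//title[@lang='eng']/text()")
print(eng_titles)  # ['Harry Potter']

# Child element value compared as a number.
expensive = root.xpath("//book[price>35]/title/text()")
print(expensive)  # ['Learning XML']

# Conditions combined with "and" ("or" works the same way).
cheap_first = root.xpath("//book[@id='1' and price<30]/title/text()")
print(cheap_first)  # ['Harry Potter']

# Substring test with the XPath 1.0 function contains().
xml_titles = root.xpath("//title[contains(text(), 'XML')]/text()")
print(xml_titles)  # ['Learning XML']
```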
<p>When several alternatives are acceptable, "|" combines path expressions and returns the union of their results.</p>
<p># multiple alternatives</p>
<p>print(html.xpath("//title|//price"))</p>
<p># result: [<Element title at 0x1d6f4ec0b48>, <Element price at 0x1d6f4ec0a48>, <Element title at 0x1d6f4ec0bc8>, <Element price at 0x1d6f4ec0a88>]</p>
<p>XPath also provides axes, which define a node-set relative to the current node:</p>
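<p>1. ancestor: selects all ancestors (parent, grandparent, and so on) of the current node.</p>
<p>2. ancestor-or-self: selects all ancestors (parent, grandparent, and so on) of the current node, plus the current node itself.</p>
<p>3. attribute: selects all attributes of the current node.</p>
<p>4. child: selects all children of the current node.</p>
<p>5. descendant: selects all descendants (children, grandchildren, and so on) of the current node.</p>
<p>6. descendant-or-self: selects all descendants (children, grandchildren, and so on) of the current node, plus the current node itself.</p>
<p>7. following: selects everything in the document after the closing tag of the current node.</p>
<p>8. namespace: selects all namespace nodes of the current node.</p>
<p>9. parent: selects the parent of the current node.</p>
<p>10. preceding: selects all nodes that come before the start tag of the current node, excluding its ancestors.</p>
<p>11. preceding-sibling: selects all siblings before the current node.</p>
<p>12. self: selects the current node.</p>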
<p>Example:</p>
<p>from lxml import etree</p>
<p>doc = """
<?xml version="1.0" encoding="ISO-8859-1"?></p>
<html>
<body>
<bookstore id="test" class="ttt">
<book id= "1" class = "2">
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book id = "2222222222222">11111111111111111111
<title lang="abc">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
<a></a>
</body>
</html>
<p>"""</p>
<p>html = etree.HTML(doc)</p>
<p># axes</p>
<p>print(html.xpath("//bookstore/ancestor::*")) # all ancestor elements</p>
<p># result: [<Element html at 0x1fbeac80848>, <Element body at 0x1fbeac80948>]</p>
<p>print(html.xpath("//bookstore/ancestor::body")) # every ancestor named body</p>
<p># result: [<Element body at 0x203f46f0988>]</p>
<p>print(html.xpath("//bookstore/ancestor-or-self::*")) # all ancestors, plus the node itself</p>
<p># result: [<Element html at 0x203f46f0848>, <Element body at 0x203f46f0948>, <Element bookstore at 0x203f46f09c8>]</p>
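<p>The example above only uses the ancestor axes. As a rough sketch, the other axes can be tried the same way, including following-sibling, which the list above does not mention but which is part of XPath 1.0. The sample is a trimmed copy parsed with etree.XML (an assumption made for this sketch), and tag names are printed instead of Element objects to keep the output readable.</p>

```python
from lxml import etree

# Trimmed copy of the bookstore sample, parsed as XML (an assumption of this
# sketch, so the tree does not depend on HTML error recovery).
doc = """
<html>
  <body>
    <bookstore id="test" class="ttt">
      <book id="1" class="2">
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
      </book>
      <book id="2222222222222">
        <title lang="abc">Learning XML</title>
        <price>39.95</price>
      </book>
    </bookstore>
  </body>
</html>
"""
root = etree.XML(doc)

# child: direct element children only.
children = [e.tag for e in root.xpath("//bookstore/child::*")]
print(children)  # ['book', 'book']

# descendant: children, grandchildren, and so on, in document order.
descendants = [e.tag for e in root.xpath("//bookstore/descendant::*")]
print(descendants)  # ['book', 'title', 'price', 'book', 'title', 'price']

# parent: the element that contains each title.
parents = [e.tag for e in root.xpath("//title/parent::*")]
print(parents)  # ['book', 'book']

# attribute: every attribute value of bookstore.
attrs = root.xpath("//bookstore/attribute::*")
print(attrs)  # ['test', 'ttt']

# following-sibling: siblings after the title inside the first book.
after_title = [e.tag for e in root.xpath("//book[1]/title/following-sibling::*")]
print(after_title)  # ['price']

# preceding-sibling: the book that comes before the second book.
before_second = root.xpath("//book[2]/preceding-sibling::book/title/text()")
print(before_second)  # ['Harry Potter']
```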
<p>Appendix:</p>
<p>tag = html.xpath('//ul[@class="gl-warp clearfix"]/li/div/div[@class="p-img"]/a/img/@src')</p>
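<p># the form below captures the same content in two steps: select the children of the ul first, then run a relative path on one of them (here the first)</p>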
<p>tags = html.xpath('//ul[@class="gl-warp clearfix"]/child::*')
tag = tags[0]
imgpath = tag.xpath('./div/div[@class="p-img"]/a/img/@src')</p>
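<p>To see how the two forms relate, here is a small self-contained sketch. The markup is hypothetical, written only to fit the XPath expressions in the appendix, and the file names a.jpg and b.jpg are placeholders.</p>

```python
from lxml import etree

# Hypothetical product list shaped to match the appendix expressions.
page = """
<ul class="gl-warp clearfix">
  <li><div><div class="p-img"><a href="#"><img src="a.jpg"></a></div></div></li>
  <li><div><div class="p-img"><a href="#"><img src="b.jpg"></a></div></div></li>
</ul>
"""
html = etree.HTML(page)

# One step: a single absolute expression collects every src.
all_src = html.xpath(
    '//ul[@class="gl-warp clearfix"]/li/div/div[@class="p-img"]/a/img/@src'
)
print(all_src)  # ['a.jpg', 'b.jpg']

# Two steps: select the children of the ul, then query each one with a
# relative path that starts at the current node (the leading ".").
items = html.xpath('//ul[@class="gl-warp clearfix"]/child::*')
per_item = [
    item.xpath('./div/div[@class="p-img"]/a/img/@src') for item in items
]
print(per_item)  # [['a.jpg'], ['b.jpg']]

# Flattening the per-item lists gives the same values as the one-step query.
flat = [src for srcs in per_item for src in srcs]
```

<p>The two-step form is handy when several fields have to be read from each item, since every relative query stays scoped to one li.</p>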