Python


xpath

<p>XPath syntax</p>
<p>XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path, which is made up of steps.</p>
<p>Selecting nodes: XPath provides six basic expressions for selecting nodes, and they can be combined.</p>
<p>1. nodename (a node name such as div, a, or book): selects the child elements of the current node that have this name;</p>
<p>2. / : selects starting from the root node;</p>
<p>3. // : selects matching nodes anywhere in the document below the current node, regardless of their position;</p>
<p>4. . : selects the current node;</p>
<p>5. .. : selects the parent of the current node;</p>
<p>6. @ : selects attributes.</p>
<p>Example:</p>
<p>from lxml import etree</p>
<p>doc = &quot;&quot;&quot; &lt;?xml version=&quot;1.0&quot; encoding=&quot;ISO-8859-1&quot;?&gt;</p>
<p>&lt;html&gt; &lt;body&gt; &lt;bookstore id=&quot;test&quot; class=&quot;ttt&quot;&gt; &lt;book id=&quot;1&quot; class=&quot;2&quot;&gt; &lt;title lang=&quot;eng&quot;&gt;Harry Potter&lt;/title&gt; &lt;price&gt;29.99&lt;/price&gt; &lt;/book&gt; &lt;book id=&quot;2222222222222&quot;&gt;11111111111111111111 &lt;title lang=&quot;abc&quot;&gt;Learning XML&lt;/title&gt; &lt;price&gt;39.95&lt;/price&gt; &lt;/book&gt; &lt;/bookstore&gt; &lt;a&gt;&lt;/a&gt; &lt;/body&gt; &lt;/html&gt;</p>
<p>&quot;&quot;&quot;</p>
<p>html = etree.HTML(doc)</p>
<p>print(html.xpath(&quot;body&quot;)) # a relative path: body children of the html element</p>
<p># result: [&lt;Element body at 0x24ea98109c8&gt;]</p>
<p># / searches only from the root; // searches the whole document for matching tags</p>
<p>print(html.xpath(&quot;/bookstore&quot;)) # an absolute path starts at the document root, whose only element is html, so nothing matches</p>
<p># result: []</p>
<p>print(html.xpath(&quot;//bookstore&quot;)) # every match in the whole document</p>
<p># result: [&lt;Element bookstore at 0x241ba69f948&gt;]</p>
<p>print(html.xpath(&quot;//bookstore[@class='ttt']//book&quot;)) # the second // finds book tags at any depth below the matched bookstore</p>
<p># result: [&lt;Element book at 0x13ddc771948&gt;, &lt;Element book at 0x13ddc771988&gt;]</p>
<p>print(html.xpath(&quot;//book&quot;))</p>
<p># result: [&lt;Element book at 0x13ddc771948&gt;, &lt;Element book at 0x13ddc771988&gt;]</p>
<p>print(html.xpath(&quot;//*&quot;)) # * is a wildcard</p>
<p># result: [&lt;Element html at 0x13ddc771848&gt;, &lt;Element body at 0x13ddc771948&gt;, &lt;Element bookstore at 0x13ddc771988&gt;, &lt;Element book at 0x13ddc7719c8&gt;, &lt;Element title at 0x13ddc771a08&gt;, &lt;Element price at 0x13ddc771a88&gt;, &lt;Element book at 0x13ddc771ac8&gt;, &lt;Element title at 0x13ddc771b08&gt;, &lt;Element price at 0x13ddc771b48&gt;, &lt;Element a at 0x13ddc771a48&gt;]</p>
<p>For more precise queries, XPath also provides predicates: restricting conditions that are written inside square brackets.</p>
<p># Select by index</p>
<p>print(html.xpath(&quot;//bookstore/book[1]/title/text()&quot;)) # the first book (positions start at 1)</p>
<p># result: ['Harry Potter']</p>
<p>print(html.xpath(&quot;//bookstore/book[last()-1]/title/text()&quot;)) # last() is the last one, last()-1 the second to last</p>
<p># result: ['Harry Potter']</p>
<p>print(html.xpath(&quot;//bookstore/book[position()&gt;1]/title/text()&quot;)) # position greater than 1</p>
<p># result: ['Learning XML']</p>
<p># Restrict by attribute</p>
<p># Any element that has a lang attribute</p>
<p>print(html.xpath(&quot;//*[@lang]&quot;))</p>
<p># result: [&lt;Element title at 0x26867410948&gt;, &lt;Element title at 0x26867410908&gt;]</p>
<p># Any element that has at least one attribute: @ means attribute, * is a wildcard</p>
<p>print(html.xpath(&quot;//*[@*]&quot;))</p>
<p># result: [&lt;Element bookstore at 0x224c1d41a08&gt;, &lt;Element book at 0x224c1d41a48&gt;, &lt;Element title at 0x224c1d419c8&gt;, &lt;Element book at 0x224c1d41a88&gt;, &lt;Element title at 0x224c1d41ac8&gt;]</p>
<p>When several paths should be matched, &quot;|&quot; combines them into a union of the alternatives.</p>
<p># Multiple match conditions</p>
<p>print(html.xpath(&quot;//title|//price&quot;))</p>
<p># result: [&lt;Element title at 0x1d6f4ec0b48&gt;, &lt;Element price at 0x1d6f4ec0a48&gt;, &lt;Element title at 0x1d6f4ec0bc8&gt;, &lt;Element price at 0x1d6f4ec0a88&gt;]</p>
<p>XPath also provides axes, which define a node set relative to the current node.</p>
<p>1. ancestor: selects all ancestors (parent, grandparent, and so on) of the current node.</p>
<p>2. ancestor-or-self: selects all ancestors (parent, grandparent, and so on) of the current node, plus the current node itself.</p>
<p>3. attribute: selects all attributes of the current node.</p>
<p>4. child: selects all children of the current node.</p>
<p>5. descendant: selects all descendants (children, grandchildren, and so on) of the current node.</p>
<p>6. descendant-or-self: selects all descendants (children, grandchildren, and so on) of the current node, plus the current node itself.</p>
<p>7. following: selects every node in the document after the closing tag of the current node.</p>
<p>8. namespace: selects all namespace nodes of the current node.</p>
<p>9. parent: selects the parent of the current node.</p>
<p>10. preceding: selects every node in the document before the opening tag of the current node, excluding its ancestors.</p>
<p>11. preceding-sibling: selects all siblings before the current node.</p>
<p>12. self: selects the current node.</p>
<p>Example:</p>
<p>from lxml import etree</p>
<p>doc = &quot;&quot;&quot; &lt;?xml version=&quot;1.0&quot; encoding=&quot;ISO-8859-1&quot;?&gt;</p>
<p>&lt;html&gt; &lt;body&gt; &lt;bookstore id=&quot;test&quot; class=&quot;ttt&quot;&gt; &lt;book id=&quot;1&quot; class=&quot;2&quot;&gt; &lt;title lang=&quot;eng&quot;&gt;Harry Potter&lt;/title&gt; &lt;price&gt;29.99&lt;/price&gt; &lt;/book&gt; &lt;book id=&quot;2222222222222&quot;&gt;11111111111111111111 &lt;title lang=&quot;abc&quot;&gt;Learning XML&lt;/title&gt; &lt;price&gt;39.95&lt;/price&gt; &lt;/book&gt; &lt;/bookstore&gt; &lt;a&gt;&lt;/a&gt; &lt;/body&gt; &lt;/html&gt;</p>
<p>&quot;&quot;&quot;</p>
<p>html = etree.HTML(doc)</p>
<p># Axes</p>
<p>print(html.xpath(&quot;//bookstore/ancestor::*&quot;)) # all ancestors</p>
<p># result: [&lt;Element html at 0x1fbeac80848&gt;, &lt;Element body at 0x1fbeac80948&gt;]</p>
<p>print(html.xpath(&quot;//bookstore/ancestor::body&quot;)) # all ancestors named body</p>
<p># result: [&lt;Element body at 0x203f46f0988&gt;]</p>
<p>print(html.xpath(&quot;//bookstore/ancestor-or-self::*&quot;)) # all ancestors, including the node itself</p>
<p># result: [&lt;Element html at 0x203f46f0848&gt;, &lt;Element body at 0x203f46f0948&gt;, &lt;Element bookstore at 0x203f46f09c8&gt;]</p>
<p>Appendix:</p>
<p>tag = html.xpath('//ul[@class=&quot;gl-warp clearfix&quot;]/li/div/div[@class=&quot;p-img&quot;]/a/img/@src')</p>
<p># The code below captures the same kind of content in two steps: select the children of the ul first, then query relative to one of them (here only the first child)</p>
<p>tags = html.xpath('//ul[@class=&quot;gl-warp clearfix&quot;]/child::*')</p>
<p>tag = tags[0]</p>
<p>imgpath = tag.xpath('./div/div[@class=&quot;p-img&quot;]/a/img/@src')</p>
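<p>The examples above only exercise the ancestor axes, so here is a minimal sketch of a few others from the list, plus a relative query like the one in the appendix. It rests on some assumptions that are not in the original: the sample is trimmed to a well-formed bookstore fragment and parsed as XML with etree.fromstring instead of etree.HTML (so the tree is exactly what is written), and the helper names() is hypothetical, used only to print tag names instead of memory addresses.</p>

```python
from lxml import etree

# Assumption: a trimmed, well-formed copy of the bookstore sample, parsed as XML.
xml = b"""<bookstore id="test" class="ttt">
  <book id="1" class="2">
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book id="2">
    <title lang="abc">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml)


def names(nodes):
    """Hypothetical helper: return tag names instead of Element objects."""
    return [node.tag for node in nodes]


# parent axis (long form of ".."): the parent of every price element
parents = names(root.xpath("//price/parent::*"))

# descendant axis: every element below the first book
descendants = names(root.xpath("//book[1]/descendant::*"))

# preceding-sibling axis: elements that come before each price under the same parent
before_price = names(root.xpath("//price/preceding-sibling::*"))

# attribute axis (long form of "@lang"): attribute values are returned as strings
langs = root.xpath("//title/attribute::lang")

# Relative query: "." starts from an element that was already selected
second_book = root.xpath("//book")[1]  # Python list index, so this is the second book
second_title = second_book.xpath("./title/text()")

print(parents)        # ['book', 'book']
print(descendants)    # ['title', 'price']
print(before_price)   # ['title', 'title']
print(langs)          # ['eng', 'abc']
print(second_title)   # ['Learning XML']
```

<p>Note the two different index conventions in this sketch: book[1] inside the XPath expression is the first book, while [1] on the returned Python list is the second item.</p>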
