18. Crawler: matching a href directly with a regex
<p>Example:</p>
<pre><code>#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib2
import re

t = urllib2.urlopen('http://www.so.com')
d = t.read()
# non-greedy group between <a href=" and ">
p = re.compile(r'<a href="(.+?)">')
ret = p.findall(d)
for i in ret:
    print i
</code></pre>
<p>The output looks like this:</p>
<pre><code>http://zhidao.baidu.com/topic/yaan/" target="_blank
http://google.org/personfinder/2013-sichuan-earthquake/" target="_blank
http://gongyi.in.sohu.com/yaan/index.html" target="_blank
http://www.sogou.com/yaan.html" target="_blank
http://weibo.com/yijijin" target="_blank
http://gongyi.weibo.com/140996" target="_blank
http://www.miibeian.gov.cn/
http://info.so.360.cn/feedback.html" data-linkid="1
http://zhanzhang.so.com" data-linkid="2
http://www.360.cn/about/index.html" data-linkid="3
http://www.so.com/help/help_1_1.html" data-linkid="4
http://www.so.com/help/help_iduty.html" data-linkid="4
http://e.360.cn?src=srp" data-linkid="5
</code></pre>
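<p>The trailing " target="_blank and " data-linkid=" fragments appear because the non-greedy group can only stop where the literal "> follows, so any extra attributes before the tag's closing > leak into the capture. A character class that stops at the first closing quote avoids this; a minimal sketch in the same Python 2 style (not from the original):</p>
<pre><code>#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib2
import re

t = urllib2.urlopen('http://www.so.com')
d = t.read()
# [^"]+ stops at the first closing quote, so extra attributes are excluded
p = re.compile(r'<a href="([^"]+)"')
ret = p.findall(d)
for i in ret:
    print i
</code></pre>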
<p>This breaks when the separator between <a and href is not a plain space.
For example:</p>
<pre><code><span\r\nlang=EN-US style=\'font-size:12.0pt\'><a\r\nhref="http://cjc.ict.ac.cn/online/onlinepaper/cx-201811670104.pdf"></code></pre>
<p>Here the pattern has to spell out the line break: <a\r\nhref="(.+?)"></p>
<pre><code>url = 'http://cjc.ict.ac.cn/qwjs/No2018-01.htm'
t=urllib2.urlopen(url)
d=t.read()
p=re.compile(r'<a\r\nhref="(.+?)">')
ret=p.findall(d)
print ret</code></pre>
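<p>Hard-coding \r\n ties the pattern to this one page. As a sketch (not in the original), \s+ matches any run of whitespace, including spaces, tabs and CR/LF, so a single pattern covers both the space and line-break cases:</p>
<pre><code>import urllib2
import re

url = 'http://cjc.ict.ac.cn/qwjs/No2018-01.htm'
d = urllib2.urlopen(url).read()
# \s+ covers a plain space as well as the \r\n seen on this page
p = re.compile(r'<a\s+href="([^"]+)"')
print p.findall(d)
</code></pre>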
<h3>Complete example</h3>
<pre><code>from bs4 import BeautifulSoup
import urllib.request
import re
# import urllib.request
from lxml import etree
# import pandas as pd
# 返回html的soup解析
def openUrl(url):
#headers = {'User-Agent': 'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
it_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.1 Safari/605.1.15'}
req = urllib.request.Request(url, headers=it_header)
response = urllib.request.urlopen(req) #请求
html = response.read().decode("gb2312")
#print(html)
Soup = BeautifulSoup(html, 'lxml')
return Soup
url = 'http://cjc.ict.ac.cn/qwjs/No2018-01.htm'
# Soup = openUrl(url)
page = urllib.request.urlopen(url)
html = page.read().decode("gb2312")
fanye_urls = re.findall(re.compile(r'<a\r\nhref="(.+?)">'), html, flags=0)
fanye_urls</code></pre>
<pre><code>['http://cjc.ict.ac.cn/online/onlinepaper/liuq-201811662728.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/pyh-201811662750.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/lx-201811663141.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/pyt-201811664814.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/zy-201811664924.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/zql-201811665026.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/cdh-201811665125.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/zj-201811670401.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/hym-201811665355.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/tw-201811665510.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/lxx-201811665617.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/cxf-201811665811.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/lch-201811665911.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/wlp-201811670015.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/cx-201811670104.pdf']</code></pre>
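<p>The openUrl helper above is defined but never called. For reference, the same links can be collected with BeautifulSoup instead of a regex, which does not care how the attributes inside the tag are separated; a sketch, assuming the page layout is unchanged:</p>
<pre><code>soup = openUrl('http://cjc.ict.ac.cn/qwjs/No2018-01.htm')
# find_all('a', href=True) returns every anchor that carries an href,
# regardless of whitespace inside the tag; keep only the PDF links
pdf_urls = [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].endswith('.pdf')]
print(pdf_urls)
</code></pre>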