18. Crawler: matching a href directly with a regex
<p>Example:</p>
<pre><code>#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib2
import re

t = urllib2.urlopen('http://www.so.com')
d = t.read()
# non-greedy group between <a href=" and ">
p = re.compile(r'<a href="(.+?)">')
ret = p.findall(d)
for i in ret:
    print i
</code></pre>
<p>The output looks like this:</p>
<pre><code>http://zhidao.baidu.com/topic/yaan/" target="_blank
http://google.org/personfinder/2013-sichuan-earthquake/" target="_blank
http://gongyi.in.sohu.com/yaan/index.html" target="_blank
http://www.sogou.com/yaan.html" target="_blank
http://weibo.com/yijijin" target="_blank
http://gongyi.weibo.com/140996" target="_blank
http://www.miibeian.gov.cn/
http://info.so.360.cn/feedback.html" data-linkid="1
http://zhanzhang.so.com" data-linkid="2
http://www.360.cn/about/index.html" data-linkid="3
http://www.so.com/help/help_1_1.html" data-linkid="4
http://www.so.com/help/help_iduty.html" data-linkid="4
http://e.360.cn?src=srp" data-linkid="5
</code></pre>
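<p>The trailing " target="_blank and " data-linkid=" fragments appear because the non-greedy group can only stop where the literal "> follows, so any extra attributes before the tag's closing > leak into the capture. A character class that stops at the first closing quote avoids this; a minimal sketch in the same Python 2 style (not from the original):</p>
<pre><code>#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib2
import re

t = urllib2.urlopen('http://www.so.com')
d = t.read()
# [^"]+ stops at the first closing quote, so extra attributes are excluded
p = re.compile(r'<a href="([^"]+)"')
ret = p.findall(d)
for i in ret:
    print i
</code></pre>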
<p>This breaks when the separator between <a and href is not a plain space.
For example:</p>
<pre><code><span\r\nlang=EN-US style=\'font-size:12.0pt\'><a\r\nhref="http://cjc.ict.ac.cn/online/onlinepaper/cx-201811670104.pdf"></code></pre>
<p>Here the pattern has to spell out the line break: <a\r\nhref="(.+?)"></p>
<pre><code>url = 'http://cjc.ict.ac.cn/qwjs/No2018-01.htm'
t=urllib2.urlopen(url)
d=t.read()
p=re.compile(r'<a\r\nhref="(.+?)">')
ret=p.findall(d)
print ret</code></pre>
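<p>Hard-coding \r\n ties the pattern to this one page. As a sketch (not in the original), \s+ matches any run of whitespace, including spaces, tabs and CR/LF, so a single pattern covers both the space and line-break cases:</p>
<pre><code>import urllib2
import re

url = 'http://cjc.ict.ac.cn/qwjs/No2018-01.htm'
d = urllib2.urlopen(url).read()
# \s+ covers a plain space as well as the \r\n seen on this page
p = re.compile(r'<a\s+href="([^"]+)"')
print p.findall(d)
</code></pre>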
<h3>Complete example</h3>
<pre><code>from bs4 import BeautifulSoup
import urllib.request
import re
# import urllib.request
from lxml import etree
# import pandas as pd
# 返回html的soup解析
def openUrl(url):
#headers = {'User-Agent': 'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
it_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.1 Safari/605.1.15'}
req = urllib.request.Request(url, headers=it_header)
response = urllib.request.urlopen(req) #请求
html = response.read().decode("gb2312")
#print(html)
Soup = BeautifulSoup(html, 'lxml')
return Soup
url = 'http://cjc.ict.ac.cn/qwjs/No2018-01.htm'
# Soup = openUrl(url)
page = urllib.request.urlopen(url)
html = page.read().decode("gb2312")
fanye_urls = re.findall(re.compile(r'<a\r\nhref="(.+?)">'), html, flags=0)
fanye_urls</code></pre>
<pre><code>['http://cjc.ict.ac.cn/online/onlinepaper/liuq-201811662728.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/pyh-201811662750.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/lx-201811663141.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/pyt-201811664814.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/zy-201811664924.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/zql-201811665026.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/cdh-201811665125.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/zj-201811670401.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/hym-201811665355.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/tw-201811665510.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/lxx-201811665617.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/cxf-201811665811.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/lch-201811665911.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/wlp-201811670015.pdf',
'http://cjc.ict.ac.cn/online/onlinepaper/cx-201811670104.pdf']</code></pre>
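<p>The openUrl helper above is defined but never called. For reference, the same links can be collected with BeautifulSoup instead of a regex, which does not care how the attributes inside the tag are separated; a sketch, assuming the page layout is unchanged:</p>
<pre><code>soup = openUrl('http://cjc.ict.ac.cn/qwjs/No2018-01.htm')
# find_all('a', href=True) returns every anchor that carries an href,
# regardless of whitespace inside the tag; keep only the PDF links
pdf_urls = [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].endswith('.pdf')]
print(pdf_urls)
</code></pre>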