Python Web Scraping
<h4>Author: 谭文</h4>
<h4>Scraping Weibo hot search keywords takes three main steps</h4>
<p>1. Fetch the web page response
2. Parse the data and filter out the parts we need
3. Store the data</p>
<p>Preparation: import the required packages</p>
<pre><code class="language-python">import json
import requests
from bs4 import BeautifulSoup</code></pre>
<h4>1. Fetch the web page response</h4>
<p>url = "<a href="https://s.weibo.com/top/summary?cate=realtimehot">https://s.weibo.com/top/summary?cate=realtimehot</a>"
HEADERS= {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
res = requests.get(url,headers = HEADERS)
text = res.text</p>
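<p>The script assumes the request succeeds. As an optional hardening step that is not part of the original code, the same request could be made with a timeout and a fast failure on HTTP errors before anything is parsed:</p>
<pre><code class="language-python"># Optional variant of the request above with basic error handling
res = requests.get(url, headers=HEADERS, timeout=10)
res.raise_for_status()                  # raises requests.HTTPError on a 4xx/5xx status
res.encoding = res.apparent_encoding    # guard against a mis-detected character set
text = res.text</code></pre>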
<h4>2. Parse the data and filter out the parts we need</h4>
<pre><code class="language-python"># Create a BeautifulSoup instance
soup = BeautifulSoup(text, 'lxml')

# Locate the div tag whose class attribute is "data"
div = soup.find('div', attrs={"class": "data"})

# Locate the tbody tag inside it
tbody = div.find("tbody")

# Locate all the tr tags (one per hot search entry)
trs = tbody.find_all("tr")
hots = []

# Loop over the tr tags, take the second td cell, and keep its text
for tr in trs:
    td = tr.find_all('td')[1]
    # print(td)  # debug: inspect the raw td element
    realtimehot = list(td.stripped_strings)[0]
    hots.append(realtimehot)
print(hots)

# Convert the list into a JSON string
hots = json.dumps(hots, ensure_ascii=False)</code></pre>
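<p>For reference, stripped_strings yields each text node of the cell with surrounding whitespace removed, which is why index [0] is the keyword. A quick illustration, reusing the trs variable defined above (the exact second value depends on the row):</p>
<pre><code class="language-python"># The second td cell of a data row holds two text nodes:
# the keyword text and, on normal rows, the heat count
sample_td = trs[1].find_all('td')[1]
print(list(sample_td.stripped_strings))  # e.g. ['some keyword', '1234567'], so [0] is the keyword</code></pre>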
<h4>3. Store the data</h4>
<pre><code class="language-python">with open("realtime_hot.txt", "w", encoding='utf-8') as f:
    f.write(hots)</code></pre>
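<p>An equivalent way to store the same data is to serialize the list straight into the file with json.dump, which skips the intermediate json.dumps string. A minimal sketch, assuming hots is still the Python list (i.e. the json.dumps line above is dropped) and an arbitrary file name:</p>
<pre><code class="language-python"># Alternative: write the keyword list directly as JSON
with open("realtime_hot.json", "w", encoding="utf-8") as f:
    json.dump(hots, f, ensure_ascii=False, indent=2)</code></pre>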
<p>Run the script, and you can see that the top 50 Weibo hot search terms have been scraped successfully.
<img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-08-23/5d5f4c13337d4.png" alt="" /></p>
<h4>Method 2: extract the hot search keywords with a regular expression</h4>
<p><img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-08-26/5d63473031d53.png" alt="" />
Run the script to get the hot search keywords:
<img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-08-26/5d6346c7a62d9.png" alt="" /></p>
<hr />
<h3>Comments</h3>
<h4>Author: 燎江</h4>
<h4>Method 3:</h4>
<pre><code class="language-python">import requests
from lxml import etree
# 1、获取响应的网页信息
url = "https://s.weibo.com/top/summary?cate=realtimehot"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
}
res = requests.get(url, headers=HEADERS)
text = res.text
# 这里我用的是lxml库中etree
html = etree.HTML(text)
# 使用的是xpath方法
trs = html.xpath("//tbody/tr[position()>1]")
lis = []
for tr in trs:
# 获取排名
num = tr.xpath("./td[@class='td-01 ranktop']/text()")[0]
# 获取热搜词
name = tr.xpath("./td[@class='td-02']/a/text()")[0]
# 存入字典中
dic = {
u'排名': num,
u'热搜词': name
}
lis.append(dic)
for dic in lis:
with open("realtime_hot2.txt", 'a+', encoding="utf-8") as f:
f.write(str(dic) + "\n")</code></pre>
<p>Result screenshot (partial)</p>
<p><img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-09-12/5d7a0761df4e7.png" alt="" /></p>
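<p>As a variation on Method 3, str(dic) writes Python dict literals to the file; if you want output that other tools can open directly, the csv module is an option. A minimal sketch, assuming lis is the list built in the code above and an arbitrary output file name:</p>
<pre><code class="language-python">import csv

# Alternative output for Method 3: one CSV row per hot search entry
with open("realtime_hot2.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[u'排名', u'热搜词'])
    writer.writeheader()
    writer.writerows(lis)</code></pre>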