
Python Web Scraper

<h4>Author: 谭文</h4>
<h4>Scraping the Weibo hot-search keywords takes three main steps</h4>
<p>1. Fetch the page response. 2. Parse the data and keep only the parts we need. 3. Store the data.</p>
<p>Preparation: import the packages</p>
<pre><code class="language-python">import json

import requests
from bs4 import BeautifulSoup</code></pre>
<h4>1. Fetch the page response</h4>
<pre><code class="language-python">url = "https://s.weibo.com/top/summary?cate=realtimehot"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
res = requests.get(url, headers=HEADERS)
text = res.text</code></pre>
<h4>2. Parse the data and keep only the parts we need</h4>
<pre><code class="language-python"># Build the parser
soup = BeautifulSoup(text, 'lxml')

# Locate the div tag whose class attribute is "data"
div = soup.find('div', attrs={"class": "data"})

# Locate the tbody tag
tbody = div.find("tbody")

# Locate all tr tags
trs = tbody.find_all("tr")
hots = []

# Loop over the tr tags, take the second td of each row and read its text
for tr in trs:
    td = tr.find_all('td')[1]
    print(td)
    # The first text fragment in the cell is the hot-search term
    realtimehot = list(td.stripped_strings)[0]
    hots.append(realtimehot)
print(hots)

# Serialize the list to a JSON string, keeping Chinese characters unescaped
hots = json.dumps(hots, ensure_ascii=False)</code></pre>
<h4>3. Store the data</h4>
<pre><code class="language-python">with open("realtime_hot.txt", "w", encoding='utf-8') as f:
    f.write(hots)</code></pre>
<p>Run the script and you can see that the top 50 Weibo hot-search terms have been scraped successfully. <img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-08-23/5d5f4c13337d4.png" alt="" /></p>
<h4>Method 2: extract the hot-search terms with a regular expression</h4>
<p><img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-08-26/5d63473031d53.png" alt="" /> Run the script to get the hot-search terms. <img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-08-26/5d6346c7a62d9.png" alt="" /></p>
<hr />
<h3>Comments</h3>
<h4>Author: 燎江</h4>
<h4>Method 3:</h4>
<pre><code class="language-python">import requests
from lxml import etree

# 1. Fetch the page response
url = "https://s.weibo.com/top/summary?cate=realtimehot"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
}
res = requests.get(url, headers=HEADERS)
text = res.text

# Here I use etree from the lxml library
html = etree.HTML(text)

# Select the rows with XPath, skipping the first tr
trs = html.xpath("//tbody/tr[position()&gt;1]")
lis = []
for tr in trs:
    # Get the rank
    num = tr.xpath("./td[@class='td-01 ranktop']/text()")[0]
    # Get the hot-search term
    name = tr.xpath("./td[@class='td-02']/a/text()")[0]
    # Store both in a dict (keys mean "rank" and "hot-search term")
    dic = {
        '排名': num,
        '热搜词': name
    }
    lis.append(dic)

# Append one dict per line; the file is opened once rather than once per row
with open("realtime_hot2.txt", 'a+', encoding="utf-8") as f:
    for dic in lis:
        f.write(str(dic) + "\n")</code></pre>
<p>Result screenshot (partial)</p>
<h3><img src="http://rpddoc.weoa.com/server/../Public/Uploads/2019-09-12/5d7a0761df4e7.png" alt="" /></h3>
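<p>The code for Method 2 is only available as a screenshot, so the following is a minimal sketch of what a regex-based extraction could look like, not the original author's script. It assumes each hot-search term is the link text inside a <code>td</code> with class <code>td-02</code>, which is the cell the XPath in Method 3 selects. The names <code>ROW_PATTERN</code>, <code>extract_hot_terms</code> and <code>SAMPLE_HTML</code> are hypothetical, and the sample markup is invented so the example runs without a network request.</p>

```python
import re

# Assumed cell shape: <td class="td-02"> followed by a link whose text is the term.
# re.S lets the pattern span line breaks inside a cell.
ROW_PATTERN = re.compile(r'<td class="td-02">\s*<a[^>]*>(.*?)</a>', re.S)


def extract_hot_terms(html_text):
    """Return the link text of every td-02 cell, in page order."""
    return [term.strip() for term in ROW_PATTERN.findall(html_text)]


# Hypothetical sample markup, made up for illustration only
SAMPLE_HTML = """
<tbody>
<tr><td class="td-01"><i class="icon-top"></i></td>
<td class="td-02"><a href="/weibo?q=pinned">pinned topic</a></td></tr>
<tr><td class="td-01 ranktop">1</td>
<td class="td-02">
  <a href="/weibo?q=first">first topic</a><span>12345</span></td></tr>
<tr><td class="td-01 ranktop">2</td>
<td class="td-02"><a href="/weibo?q=second" target="_blank">second topic</a></td></tr>
</tbody>
"""

if __name__ == "__main__":
    print(extract_hot_terms(SAMPLE_HTML))
```

<p>On the live page, <code>res.text</code> from step 1 would be passed in place of <code>SAMPLE_HTML</code>. A regular expression is tied to the exact markup, so if the page layout changes the pattern may need adjusting. The parser-based methods tend to tolerate such changes better.</p>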
