Python


requests

<p>requests: sends HTTP requests and receives responses.</p>
<ol>
<li>If a page loads in the browser but not through requests, the last resort is to copy all of the browser's request headers into the headers of the requests call.</li>
</ol>
<pre><code>import requests
from bs4 import BeautifulSoup

r1 = requests.get(
    url='https://dig.chouti.com/',
    headers={
        # Note: if the browser can access the page but requests cannot,
        # the last resort is to copy all of the browser's request headers here.
        # This User-Agent makes the request look like it comes from Chrome.
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
)

soup = BeautifulSoup(r1.text, 'html.parser')

# find() returns a single Tag object
content_list = soup.find(name='div', id='content-list')
# print(content_list)

# find_all() returns a list of Tag objects: [Tag, Tag]
item_list = content_list.find_all(name='div', attrs={'class': 'item'})
for item in item_list:
    a = item.find(name='a', attrs={'class': 'show-content color-chag'})
    if a is not None:  # skip items that have no matching link
        print(a.text.strip())
        # print(a.text)
</code></pre>
<ol start="2">
<li>Referer (lets the target site check which page a request came from, used to prevent hotlinking). HTTP defines a request header field called Referer, which holds the URL of the page that linked to the current page or file. In other words, a site can use Referer to detect the page a request originated from; for a resource file, it can trace the address of the page that displays it. Once the source is known, the site can act on it: if it detects that the source is not the site itself, it blocks the request or returns a designated page.</li>
</ol>
<pre><code>import re
import requests

# Step 1: load the login page to obtain the anti-forgery token, code and cookies.
r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    }
)
X_Anti_Forge_Token = re.findall("X_Anti_Forge_Token = '(.*?)'", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall("X_Anti_Forge_Code = '(.*?)'", r1.text, re.S)[0]
# print(X_Anti_Forge_Token, X_Anti_Forge_Code)
# print(r1.text)

# Step 2: submit the login form with the token, code, Referer and cookies from step 1.
r2 = requests.post(
    url='https://passport.lagou.com/login/login.json',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        # Referer: the address of the previous request
        'Referer': 'https://passport.lagou.com/login/login.html',
    },
    data={
        "isValidate": True,
        'username': '15131255089',
        'password': 'ab18d270d7126ea65915c50288c22c0d',
        'request_form_verifyCode': '',
        'submit': ''
    },
    cookies=r1.cookies.get_dict()
)
print(r2.text)
</code></pre>
<ol start="3">
<li>requests parameters</li>
</ol>
<p>url, headers, data, cookies, json, params, proxies, timeout</p>
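<p>To see what each of these parameters does, the following is a minimal sketch rather than a definitive recipe. It builds requests with <code>requests.Request(...).prepare()</code> without sending them, so the effect of params, headers, cookies, data and json can be inspected offline. The URL <code>https://example.com/search</code>, the helper names <code>prepare</code> and <code>send</code>, the sample values and the proxy address are all assumptions made for illustration. timeout and proxies only matter when a request is actually sent, so they appear in a helper that is defined but not called.</p>

```python
import json

import requests

URL = "https://example.com/search"  # placeholder URL, not from the original text


def prepare(method, **kwargs):
    """Build a request without sending it, so the result can be inspected."""
    return requests.Request(method, URL, **kwargs).prepare()


# params are encoded into the query string; headers and cookies are attached as given.
get_req = prepare(
    "GET",
    params={"q": "python", "page": 2},
    headers={"User-Agent": "Mozilla/5.0"},
    cookies={"sid": "abc"},
)
print(get_req.url)
print(get_req.headers["Cookie"])

# data becomes a form-encoded body.
form_req = prepare("POST", data={"username": "alice", "password": "secret"})
print(form_req.headers["Content-Type"], form_req.body)

# json becomes a JSON-serialised body.
json_req = prepare("POST", json={"username": "alice"})
print(json_req.headers["Content-Type"], json.loads(json_req.body))


def send(prepared):
    """timeout and proxies take effect only when the request is actually sent."""
    with requests.Session() as session:
        return session.send(
            prepared,
            timeout=5,  # seconds allowed for connecting and for each read
            proxies={"https": "http://127.0.0.1:8080"},  # hypothetical local proxy
        )
```

<p>In everyday use the same keyword arguments are passed directly to <code>requests.get</code> or <code>requests.post</code>; preparing the request first is only a convenient way to check what will go over the wire.</p>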
