Python爬虫实现HTTP网络请求多种实现方式

2024-12-01 03:45:07

1、通过urllib.requests模块实现发送请求并读取网页内容的简单示例如下：

#导入模块
import urllib.request
#打开需要爬取的网页
response = urllib.request.urlopen('http://www.baidu.com')
#读取网页代码
html = response.read()
#打印读取的内容
print(html)

结果：

b'<!DOCTYPE html><!--STATUS OK-->\n\n\n \n \n       <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><meta name="description" content="\xe5\x85\xa8\xe7\x90\x83\xe6\x9c\x80\xe5\xa4\xa7\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x81\xe8\x87\xb4\xe5\x8a\x9b\xe4\xba\x8e\xe8\xae\xa9\xe7\xbd\x91\xe6\xb0\x91\xe6\x9b\xb4\xe4\xbe\xbf\xe6\x8d\xb7\xe5\x9c\xb0\xe8\x8e\xb7\xe5\x8f\x96\xe4\xbf\xa1\xe6\x81\xaf\xef\xbc\x8c\xe6\x89\xbe\xe5\x88\xb0\xe6\x89\x80\xe6\xb1\x82\xe3\x80\x82\xe7\x99\xbe\xe5\xba\xa6\xe8\xb6\x85\xe8\xbf\x87\xe5\x8d\x83\xe4\xba\xbf\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91\xe9\xa1\xb5\xe6\x95\xb0\xe6\x8d\xae\xe5\xba\x93\xef\xbc\x8c\xe5\x8f\xaf\xe4\xbb\xa5\xe7\x9e\xac\xe9\x97\xb4\xe6\x89\xbe\xe5\x88\xb0\xe7\x9b\xb8\xe5\x85\xb3\xe7\x9a\x84\xe6\x90\x9c\xe7\xb4\xa2\xe7\xbb\x93\xe6\x9e\x9c\xe3\x80\x82"><link rel="shortcut icon" href="/favicon.ico" rel="external nofollow" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" rel="external nofollow" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg" rel="external nofollow" ><link rel="dns-prefetch" href="//dss0.bdstatic.com" rel="external nofollow" /><link rel="dns-prefetch" href="//dss1.bdstatic.com" rel="external nofollow" /><link rel="dns-prefetch" href="//ss1.bdstatic.com" rel="external nofollow" /><link rel="dns-prefetch" href="//sp0.baidu.com" rel="external nofollow" /><link rel="dns-prefetch" href="//sp1.baidu.com" rel="external nofollow" /><link rel="dns-prefetch" href="//sp2.baidu.com" rel="external nofollow" /><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title><style index="newi" type="text/css">#form .bdsug{top:39px}.bdsug{display:none;position:absolute;width:535px;background:#fff;border:1px solid
………………（太多省略）

以上示例中是通过get请求方式获取百度的网页内容。

下面是通过urllib.request模块的post请求实现获取网页信息的内容：

#导入模块
import urllib.parse
import urllib.request
#将数据使用urlencode编码处理后，再使用encoding设置为utf-8编码
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
#打开指定需要爬取的网页
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
html = response.read()
#打印读取的内容
print(html)

结果：

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-5ec3f607-00f717e823a5c268fe0e0be8"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "http://httpbin.org/post"\n}\n'

2、urllib3模块

通过urllib3模块实现发送网络请求的示例代码：

#导入模块
import urllib3
#创建PoolManager对象，用于处理与线程池的连接以及线程安全的所有细节
http = urllib3.PoolManager()
#对需要爬取的网页发送请求
response = http.request('GET','https://www.baidu.com/')
#打印读取的内容
print(response.data)

结果：

b'<!DOCTYPE html><!--STATUS OK-->\r\n<html>\r\n<head>\r\n\t<meta http-equiv="content-type" content="text/html;charset=utf-8">\r\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n\t<link rel="dns-prefetch" href="//s1.bdstatic.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//t1.baidu.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//t2.baidu.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//t3.baidu.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//t10.baidu.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//t11.baidu.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//t12.baidu.com" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//b1.bdstatic.com" rel="external nofollow" />\r\n\t<title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\r\n\t<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/css/index.css" rel="external nofollow" rel="stylesheet" type="text/css" />\r\n\t<!--[if lte IE 8]><style index="index" >#content{height:480px\\9}#m{top:260px\\9}</style><![endif]-->\r\n\t<!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->\r\n\t<script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}
…………………………（太多省略）

post请求实现获取网页信息的内容：

#导入模块
import urllib3
#创建PoolManager对象，用于处理与线程池的连接以及线程安全的所有细节
http = urllib3.PoolManager()
#对需要爬取的网页发送请求
response = http.request('POST','http://httpbin.org/post',fields={'word':'hello'})
#打印读取的内容
print(response.data)

结果：

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "128", \n "Content-Type": "multipart/form-data; boundary=06ff68d7a4a22f600244a70bf9382ab2", \n "Host": "httpbin.org", \n "X-Amzn-Trace-Id": "Root=1-5ec3f8c3-9f33c46c1c1b37f6774b84f2"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "http://httpbin.org/post"\n}\n'

3、requests模块

以GET请求方式为例，打印多种请求信息的代码:

#导入模块
import requests
#对需要爬取的网页发送请求
response = requests.get('http://www.baidu.com')
#打印状态码
print('状态码:',response.status_code)
#打印请求url
print('url:',response.url)
#打印头部信息
print('header:',response.headers)
#打印cookie信息
print('cookie:',response.cookies)
#以文本形式打印网页源码
print('text:',response.text)
#以字节流形式打印网页源码
print('content:',response.content)

结果：

状态码: 200
url: http://www.baidu.com/
header: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 19 May 2020 15:28:30 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
cookie: <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
text: <!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
………………（此处省略）
content: b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
………………（此处省略）

以POST请求方式，发送HTTP网页请求的示例：

#导入模块
import requests
#表单参数
data = {'word':'hello'}
#对需要爬取的网页发送请求
response = requests.post('http://httpbin.org/post',data=data)
#以字节流形式打印网页源码
print(response.content)

结果：

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.23.0", \n "X-Amzn-Trace-Id": "Root=1-5ec3fc97-965139d919e5a08e8135e731"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "http://httpbin.org/post"\n}\n'

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持我们。

python通过get,post方式发送http请求和接收http响应的方法

本文实例讲述了python通过get,post方式发送http请求和接收http响应的方法.分享给大家供大家参考.具体如下: 测试用CGI,名字为test.py,放在apache的cgi-bin目录下: #!/usr/bin/python import cgi def main(): print "Content-type: text/html\n" form = cgi.FieldStorage() if form.has_key("ServiceCode") a
如何基于Python + requests实现发送HTTP请求

这篇文章主要介绍了如何基于Python + requests实现发送HTTP请求,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友可以参考下一.在接口自动化测试过程中,存在两种情况: 一种是不需要鉴权的接口,可以直接访问的. 还有一种情况是需要鉴权才可以访问的接口. 这里我们通过Python + requests 实现这两种发送请求的方法 """ ============================ author:Treasure丶
python 请求服务器的实现代码(http请求和https请求)

一.http请求 1.http请求方式:get和post get一般用于获取/查询资源信息,在浏览器中直接输入url+请求参数点击enter之后连接成功服务器就能获取到的内容,post请求一般用于更新资源,通过form表单或者json.xml等其他形式提交给服务器端,然后等待服务器端给返回一个结果的方式(这个返回结果一般就是被修改之后的是否成功的状态,或者是修改后的最新数据table等). http请求,不论是get还是post请求,都会包含几个部分,分别是header,cookie,get会有
Python 使用指定的网卡发送HTTP请求的实例

需求: 一台机器上有多个网卡, 如何访问指定的 URL 时使用指定的网卡发送数据呢? $ curl --interface eth0 www.baidu.com # curl interface 可以指定网卡阅读 urllib.py 的源码, 追述到 open_http –> httplib.HTTP –> httplib.HTTP._connection_class = HTTPConnection HTTPConnection 在创建的时候会指定一个 source_address. HT
Python3处理HTTP请求的实例

Python3处理HTTP请求的包:http.client,urllib,urllib3,requests 其中,http 比较 low-level,一般不直接使用 urllib更 high-level一点,属于标准库.urllib3跟urllib类似,拥有一些重要特性而且易于使用,但是属于扩展库,需要安装 requests 基于urllib3 ,也不是标准库,但是使用非常方便个人感觉,如果非要用标准库,就使用urllib.如果没有限制,就用requests # import http.cli
解决Python发送Http请求时,中文乱码的问题

解决方法: 先encode再quote. 原理: msg.encode('utf-8')是解决中文乱码问题. quote():假如URL的 name 或者 value 值中有『&』.『%』或者『=』等符号,就会有问题.所以URL中的参数字符串也需要把『&=』等符号进行编码,quote()就是对参数字符串中的『&=%』等符号进行编码. 例子: # -*- coding: UTF-8 -*- # python2.7 from urllib import quote import req
对Python发送带header的http请求方法详解

简单的header import urllib2 request = urllib2.Request('http://example.com/') request.add_header('User-Agent', 'fake-client') response = urllib2.urlopen(request) print request.read() 包含较多元素的header import urllib,urllib2 url = 'http://example.com/' headers
Python使用指定端口进行http请求的例子

使用requests库 class SourcePortAdapter(HTTPAdapter): """"Transport adapter" that allows us to set the source port.""" def __init__(self, port, *args, **kwargs): self.poolmanager = None self._source_port = port super().
python用requests实现http请求代码实例

这篇文章主要介绍了python用requests实现http请求过程解析,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友可以参考下 1. get import requests # 最简单的get请求 r = requests.get(url) print(r.status_code) print(r.json()) # url 中?key=value&key=value r = requests.get(url, params=params) # for
Python爬虫实现HTTP网络请求多种实现方式

1.通过urllib.requests模块实现发送请求并读取网页内容的简单示例如下: #导入模块 import urllib.request #打开需要爬取的网页 response = urllib.request.urlopen('http://www.baidu.com') #读取网页代码 html = response.read() #打印读取的内容 print(html) 结果: b'<!DOCTYPE html>\n\n\n \n \n &
Python爬虫基础讲解之请求

一.请求目标(URL) URL又叫作统一资源定位符,是用于完整地描述Internet上网页和其他资源的地址的一种方法.类似于windows的文件路径. 二.网址的组成: 1.http://:这个是协议,也就是HTTP超文本传输协议,也就是网页在网上传输的协议. 2.mail:这个是服务器名,代表着是一个邮箱服务器,所以是mail. 3.163.com:这个是域名,是用来定位网站的独一无二的名字. 4.mail.163.com:这个是网站名,由服务器名+域名组成. 5./:这个是根目录,也就是说,
使用Python爬虫库requests发送请求、传递URL参数、定制headers

首先我们先引入requests模块 import requests 一.发送请求 r = requests.get('https://api.github.com/events') # GET请求 r = requests.post('http://httpbin.org/post', data = {'key':'value'}) # POST请求 r = requests.put('http://httpbin.org/put', data = {'key':'value'}) # PUT请
python 包实现 urllib 网络请求操作

目录一.简介二.发起请求三.携带参数请求四.获取响应数据五.设置headers 六.使用代理七.认证登录八.设置cookie 九.异常处理十.HTTP异常十一.超时异常十二.解析编码十三.参数拼接十四.请求链接解析十五.拼接链接十六.字典转换参数一.简介是一个 python 内置包,不需要额外安装即可使用 urllib 是 Python 标准库中用于网络请求的库,内置四个模块,分别是 urllib.request:用来打开和读取 url,可以用它来模拟发送请求,获
详解Python 重学requests发起请求的基本方式

安装相关模块 pip install requests requests-toolbelt 代码实例 import requests import json from PIL import Image from io import BytesIO from requests_toolbelt import MultipartEncoder ''' 使用 requests 请求返回的 response 注意事项 response.text 获得响应结果的字符串类型 response.content
快速一键生成Python爬虫请求头

今天介绍个神奇的网站!堪称爬虫偷懒的神器! 我们在写爬虫,构建网络请求的时候,不可避免地要添加请求头( headers ),以 mdn 学习区为例,我们的请求头是这样的: 一般来说,我们只要添加 user-agent 就能满足绝大部分需求了,Python 代码如下: import requests headers = { #'authority': 'developer.mozilla.org', #'pragma': 'no-cache', #'cache-control': 'no-cach
python爬虫之请求模块urllib的基本使用

目录前言 urllib的子模块 HttpResponse常用方法与属性获取信息 urlli.parse的使用(一般用于处理带中文的url) 爬取baidu官网HTML源代码添加请求头信息(重构user_agent) 扩展知识 with open和open两者的区别总结前言在实现网络爬虫的爬取工作时,就必须使用网络请求,只有进行了网络请求才可以对响应结果中的数据进行提取,urllib模块是python自带的网络请求模块,无需安装,导入即可使用.下面将介绍如果使用python中的urlli
python爬虫基本知识

爬虫简介根据百度百科定义:网络爬虫(又被称为网页蜘蛛,网络机器人,在FOAF社区中间,更经常的称为网页追逐者),是一种按照一定的规则,自动地抓取万维网信息的程序或者脚本.另外一些不常使用的名字还有蚂蚁.自动索引.模拟程序或者蠕虫. 随着大数据的不断发展,爬虫这个技术慢慢走入人们的视野,可以说爬虫是大数据应运而生的产物,至少我解除了大数据才了解到爬虫这一技术随着数据的海量增长,我们需要在互联网上选取所需要的数据进行自己研究的分析和实验.这就用到了爬虫这一技术,下面就跟着小编一起初遇python
Python爬虫爬取疫情数据并可视化展示

目录知识点开发环境爬虫完整代码导入模块分析网站发送请求获取数据解析数据保存数据数据可视化导入模块读取数据死亡率与治愈率各地区确诊人数与死亡人数情况知识点爬虫基本流程 json requests 爬虫当中发送网络请求 pandas 表格处理 / 保存数据 pyecharts 可视化开发环境 python 3.8 比较稳定版本解释器发行版 anaconda jupyter notebook 里面写数据分析代码专业性 pycharm 专业代码编辑器按照年份与月
你会使用python爬虫抓取弹幕吗

目录前言一.爬虫是什么? 二.饲养步骤 1.请求弹幕 2.解析弹幕 3.存储弹幕 4.总代码三.总结前言时隔108天,何同学在B站发布了最新的视频,<[何同学]我用108天开了个灯…>.那么就让我们用爬虫,爬取视频的弹幕,看看小伙伴们是怎么评价的吧一.爬虫是什么? 百度百科这样说:自动获取网页内容的程序.在我理解看来,爬虫就是~~“在网络上爬来爬去的…”住口!~~那么接下来就让我们看看如何养搬运B站弹幕的“虫”吧二.饲养步骤 1.请求弹幕首先,得知道爬取的网站url是什么.对于

Python爬虫实现HTTP网络请求多种实现方式

相关推荐

随机推荐