A detailed guide to using Python's urllib library

2025-02-03 07:50:36

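Overview: urllib is Python's built-in HTTP request library. This article covers three of its modules: the request module urllib.request, the exception-handling module urllib.error, and the URL-parsing module urllib.parse.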

1. Request module: urllib.request

Python 2:

import urllib2
response = urllib2.urlopen('http://httpbin.org/robots.txt')

Python 3:

import urllib.request
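res = urllib.request.urlopen('http://httpbin.org/robots.txt')

The signature of urlopen() in Python 3 is shown below. The cafile, capath and cadefault parameters were deprecated in Python 3.6 and removed in Python 3.13; use context instead.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The url argument of urlopen() can be either a string or a Request object.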

# url can be a string
import urllib.request

resp = urllib.request.urlopen('http://www.baidu.com')
print(resp.read().decode('utf-8'))  # read() returns the response body as bytes, which must be decoded into a string

# url can also be a Request object
import urllib.request

request = urllib.request.Request('http://httpbin.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

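The data parameter: POST requests. When data is supplied it must be bytes, and the request is sent as a POST instead of a GET: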

# coding:utf8
import urllib.request, urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
resp = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(resp.read())
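
To see how the data argument changes the request method without touching the network, here is a minimal sketch. The helper name build_post is hypothetical and chosen only for illustration; Request.get_method() and Request.data are part of urllib.request.

```python
import urllib.parse
import urllib.request


def build_post(url, fields):
    """Build (but do not send) a form-encoded POST request."""
    # urlencode() returns str; the request body must be bytes.
    data = urllib.parse.urlencode(fields).encode('utf-8')
    return urllib.request.Request(url, data=data)


get_req = urllib.request.Request('http://httpbin.org/get')
post_req = build_post('http://httpbin.org/post', {'word': 'hello'})

print(get_req.get_method())   # GET: no body was given
print(post_req.get_method())  # POST: a body is present
print(post_req.data)          # b'word=hello'
```

Passing either request object to urlopen() would then send it.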

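The timeout parameter of urlopen() sets the request timeout in seconds:

# coding:utf8
# Set the request timeout
import urllib.request

# 0.1 s is deliberately short; if the server is slower than that, urlopen() raises an error
resp = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(resp.read().decode('utf-8'))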

Response type:

# coding:utf8
# Response type
import urllib.request

resp = urllib.request.urlopen('http://httpbin.org/get')
print(type(resp))  # <class 'http.client.HTTPResponse'>

Response status code and headers:

# coding:utf8
# Response status code and headers
import urllib.request

resp = urllib.request.urlopen('http://www.baidu.com')
print(resp.status)
print(resp.getheaders())  # a list of (name, value) tuples
print(resp.getheader('Server'))  # the header name is case-insensitive

200
[('Bdpagetype', '1'), ('Bdqid', '0xa6d873bb003836ce'), ('Cache-Control', 'private'), ('Content-Type', 'text/html'), ('Cxy_all', 'baidu+b8704ff7c06fb8466a83df26d7f0ad23'), ('Date', 'Sun, 21 Apr 2019 15:18:24 GMT'), ('Expires', 'Sun, 21 Apr 2019 15:18:03 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=8C61C3A67C1281B5952199E456EEC61E:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=8C61C3A67C1281B5952199E456EEC61E; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1555859904; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'delPer=0; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1452_28777_21078_28775_28722_28557_28838_28584_28604; path=/; domain=.baidu.com'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1
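
The sample output above contains several Set-Cookie headers. For an HTTP response, resp.headers is an http.client.HTTPMessage, whose get_all() method returns every value of a repeated header. The sketch below builds such an object offline from a hypothetical raw header block (standing in for a real server reply) so that it runs without a network connection.

```python
import io
import http.client

# Hypothetical raw header block, standing in for a real server reply.
raw = (
    b'Server: BWS/1.1\r\n'
    b'Set-Cookie: BDSVRTM=0; path=/\r\n'
    b'Set-Cookie: BD_HOME=0; path=/\r\n'
    b'\r\n'
)
headers = http.client.parse_headers(io.BytesIO(raw))

server = headers.get('server')           # lookup ignores case
cookies = headers.get_all('Set-Cookie')  # every value of a repeated header

print(server)   # BWS/1.1
print(cookies)  # ['BDSVRTM=0; path=/', 'BD_HOME=0; path=/']
```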

Using a proxy with urllib.request.ProxyHandler():

# coding:utf8
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
resp = opener.open('http://www.example.com/login.html')
print(resp.read())
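
A few related details, sketched offline: called without arguments, ProxyHandler() picks up proxies from the environment, while passing an empty dict disables them. The proxy address below is hypothetical and nothing is actually requested.

```python
import urllib.request

# Hypothetical proxy address, for illustration only.
handler = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:3128'})
opener = urllib.request.build_opener(handler)

# An empty dict turns off any proxies detected from the environment.
direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Proxies urllib would auto-detect (e.g. from http_proxy / https_proxy).
detected = urllib.request.getproxies()

print(handler.proxies)
print(detected)
```

To make an opener the default for every later urlopen() call, pass it to urllib.request.install_opener().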

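2. Exception-handling module: urllib.error

Exception handling, example 1. URLError is raised when the request cannot be completed, for example when the host name cannot be resolved: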

# coding:utf8
from urllib import error, request

try:
    resp = request.urlopen('http://www.blueflags.cn')
except error.URLError as e:
    print(e.reason)

Exception handling, example 2. HTTPError is a subclass of URLError, so it must be caught first:

# coding:utf8
from urllib import error, request

try:
    resp = request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('request succeeded')
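
Why the order of the except clauses matters can be checked offline. In this minimal sketch the function name describe is hypothetical, and the exceptions are constructed by hand rather than raised by a real request.

```python
from urllib import error


def describe(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return 'HTTP %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'connection failed: %s' % exc.reason
    return 'unexpected: %r' % (exc,)


# Exceptions built by hand so the sketch runs without a network.
not_found = error.HTTPError('http://httpbin.org/status/404', 404, 'Not Found', None, None)
unreachable = error.URLError('name resolution failed')

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(not_found))    # HTTP 404: Not Found
print(describe(unreachable))  # connection failed: name resolution failed
```

If the URLError branch came first, it would also swallow every HTTPError and the status code would never be reported.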

Exception handling, example 3. Detecting a timeout through the reason attribute:

# coding:utf8
import socket, urllib.request, urllib.error

try:
    resp = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('timed out')
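
A timeout does not always arrive wrapped in URLError: one that occurs while waiting for or reading the response may surface as a bare socket.timeout. A hedged sketch that handles both cases follows; the helper names is_timeout and fetch are hypothetical.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, socket.timeout):
        return True
    return (isinstance(exc, urllib.error.URLError)
            and isinstance(exc.reason, socket.timeout))


def fetch(url, timeout):
    """Return the body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # any other failure is passed on to the caller
```

A call such as fetch('http://www.baidu.com', timeout=0.01) would then return None on a timeout instead of raising.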

3. URL-parsing module: urllib.parse

parse.urlencode: encoding form data for a POST request:

# coding:utf8
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'Host': 'httpbin.org',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}
form = {'name': 'Thanlon'}  # named form to avoid shadowing the built-in dict
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Thanlon"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "12",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
  },
  "json": null,
  "origin": "117.136.78.194, 117.136.78.194",
  "url": "https://httpbin.org/post"
}
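
The body returned by httpbin is JSON, so it can be turned into Python objects with the standard json module. The sketch below uses a hard-coded sample, abridged from the output above, instead of a live response.

```python
import json

# Sample body in the shape shown above (abridged), not a live response.
body = b'{"args": {}, "form": {"name": "Thanlon"}, "json": null}'

reply = json.loads(body.decode('utf-8'))

print(reply['form']['name'])  # Thanlon
print(reply['json'])          # None (JSON null becomes Python None)
```

With a real response, body would be the bytes returned by resp.read().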

Adding a request header with the add_header method:

# coding:utf8
from urllib import request, parse

url = 'http://httpbin.org/post'
form = {'name': 'Thanlon'}  # named form to avoid shadowing the built-in dict
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))
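
One detail worth knowing when inspecting a Request afterwards: add_header() stores the header name capitalized (first letter upper case, the rest lower case), and has_header() / get_header() compare names exactly. A small offline sketch, with a hypothetical User-Agent value:

```python
from urllib import request

req = request.Request('http://httpbin.org/post', data=b'name=Thanlon')
req.add_header('User-Agent', 'demo-client/1.0')  # hypothetical UA string

print(req.header_items())            # [('User-agent', 'demo-client/1.0')]
print(req.has_header('User-agent'))  # True
print(req.has_header('User-Agent'))  # False: the stored key is 'User-agent'
print(req.get_header('User-agent'))  # demo-client/1.0
```

This only affects lookups on the Request object; HTTP header names themselves are case-insensitive on the wire.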

parse.urlparse splits a URL into six components (scheme, netloc, path, params, query, fragment):

# coding:utf8
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=1#comment')
print(type(result))
print(result)

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=1', fragment='comment')

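# The scheme argument is only a default, used when the URL carries no scheme of its own.
# Without a leading //, the host name is parsed as part of the path.
from urllib.parse import urlparse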

result = urlparse('www.baidu.com/index.html;user?id=1#comment', scheme='https')
print(type(result))
print(result)

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=1', fragment='comment')

# coding:utf8
# A scheme already present in the URL takes precedence over the scheme argument.
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=1#comment', scheme='https')
print(result)

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=1', fragment='comment')

# coding:utf8
# With allow_fragments=False the fragment is not split off; it stays in the preceding component.
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=1#comment', allow_fragments=False)
print(result)

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=1#comment', fragment='')
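
The parsed result offers more than its printed form. The sketch below reads individual components and decodes the query string with parse_qs; the port number and the second id value are additions made here for illustration and do not appear in the examples above.

```python
from urllib.parse import urlparse, parse_qs

result = urlparse('http://www.baidu.com:8080/index.html;user?id=1&id=2#comment')

print(result.hostname)         # www.baidu.com
print(result.port)             # 8080 (an int; None when the URL has no port)
print(result[2], result.path)  # index and attribute access give the same value
print(parse_qs(result.query))  # {'id': ['1', '2']}

# A leading // marks the start of the netloc, so the host is recognised
# even when the scheme comes from the default argument.
no_scheme = urlparse('//www.baidu.com/index.html', scheme='https')
print(no_scheme.scheme, no_scheme.netloc)  # https www.baidu.com
```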

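parse.urlunparse is the inverse of urlparse: it builds a URL from a sequence of six components:

# coding:utf8
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'name=Thanlon', 'comment']
print(urlunparse(data))  # http://www.baidu.com/index.html;user?name=Thanlon#comment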
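
Because ParseResult is a namedtuple, a common pattern is to parse a URL, swap out one or two components, and reassemble it. A minimal sketch:

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=1#comment'
parts = urlparse(url)

print(urlunparse(parts) == url)  # True: parsing then unparsing round-trips

# _replace() returns a modified copy; geturl() reassembles it.
https_parts = parts._replace(scheme='https', query='id=2', fragment='')
print(https_parts.geturl())  # https://www.baidu.com/index.html;user?id=2
```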

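parse.urljoin resolves a URL, which may be relative, against a base URL:

# coding:utf8
from urllib.parse import urljoin

print(urljoin('http://www.bai.com', 'index.html'))  # http://www.bai.com/index.html
# If the second argument is an absolute URL, it is returned and the base is ignored
print(urljoin('http://www.baicu.com', 'https://www.thanlon.cn/index.html'))  # https://www.thanlon.cn/index.html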
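
A few more cases show how the relative part is resolved. The base URL below is hypothetical, picked so that it has a directory component.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'  # hypothetical base with a directory

joined = [
    urljoin(base, 'c.html'),           # sibling of b.html
    urljoin(base, '/c.html'),          # absolute path replaces the whole path
    urljoin(base, '../c.html'),        # .. climbs one directory
    urljoin(base, '?page=2'),          # only the query changes
    urljoin(base, '//example.com/x'),  # new host, scheme inherited from base
]
for item in joined:
    print(item)
# http://www.baidu.com/a/c.html
# http://www.baidu.com/c.html
# http://www.baidu.com/c.html
# http://www.baidu.com/a/b.html?page=2
# http://example.com/x
```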

urlencode converts a dict into the query string of a GET request:

# coding:utf8
from urllib.parse import urlencode

params = {
    'name': 'Thanlon',
    'age': 22
}
base_url = 'http://www.thanlon.cn?'
url = base_url + urlencode(params)
print(url)  # http://www.thanlon.cn?name=Thanlon&age=22
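
urlencode also takes care of escaping. The sketch below covers spaces, non-ASCII text and list values, together with the related helpers quote, unquote and parse_qs; the sample keys and values are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs, quote, unquote

spaces = urlencode({'q': 'python urllib'})
non_ascii = urlencode({'wd': '你好'})
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)  # one pair per list item

print(spaces)     # q=python+urllib
print(non_ascii)  # wd=%E4%BD%A0%E5%A5%BD
print(repeated)   # tag=a&tag=b

print(quote('a b/c'))                   # a%20b/c ('/' is left alone by default)
print(unquote('%E4%BD%A0%E5%A5%BD'))    # 你好
print(parse_qs('name=Thanlon&age=22'))  # {'name': ['Thanlon'], 'age': ['22']}
```

Note that parse_qs returns every value as a list of strings, so the integer 22 from the earlier example comes back as '22'.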

4. Cookies

Getting cookies (used to keep login session state):

# coding:utf8
# Getting cookies (used to keep login session state)
import urllib.request, http.cookiejar

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)

Saving cookies in Mozilla format with MozillaCookieJar(filename):

# coding:utf8
# Save the cookies to cookie.txt
import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Saving cookies in LWP format with LWPCookieJar(filename):

# coding:utf8
import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

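Loading saved cookies and sending them with a request, to fetch content that requires a login. The file must have been written by LWPCookieJar, because the class used for loading has to match the format used when saving: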

# coding:utf8
import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
resp = opener.open('http://www.baidu.com')
print(resp.read().decode('utf-8'))
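
The two file formats are not interchangeable, which can be demonstrated offline. In this sketch the cookie is constructed by hand (the make_cookie helper and its values are hypothetical) and written to a temporary directory rather than fetched from a site.

```python
import os
import tempfile
import http.cookiejar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper, demo only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

saved = http.cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie('session', 'abc123', 'example.com'))
saved.save(ignore_discard=True, ignore_expires=True)

# Loading with the matching class works.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in loaded]
print(pairs)  # [('session', 'abc123')]

# Loading with the other class fails: the file header does not match.
try:
    http.cookiejar.LWPCookieJar().load(path, ignore_discard=True)
    mismatch_detected = False
except http.cookiejar.LoadError:
    mismatch_detected = True
print(mismatch_detected)  # True
```

ignore_discard=True is needed on both save and load here because a cookie without an expiry time is a session cookie and would otherwise be skipped.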

