Python in Practice: Crawling Weibo Hot Search with the Scrapy Framework

2025-03-31 10:34:52

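Preface: I wrote this about a year ago. I ran it again recently, found that it still works, and decided to share it for anyone who wants to learn from it. I no longer remember many details of the code, but I have tried to clean it up.
Since this is Weibo, the anti-crawling measures are quite thorough. Explaining how to get around them would take several articles of its own (dynamically loaded pages, AJAX requests, token keys and so on; the second-level comments in particular are buried deep, and I remember it took me a long time to get them). So let's go straight to the code.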

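Main features:
0. As you would expect, it gets past the various anti-crawling measures.
1. Crawls the main content of every hot search entry.
2. Crawls the Weibo posts related to each hot search entry.
3. Crawls the comments on each related post, along with details about the commenting users.
4. Implements automatic pagination. In theory it can capture any detail related to a hot search entry, but the data volume is fairly large, so I recommend improving this crawler by storing the data in a database (I had not learned databases at the time, so I stored everything locally in a fixed text format).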

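(Not implemented):
Building a social network from the crawled data. The crawled users could be fed into a Python data analysis that links them into a social network.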
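A minimal sketch of how this unimplemented idea might start, using only the standard library. It assumes each crawled record has an 'author' key and a 'user' key holding [username, time, content] lists, which is the shape the spider stores. The function name build_comment_graph and the rule "a commenter is linked to the post author" are illustrative assumptions, not part of the original project.

```python
from collections import defaultdict

def build_comment_graph(items):
    """Link each post author to the users who commented on the post.

    `items` is an iterable of dicts with an 'author' key and a 'user' key
    holding [username, time, content] lists (assumed input shape).
    """
    graph = defaultdict(set)
    for item in items:
        author = item["author"]
        for username, _time, _content in item["user"]:
            if username == author:
                continue  # ignore authors commenting on their own post
            # Undirected edge: store it in both directions
            graph[author].add(username)
            graph[username].add(author)
    return graph

# Made-up sample data, only to show the input shape
sample = [
    {"author": "alice", "user": [["bob", "10:00", "hi"], ["carol", "10:05", "ok"]]},
    {"author": "bob", "user": [["carol", "11:00", "yes"], ["bob", "11:01", "me"]]},
]
graph = build_comment_graph(sample)
print(sorted(graph["carol"]))
```

A plain dict of sets is enough to count neighbours or find shared commenters; a graph library could replace it later if real analysis is needed.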

Project structure:

weibo.py

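Crawls the required data. The callbacks parse each response and fill in the item, which is then handed to the pipeline for processing such as persisting the data.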

import scrapy
from copy import deepcopy
from time import sleep
import json
from lxml import etree
import re

class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    start_urls = ['https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6']
    home_page = "https://s.weibo.com/"
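    # Send the first request with cookies attached
    def start_requests(self):
        cookies = ""  # paste the Cookie header of a logged-in session here
        # Split on the first "=" only, because cookie values may contain "=";
        # skipping fragments without "=" avoids an IndexError on an empty string
        cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ") if "=" in i}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )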

    # Parse the hot search list and the link of each entry
    def parse(self, response, **kwargs):
        item = {}
        # Skip the header row of the table
        tr = response.xpath('//*[@id="pl_top_realtimehot"]/table//tr')[1:]
        for t in tr:
            texts = t.xpath('./td[2]//text()').extract()
            href = t.xpath('./td[2]//@href').extract_first()
            # Skip rows that lack a title or a link
            if len(texts) < 2 or href is None:
                continue
            item['title'] = texts[1]
            print('title : ', item['title'])
            # urljoin avoids the double slash that plain concatenation produces
            detail_url = response.urljoin(href)
            item['href'] = detail_url
            print("href:", item['href'])
            # No time.sleep() here: it would block Scrapy's event loop.
            # Throttle with DOWNLOAD_DELAY in the project settings instead.
            yield scrapy.Request(detail_url, callback=self.parse_item, meta={'item': deepcopy(item)})

    # Parse the posts listed on the first page of each hot search entry
    def parse_item(self, response, **kwargs):
        item = response.meta['item']
        div_list = response.xpath('//div[@id="pl_feedlist_index"]//div[@class="card-wrap"]')[1:]
        for div in div_list:
            author = div.xpath('.//div[@class="info"]/div[2]/a/@nick-name').extract_first()
            brief_con = div.xpath('.//p[@node-type="feed_list_content_full"]//text()').extract()
            # extract() returns a list, never None, so test for emptiness
            if not brief_con:
                brief_con = div.xpath('.//p[@class="txt"]//text()').extract()
            brief_con = ''.join(brief_con)
            print("brief_con : ", brief_con)
            link = div.xpath('.//p[@class="from"]/a/@href').extract_first()
            news_id = div.xpath('./@mid').extract_first()

            # Skip cards that are not ordinary posts
            if author is None or link is None or news_id is None:
                continue
            link = "https:" + link + '_&type=comment'
            print("news_id : ", news_id)
            news_time = div.xpath(".//p[@class='from']/a/text()").extract()
            news_time = ''.join(news_time)
            print("news_time:", news_time)
            print("author:", author)
            item['author'] = author
            item['news_id'] = news_id
            item['news_time'] = news_time
            item['brief_con'] = brief_con
            item['details_url'] = link
            # JSON URL template: https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4577307216321742&from=singleWeiBo
            link = "https://weibo.com/aj/v6/comment/big?ajwvr=6&id=" + news_id + "&from=singleWeiBo"

            yield scrapy.Request(link, callback=self.parse_detail, meta={'item': deepcopy(item)})

    # Parse the details and comments of each post
    # Detail page: https://weibo.com/1649173367/JwjbPDW00?refer_flag=1001030103__&type=comment
    # JSON payload: https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4577307216321742&from=singleWeiBo&__rnd=1606879908312
    def parse_detail(self, response, **kwargs):
        item = response.meta['item']
        # The comment list arrives as an HTML fragment inside the JSON payload
        html = json.loads(response.text)['data']['html']
        tree = etree.HTML(html)
        div_lists = tree.xpath('.//div[@class="list_con"]')
        final_lists = []

        for div in div_lists:
            names = div.xpath('./div[@class="WB_text"]/a[1]/text()')
            times = div.xpath('.//div[@class="WB_from S_txt2"]/text()')
            # Skip malformed comment blocks instead of raising IndexError
            if not names or not times:
                continue
            username = names[0]
            usertime = times[0]
            usercontent = ''.join(div.xpath('./div[@class="WB_text"]/text()'))
            item['username'] = username
            item['usertime'] = usertime
            item['usercontent'] = usercontent
            final_lists.append([username, usertime, usercontent])

        # One item per post, carrying every comment as [username, time, content]
        item['user'] = final_lists
        yield item
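The cookie handling in start_requests can be pulled out into a small helper. This is only a sketch, assuming the cookie string is copied from the browser's Cookie request header of a logged-in session; parse_cookie_header is a hypothetical name, not a Scrapy API.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # values may themselves contain '='
        cookies[name] = value
    return cookies

# Placeholder values, not real Weibo cookies
print(parse_cookie_header("SUB=abc==; SUBP=xyz"))
print(parse_cookie_header(""))
```

The resulting dict can be passed as the cookies argument of scrapy.Request. An empty string yields an empty dict rather than an exception.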

items.py

Defines the fields produced by the parsing step; the item is then handed to the pipeline.

import scrapy

class WeiboproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
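    # Hot search title
    title = scrapy.Field()
    # Hot search link
    href = scrapy.Field()

    # Author of each post related to the hot search
    author = scrapy.Field()
    # Time each related post was published
    news_time = scrapy.Field()
    # Content of each related post
    brief_con = scrapy.Field()
    # Detail page link of each related post
    details_url = scrapy.Field()
    # Detail page ID, required for fetching the JSON
    news_id = scrapy.Field()

    # Comment author on the post's detail page
    username = scrapy.Field()
    # Comment time on the post's detail page
    usertime = scrapy.Field()
    # Comment content on the post's detail page
    usercontent = scrapy.Field()

    # All comments together with their authors
    user = scrapy.Field()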

middlewares.py

Middleware that sits between the spider and the server and handles the traffic between them.

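import random
# Custom downloader middleware for Weibo requests
class WeiboproDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Set the cookies
        cookies = ""  # paste the Cookie header of a logged-in session here
        # Split on the first "=" only and skip fragments without "=",
        # so an empty string no longer raises IndexError
        cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ") if "=" in i}
        # Only override the request's cookies when a cookie string is configured
        if cookies:
            request.cookies = cookies
        # Pick a random User-Agent for every request
        ua = random.choice(spider.settings.get("USER_AGENT_LIST"))
        request.headers["User-Agent"] = ua
        return None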
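As a possible variation, the User-Agent pool can be read once when Scrapy builds the middleware instead of on every request. The sketch below assumes the USER_AGENT_LIST setting from this project; the class name RandomUserAgentMiddleware is hypothetical. from_crawler and settings.getlist are standard Scrapy hooks, and the class needs no Scrapy import of its own.

```python
import random

class RandomUserAgentMiddleware:
    """Hypothetical variant that loads the User-Agent pool once at start-up."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler when it instantiates the middleware
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Leave the header untouched if the pool is empty
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let the request continue through the middleware chain
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES in the same way as WeiboproDownloaderMiddleware.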

pipelines.py

import re

class WeiboproPipeline:
    def open_spider(self, spider):
        print("starting...")

    def process_item(self, item, spider):
        title = item['title']
        href = item['href']
        author = item['author']
        news_time = item['news_time']
        brief_con = item['brief_con']
        details_url = item['details_url']
        news_id = item['news_id']
        user = item['user']
        # Replace characters that are not allowed in file names
        filepath = './' + re.sub(r'[\\/:*?"<>|]', '_', title) + '.txt'
        # The with block closes the file on exit, so no explicit close() is needed
        with open(filepath, 'a', encoding='utf-8') as fp:
            fp.write('title:\n' + title + '\n' + 'href:\n' + href + '\n')
            fp.write('author:\n' + author + '\n' + 'news_time:\n' + news_time + '\n')
            fp.write('brief_con:\n' + brief_con + '\n' + 'details_url:\n' + details_url + '\n')
            fp.write('news_id:\n' + news_id + '\n')
            for u in user:
                fp.write('username:' + u[0] + '\n' + u[1] + '\n' + 'usercontent:\n' + u[2] + '\n\n\n')
            fp.write('---------------------------------------------------------\n')
        return item
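Since the preface recommends a database for this volume of data, here is a minimal sketch of what a database-backed pipeline could look like with the standard-library sqlite3 module. The class name SqlitePipeline, the file name weibo.db and the table layout are all assumptions made for illustration; the original project only writes text files.

```python
import sqlite3

class SqlitePipeline:
    """Hypothetical pipeline that stores one row per comment in SQLite."""

    def __init__(self, db_path="weibo.db"):
        self.db_path = db_path  # assumed file name
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS comments ("
            "title TEXT, news_id TEXT, author TEXT, "
            "username TEXT, usertime TEXT, usercontent TEXT)"
        )

    def process_item(self, item, spider):
        rows = [
            (item["title"], item["news_id"], item["author"], u[0], u[1], u[2])
            for u in item["user"]
        ]
        # Parameterized queries keep quotes in comment text from breaking the SQL
        self.conn.executemany(
            "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)", rows
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

It would be enabled by adding the class to ITEM_PIPELINES. Because process_item returns the item, it can run alongside the text-file pipeline.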

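settings.py

Configures the spider. The browser User-Agent strings are added here, and this is also where the number of concurrent requests, the crawl rate and so on are set, which makes the spider harder for anti-crawling measures to catch.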

# Scrapy settings for weiboPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'weiboPro'

SPIDER_MODULES = ['weiboPro.spiders']
NEWSPIDER_MODULE = 'weiboPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weiboPro (+http://www.yourdomain.com)'
MEDIA_ALLOW_REDIRECTS = True
USER_AGENT_LIST = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        # Firefox
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        # Safari
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        # chrome
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        # 360
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        # Taobao Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        # Cheetah Browser (LBBROWSER)
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        # QQ Browser
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        # Sogou Browser
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        # Maxthon
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        # UC Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36"
              ]
LOG_LEVEL = 'ERROR'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'weiboPro.middlewares.WeiboproSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'weiboPro.middlewares.WeiboproDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'weiboPro.pipelines.WeiboproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
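The crawl rate mentioned in the description of this file is left commented out in the listing. The fragment below is a sketch of how the throttling settings might be switched on; the specific numbers are illustrative assumptions, not values taken from the original project.

```python
# Wait about 3 seconds between requests to the same site
DOWNLOAD_DELAY = 3
# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True
# Keep the number of parallel requests per domain low
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

Setting the delay here lets Scrapy's scheduler pace the requests, which works better than calling time.sleep() inside a callback because it does not block the other requests in flight.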

scrapy.cfg

The project configuration file; there is nothing much to say about it.

[settings]
default = weiboPro.settings

[deploy]
#url = http://localhost:6800/
project = weiboPro

The two remaining __init__ files can be left empty; they are not used.
