scrapy爬虫完整实例

2025-03-30 03:18:12

本文主要通过实例介绍了scrapy框架的使用，分享了两个例子，爬豆瓣文本例程 douban 和图片例程 douban_imgs ，具体如下。

例程1： douban

目录树

douban
--douban
 --spiders
  --__init__.py
  --bookspider.py
  --douban_comment_spider.py
  --doumailspider.py
 --__init__.py
 --items.py
 --pipelines.py
 --settings.py
--scrapy.cfg

–spiders–init.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

bookspider.py

# -*- coding:utf-8 -*-
'''by sudo rm -rf http://imchenkun.com'''
import scrapy
from douban.items import DoubanBookItem

class BookSpider(scrapy.Spider):
  name = 'douban-book'
  allowed_domains = ['douban.com']
  start_urls = [
    'https://book.douban.com/top250'
  ]

  def parse(self, response):
    # 请求第一页
    yield scrapy.Request(response.url, callback=self.parse_next)

    # 请求其它页
    for page in response.xpath('//div[@class="paginator"]/a'):
      link = page.xpath('@href').extract()[0]
      yield scrapy.Request(link, callback=self.parse_next)

  def parse_next(self, response):
    for item in response.xpath('//tr[@class="item"]'):
      book = DoubanBookItem()
      book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
      book['content'] = item.xpath('td[2]/p/text()').extract()[0]
      book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
      yield book

douban_comment_spider.py

# -*- coding:utf-8 -*-
import scrapy
from faker import Factory
from douban.items import DoubanMovieCommentItem
import urlparse
f = Factory.create()

class MailSpider(scrapy.Spider):
  name = 'douban-comment'
  allowed_domains = ['accounts.douban.com', 'douban.com']
  start_urls = [
    'https://www.douban.com/'
  ]

  headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Host': 'accounts.douban.com',
    'User-Agent': f.user_agent()
  }

  formdata = {
    'form_email': '你的邮箱',
    'form_password': '你的密码',
    # 'captcha-solution': '',
    # 'captcha-id': '',
    'login': '登录',
    'redir': 'https://www.douban.com/',
    'source': 'None'
  }

  def start_requests(self):
    return [scrapy.Request(url='https://www.douban.com/accounts/login',
                headers=self.headers,
                meta={'cookiejar': 1},
                callback=self.parse_login)]

  def parse_login(self, response):
    # 如果有验证码要人为处理
    if 'captcha_image' in response.body:
      print 'Copy the link:'
      link = response.xpath('//img[@class="captcha_image"]/@src').extract()[0]
      print link
      captcha_solution = raw_input('captcha-solution:')
      captcha_id = urlparse.parse_qs(urlparse.urlparse(link).query, True)['id']
      self.formdata['captcha-solution'] = captcha_solution
      self.formdata['captcha-id'] = captcha_id
    return [scrapy.FormRequest.from_response(response,
                         formdata=self.formdata,
                         headers=self.headers,
                         meta={'cookiejar': response.meta['cookiejar']},
                         callback=self.after_login
                         )]

  def after_login(self, response):
    print response.status
    self.headers['Host'] = "www.douban.com"
    yield scrapy.Request(url='https://movie.douban.com/subject/22266320/reviews',
               meta={'cookiejar': response.meta['cookiejar']},
               headers=self.headers,
               callback=self.parse_comment_url)
    yield scrapy.Request(url='https://movie.douban.com/subject/22266320/reviews',
               meta={'cookiejar': response.meta['cookiejar']},
               headers=self.headers,
               callback=self.parse_next_page,
               dont_filter = True)  #不去重

  def parse_next_page(self, response):
    print response.status
    try:
      next_url = response.urljoin(response.xpath('//span[@class="next"]/a/@href').extract()[0])
      print "下一页"
      print next_url
      yield scrapy.Request(url=next_url,
               meta={'cookiejar': response.meta['cookiejar']},
               headers=self.headers,
               callback=self.parse_comment_url,
               dont_filter = True)
      yield scrapy.Request(url=next_url,
               meta={'cookiejar': response.meta['cookiejar']},
               headers=self.headers,
               callback=self.parse_next_page,
               dont_filter = True)
    except:
      print "Next page Error"
      return

  def parse_comment_url(self, response):
    print response.status
    for item in response.xpath('//div[@class="main review-item"]'):
      comment_url = item.xpath('header/h3[@class="title"]/a/@href').extract()[0]
      comment_title = item.xpath('header/h3[@class="title"]/a/text()').extract()[0]
      print comment_title
      print comment_url
      yield scrapy.Request(url=comment_url,
               meta={'cookiejar': response.meta['cookiejar']},
               headers=self.headers,
               callback=self.parse_comment)

  def parse_comment(self, response):
    print response.status
    for item in response.xpath('//div[@id="content"]'):
      comment = DoubanMovieCommentItem()
      comment['useful_num'] = item.xpath('//div[@class="main-panel-useful"]/button[1]/text()').extract()[0].strip()
      comment['no_help_num'] = item.xpath('//div[@class="main-panel-useful"]/button[2]/text()').extract()[0].strip()
      comment['people'] = item.xpath('//span[@property="v:reviewer"]/text()').extract()[0]
      comment['people_url'] = item.xpath('//header[@class="main-hd"]/a[1]/@href').extract()[0]
      comment['star'] = item.xpath('//header[@class="main-hd"]/span[1]/@title').extract()[0]

      data_type = item.xpath('//div[@id="link-report"]/div/@data-original').extract()[0]
      print "data_type: "+data_type
      if data_type == '0':
        comment['comment'] = "\t#####\t".join(map(lambda x:x.strip(), item.xpath('//div[@id="link-report"]/div/p/text()').extract()))
      elif data_type == '1':
        comment['comment'] = "\t#####\t".join(map(lambda x:x.strip(), item.xpath('//div[@id="link-report"]/div[1]/text()').extract()))
      comment['title'] = item.xpath('//span[@property="v:summary"]/text()').extract()[0]
      comment['comment_page_url'] = response.url
      #print comment
      yield comment

doumailspider.py

# -*- coding:utf-8 -*-
'''by sudo rm -rf http://imchenkun.com'''
import scrapy
from faker import Factory
from douban.items import DoubanMailItem
import urlparse
f = Factory.create()

class MailSpider(scrapy.Spider):
  name = 'douban-mail'
  allowed_domains = ['accounts.douban.com', 'douban.com']
  start_urls = [
    'https://www.douban.com/'
  ]

  headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Connection': 'keep-alive',
    'Host': 'accounts.douban.com',
    'User-Agent': f.user_agent()
  }

  formdata = {
    'form_email': '你的邮箱',
    'form_password': '你的密码',
    # 'captcha-solution': '',
    # 'captcha-id': '',
    'login': '登录',
    'redir': 'https://www.douban.com/',
    'source': 'None'
  }

  def start_requests(self):
    return [scrapy.Request(url='https://www.douban.com/accounts/login',
                headers=self.headers,
                meta={'cookiejar': 1},
                callback=self.parse_login)]

  def parse_login(self, response):
    # 如果有验证码要人为处理
    if 'captcha_image' in response.body:
      print 'Copy the link:'
      link = response.xpath('//img[@class="captcha_image"]/@src').extract()[0]
      print link
      captcha_solution = raw_input('captcha-solution:')
      captcha_id = urlparse.parse_qs(urlparse.urlparse(link).query, True)['id']
      self.formdata['captcha-solution'] = captcha_solution
      self.formdata['captcha-id'] = captcha_id
    return [scrapy.FormRequest.from_response(response,
                         formdata=self.formdata,
                         headers=self.headers,
                         meta={'cookiejar': response.meta['cookiejar']},
                         callback=self.after_login
                         )]

  def after_login(self, response):
    print response.status
    self.headers['Host'] = "www.douban.com"
    return scrapy.Request(url='https://www.douban.com/doumail/',
               meta={'cookiejar': response.meta['cookiejar']},
               headers=self.headers,
               callback=self.parse_mail)

  def parse_mail(self, response):
    print response.status
    for item in response.xpath('//div[@class="doumail-list"]/ul/li'):
      mail = DoubanMailItem()
      mail['sender_time'] = item.xpath('div[2]/div/span[1]/text()').extract()[0]
      mail['sender_from'] = item.xpath('div[2]/div/span[2]/text()').extract()[0]
      mail['url'] = item.xpath('div[2]/p/a/@href').extract()[0]
      mail['title'] = item.xpath('div[2]/p/a/text()').extract()[0]
      print mail
      yield mail

init.py

(此文件内无代码)

items.py

# -*- coding: utf-8 -*-
import scrapy

class DoubanBookItem(scrapy.Item):
  name = scrapy.Field()      # 书名
  price = scrapy.Field()      # 价格
  edition_year = scrapy.Field()  # 出版年份
  publisher = scrapy.Field()    # 出版社
  ratings = scrapy.Field()     # 评分
  author = scrapy.Field()     # 作者
  content = scrapy.Field()

class DoubanMailItem(scrapy.Item):
  sender_time = scrapy.Field()   # 发送时间
  sender_from = scrapy.Field()   # 发送人
  url = scrapy.Field()       # 豆邮详细地址
  title = scrapy.Field()      # 豆邮标题

class DoubanMovieCommentItem(scrapy.Item):
  useful_num = scrapy.Field()   # 多少人评论有用
  no_help_num = scrapy.Field()   # 多少人评论无用
  people = scrapy.Field()     # 评论者
  people_url = scrapy.Field()   # 评论者页面
  star = scrapy.Field()      # 评分
  comment = scrapy.Field()     # 评论
  title = scrapy.Field()      # 标题
  comment_page_url = scrapy.Field()# 当前页

pipelines.py

# -*- coding: utf-8 -*-

class DoubanBookPipeline(object):
  def process_item(self, item, spider):
    info = item['content'].split(' / ') # [法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元
    item['name'] = item['name']
    item['price'] = info[-1]
    item['edition_year'] = info[-2]
    item['publisher'] = info[-3]
    return item

class DoubanMailPipeline(object):
  def process_item(self, item, spider):
    item['title'] = item['title'].replace(' ', '').replace('\\n', '')
    return item

class DoubanMovieCommentPipeline(object):
  def process_item(self, item, spider):
    return item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   http://doc.scrapy.org/en/latest/topics/settings.html
#   http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#   http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Host': 'book.douban.com',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
  'Accept-Encoding': 'gzip, deflate, br',
  'Connection': 'keep-alive',
}
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'douban.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'douban.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  #'douban.pipelines.DoubanBookPipeline': 300,
  #'douban.pipelines.DoubanMailPipeline': 600,
  'douban.pipelines.DoubanMovieCommentPipeline': 900,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = douban.settings

[deploy]
#url = http://localhost:6800/
project = douban

例程2： douban_imgs

目录树

douban_imgs
--douban
 --spiders
  --__init__.py
  --download_douban.py
 --__init__.py
 --items.py
 --pipelines.py
 --run_spider.py
 --settings.py
--scrapy.cfg

–spiders–init.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

download_douban.py

# coding=utf-8
from scrapy.spiders import Spider
import re
from scrapy import Request
from douban_imgs.items import DoubanImgsItem

class download_douban(Spider):
  name = 'download_douban'

  default_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def __init__(self, url='1638835355', *args, **kwargs):
    self.allowed_domains = ['douban.com']
    self.start_urls = [
      'http://www.douban.com/photos/album/%s/' % (url)]
    self.url = url
    # call the father base function

    #super(download_douban, self).__init__(*args, **kwargs)

  def start_requests(self):

    for url in self.start_urls:
      yield Request(url=url, headers=self.default_headers, callback=self.parse)

  def parse(self, response):
    list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
    if list_imgs:
      item = DoubanImgsItem()
      item['image_urls'] = list_imgs
      yield item

init.py

(此文件内无代码)

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Item, Field

class DoubanImgsItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  image_urls = Field()
  images = Field()
  image_paths = Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
from scrapy import log

class DoubanImgsPipeline(object):
  def process_item(self, item, spider):
    return item

class DoubanImgDownloadPipeline(ImagesPipeline):
  default_headers = {
    'accept': 'image/webp,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'cookie': 'bid=yQdC/AzTaCw',
    'referer': 'https://www.douban.com/photos/photo/2370443040/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
      self.default_headers['referer'] = image_url
      yield Request(image_url, headers=self.default_headers)

  def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
      raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item

run_spider.py

from scrapy import cmdline
cmd_str = 'scrapy crawl download_douban'
cmdline.execute(cmd_str.split(' '))

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban_imgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   http://doc.scrapy.org/en/latest/topics/settings.html
#   http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#   http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban_imgs'

SPIDER_MODULES = ['douban_imgs.spiders']
NEWSPIDER_MODULE = 'douban_imgs.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'douban_imgs (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN=16
# CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
# COOKIES_ENABLED=False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED=False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#  'douban_imgs.middlewares.MyCustomSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#  'douban_imgs.middlewares.MyCustomDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#  'scrapy.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'douban_imgs.pipelines.DoubanImgDownloadPipeline': 300,
}

IMAGES_STORE = 'D:\\doubanimgs'
#IMAGES_STORE = '/tmp'

IMAGES_EXPIRES = 90

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
# AUTOTHROTTLE_ENABLED=True
# The initial download delay
# AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED=True
# HTTPCACHE_EXPIRATION_SECS=0
# HTTPCACHE_DIR='httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES=[]
# HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = douban_imgs.settings

[deploy]
#url = http://localhost:6800/
project = douban_imgs

总结

以上就是本文关于scrapy爬虫完整实例的全部内容，希望对大家有所帮助。感兴趣的朋友可以继续参阅本站其他相关专题，如有不足之处，欢迎留言指出。感谢朋友们对本站的支持！

您可能感兴趣的文章:

Python之Scrapy爬虫框架安装及简单使用详解
Python抓取框架Scrapy爬虫入门：页面提取
Python中Scrapy爬虫图片处理详解
Python之Scrapy爬虫框架安装及使用详解
python爬虫框架scrapy实战之爬取京东商城进阶篇
Python的爬虫框架scrapy用21行代码写一个爬虫
Python的爬虫程序编写框架Scrapy入门学习教程
讲解Python的Scrapy爬虫框架使用代理进行采集的方法

python爬虫框架scrapy实战之爬取京东商城进阶篇

前言之前的一篇文章已经讲过怎样获取链接,怎样获得参数了,详情请看python爬取京东商城普通篇,本文将详细介绍利用python爬虫框架scrapy如何爬取京东商城,下面话不多说了,来看看详细的介绍吧. 代码详解 1.首先应该构造请求,这里使用scrapy.Request,这个方法默认调用的是start_urls构造请求,如果要改变默认的请求,那么必须重载该方法,这个方法的返回值必须是一个可迭代的对象,一般是用yield返回. 代码如下: def start_requests(self): fo
Python之Scrapy爬虫框架安装及简单使用详解

题记:早已听闻python爬虫框架的大名.近些天学习了下其中的Scrapy爬虫框架,将自己理解的跟大家分享.有表述不当之处,望大神们斧正. 一.初窥Scrapy Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架. 可以应用在包括数据挖掘,信息处理或存储历史数据等一系列的程序中. 其最初是为了页面抓取(更确切来说,网络抓取)所设计的, 也可以应用在获取API所返回的数据(例如Amazon Associates Web Services) 或者通用的网络爬虫. 本文档将通过介绍Sc
讲解Python的Scrapy爬虫框架使用代理进行采集的方法

1.在Scrapy工程下新建"middlewares.py" # Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication import base64 # Start your middleware class class ProxyMiddleware(object): # overwrite process
Python的爬虫框架scrapy用21行代码写一个爬虫

开发说明开发环境:Pycharm 2017.1(目前最新) 开发框架:Scrapy 1.3.3(目前最新) 目标爬取线报网站,并把内容保存到items.json里页面分析根据上图我们可以发现内容都在类为post这个div里下面放出post的代码 <div class="post">  <div class=
Python抓取框架Scrapy爬虫入门：页面提取

前言 Scrapy是一个非常好的抓取框架,它不仅提供了一些开箱可用的基础组建,还能够根据自己的需求,进行强大的自定义.本文主要给大家介绍了关于Python抓取框架Scrapy之页面提取的相关内容,分享出来供大家参考学习,下面随着小编来一起学习学习吧. 在开始之前,关于scrapy框架的入门大家可以参考这篇文章:http://www.jb51.net/article/87820.htm 下面创建一个爬虫项目,以图虫网为例抓取图片. 一.内容分析打开图虫网,顶部菜单"发现" "
Python之Scrapy爬虫框架安装及使用详解

题记:早已听闻python爬虫框架的大名.近些天学习了下其中的Scrapy爬虫框架,将自己理解的跟大家分享.有表述不当之处,望大神们斧正. 一.初窥Scrapy Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架. 可以应用在包括数据挖掘,信息处理或存储历史数据等一系列的程序中. 其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的, 也可以应用在获取API所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫. 本文档将
Python的爬虫程序编写框架Scrapy入门学习教程

1. Scrapy简介 Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架. 可以应用在包括数据挖掘,信息处理或存储历史数据等一系列的程序中. 其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的, 也可以应用在获取API所返回的数据(例如 Amazon Associates Web Services ) 或者通用的网络爬虫.Scrapy用途广泛,可以用于数据挖掘.监测和自动化测试 Scrapy 使用了 Twisted异步网络库来处理网络通讯.整体架构大致如下 Scrapy
Python中Scrapy爬虫图片处理详解

下载图片下载图片有两种方式,一种是通过 Requests 模块发送 get 请求下载,另一种是使用 Scrapy 的 ImagesPipeline 图片管道类,这里主要讲后者. 安装 Scrapy 时并没有安装图像处理依赖包 Pillow,需手动安装否则运行爬虫出错. 首先在 settings.py 中设置图片的存储路径: IMAGES_STORE = 'D:/' 图片处理相关的选项还有: # 图片最小高度和宽度设置,可以过滤太小的图片 IMAGES_MIN_HEIGHT = 110 IMAG
scrapy爬虫完整实例

本文主要通过实例介绍了scrapy框架的使用,分享了两个例子,爬豆瓣文本例程 douban 和图片例程 douban_imgs ,具体如下. 例程1: douban 目录树 douban --douban --spiders --__init__.py --bookspider.py --douban_comment_spider.py --doumailspider.py --__init__.py --items.py --pipelines.py --settings.py --scrap
10个python爬虫入门基础代码实例 + 1个简单的python爬虫完整实例

本文主要涉及python爬虫知识点: web是如何交互的 requests库的get.post函数的应用 response对象的相关函数,属性 python文件的打开,保存代码中给出了注释,并且可以直接运行哦如何安装requests库(安装好python的朋友可以直接参考,没有的,建议先装一哈python环境) windows用户,Linux用户几乎一样: 打开cmd输入以下命令即可,如果python的环境在C盘的目录,会提示权限不够,只需以管理员方式运行cmd窗口 pip install
Python 微信爬虫完整实例【单线程与多线程】

本文实例讲述了Python 实现的微信爬虫.分享给大家供大家参考,具体如下: 单线程版: import urllib.request import urllib.parse import urllib.error import re,time headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3107.
Scrapy爬虫实例讲解_校花网

学习爬虫有一段时间了,今天使用Scrapy框架将校花网的图片爬取到本地.Scrapy爬虫框架相对于使用requests库进行网页的爬取,拥有更高的性能. Scrapy官方定义:Scrapy是用于抓取网站并提取结构化数据的应用程序框架,可用于广泛的有用应用程序,如数据挖掘,信息处理或历史存档. 建立Scrapy爬虫工程在安装好Scrapy框架后,直接使用命令行进行项目的创建: E:\ScrapyDemo>scrapy startproject xiaohuar New Scrapy projec
scrapy爬虫实例分享

前一篇文章介绍了很多关于scrapy的进阶知识,不过说归说,只有在实际应用中才能真正用到这些知识.所以这篇文章就来尝试利用scrapy爬取各种网站的数据. 爬取百思不得姐首先一步一步来,我们先从爬最简单的文本开始.这里爬取的就是百思不得姐的的段子,都是文本. 首先打开段子页面,用F12工具查看元素.然后用下面的命令打开scrapyshell. scrapy shell http://www.budejie.com/text/ 稍加分析即可得到我们要获取的数据,在介绍scrapy的第一篇文章中我
python爬虫scrapy图书分类实例讲解

我们去图书馆的时候,会直接去自己喜欢的分类栏目找寻书籍.如果其中的分类不是很细致的话,想找某一本书还是有一些困难的.同样的如果我们获取了一些图书的数据,原始的文件里各种数据混杂在一起,非常不利于我们的查找和使用.所以今天小编教大家如何用python爬虫中scrapy给图书分类,大家一起学习下: spider抓取程序: 在贴上代码之前,先对抓取的页面和链接做一个分析: 网址:http://category.dangdang.com/pg4-cp01.25.17.00.00.00.html 这个是当
python中用Scrapy实现定时爬虫的实例讲解

一般网站发布信息会在具体实现范围内发布,我们在进行网络爬虫的过程中,可以通过设置定时爬虫,定时的爬取网站的内容.使用python爬虫框架Scrapy框架可以实现定时爬虫,而且可以根据我们的时间需求,方便的修改定时的时间. 1.Scrapy介绍 Scrapy是python的爬虫框架,用于抓取web站点并从页面中提取结构化的数据.任何人都可以根据需求方便的修改.Scrapy用途广泛,可以用于数据挖掘.监测和自动化测试. 2.使用Scrapy框架定时爬取 import time from scrapy
Python爬虫基础之初次使用scrapy爬虫实例

项目需求在专门供爬虫初学者训练爬虫技术的网站(http://quotes.toscrape.com)上爬取名言警句. 创建项目在开始爬取之前,必须创建一个新的Scrapy项目.进入您打算存储代码的目录中,运行下列命令: (base) λ scrapy startproject quotes New scrapy project 'quotes ', using template directory 'd: \anaconda3\lib\site-packages\scrapy\temp1at