Example code for building a website crawler with Spring Boot + WebMagic

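A while ago a company project needed to scrape various kinds of data. My Python is not strong, so I looked into Java crawler options instead; this post is a summary.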

Development environment:

Spring Boot 2.2.6, JDK 1.8.

1. Add the dependencies

<!-- WebMagic core -->
  <dependency>
   <groupId>us.codecraft</groupId>
   <artifactId>webmagic-core</artifactId>
   <version>0.7.3</version>
   <!-- Optional: uncomment the block below to exclude the logging binding bundled with WebMagic (it logs a lot) -->
<!--   <exclusions>-->
<!--    <exclusion>-->
<!--     <groupId>org.slf4j</groupId>-->
<!--     <artifactId>slf4j-log4j12</artifactId>-->
<!--    </exclusion>-->
<!--   </exclusions>-->
  </dependency>
  <!-- WebMagic extension -->
  <dependency>
   <groupId>us.codecraft</groupId>
   <artifactId>webmagic-extension</artifactId>
   <version>0.7.3</version>
  </dependency>

  <!-- Guava, needed for WebMagic's Bloom filter support -->
  <dependency>
   <groupId>com.google.guava</groupId>
   <artifactId>guava</artifactId>
   <version>16.0</version>
  </dependency>
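
The Guava dependency is there because the crawler below de-duplicates URLs with a Bloom filter. To illustrate the idea only, here is a minimal sketch using nothing but the JDK. It is not the implementation used by WebMagic or Guava; the class name `SimpleUrlBloomFilter`, the bit-array size, and the hash functions are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Minimal Bloom filter sketch for URL de-duplication (illustrative only). */
class SimpleUrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleUrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a hash, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(byte[] data, int i) {
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, size);
    }

    /**
     * Returns true if the URL has possibly been seen before.
     * Otherwise records it and returns false.
     */
    boolean isDuplicate(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            int idx = index(data, i);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return seen;
    }
}
```

A Bloom filter never forgets a URL it has recorded, but it can occasionally report an unseen URL as a duplicate. That trade-off buys a fixed, small memory footprint, which is why it suits large crawls.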

Without further ado, here is the code.

Basic example

The code below uses a list-style page as the example.

package com.crawler.project.proTask;

import com.alibaba.fastjson.JSONObject;
import org.springframework.scheduling.annotation.Scheduled;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.selector.Selectable;

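import org.springframework.stereotype.Component;

import java.util.List;

// Registered as a Spring bean so that @Scheduled is picked up;
// scheduling must also be enabled with @EnableScheduling on a configuration class.
@Component
public class TaskProcessor implements PageProcessor {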

 /*
 * Crawler business logic.
 * */
 @Override
 public void process(Page page) {

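  // 1. The crawler receives a page; parse the list on it.
  List<Selectable> list = page.getHtml().css("css selector").nodes();
  if (list.size() > 0) { // A list page: extract the link of every list item and add it to the queue of pages to fetch.
   for (Selectable selectable : list) {
    // Add each item's link to the queue of pages to fetch.
    page.addTargetRequest(selectable.links().toString());
   }
   // Also enqueue the URL of the next page.
   page.addTargetRequest("URL of the next page");
  } else {
   // Otherwise this is the detail page of a single list item.
   // Extract and process the required data in a custom method.
   handle(page);
  }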
 }

 private void handle(Page page) {

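  // For example, the processed data is a JSONObject.
  JSONObject tmp = new JSONObject();

  // Hand tmp over to the custom TaskPipeline class. If no custom Pipeline is set on the Spider, the framework prints tmp to the console by default.
  page.putField("obj",tmp);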
 }

 /*
 * Parameters that configure the crawl.
 * */
 private Site site = Site.me()
   .setCharset("UTF-8")
   .setTimeOut(60 * 1000)
   .setRetrySleepTime(60 * 1000)
   .setCycleRetryTimes(5);
 @Override
 public Site getSite() {
  return site;
 }

 /*
 * Scheduled task that runs the crawler.
 * */
 @Scheduled(initialDelay = 1 * 1000,fixedDelay = 2 * 1000)
 public void process(){
  System.out.println("Starting the crawl task");
  Spider.create(new TaskProcessor()) // The class name here must match the current class.
    .addUrl("URL of the start page")
    .addPipeline(new TaskPipeline()) // Custom data-processing class (receives what handle() stores).
    .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(100000)))
    .thread(3) // Number of threads (not too many; ideally equal to the number of items on a list page).
    .run();
 }
}
// TaskPipeline.java (a separate file in the same package)
package com.crawler.project.proTask;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class TaskPipeline implements Pipeline {
 @Override
 public void process(ResultItems resultItems, Task task) {
  // "obj" is the key set in TaskProcessor.handle(); list pages do not set it.
  Object obj = resultItems.getAll().get("obj");
  if (obj != null) {
   JSONObject jsonObject = JSON.parseObject(obj.toString());
   // With the JSONObject in hand, custom business processing goes here.
  }
 }
}
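
To make the control flow of the basic example easier to follow, here is a small JDK-only sketch of the same idea: a queue of pending URLs, a set that drops duplicates, and a branch that treats a page either as a list page (enqueue its links) or as a detail page (hand it on). It does not use WebMagic. The in-memory `site` map, the class name `CrawlFlowSketch`, and the URLs are all hypothetical stand-ins for real pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlFlowSketch {

    /** Crawls an in-memory "site" (URL -> links found on that page) and returns the detail pages reached. */
    static List<String> crawl(Map<String, List<String>> site, String startUrl) {
        Deque<String> queue = new ArrayDeque<>();   // pending requests (the scheduler's queue)
        Set<String> seen = new HashSet<>();         // plays the role of the duplicate remover
        List<String> details = new ArrayList<>();   // what a pipeline would receive

        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            List<String> links = site.getOrDefault(url, Collections.<String>emptyList());
            if (!links.isEmpty()) {
                // List page: enqueue item links and the next page, skipping URLs already seen.
                for (String link : links) {
                    if (seen.add(link)) {
                        queue.add(link);
                    }
                }
            } else {
                // Detail page: hand the result over for processing.
                details.add(url);
            }
        }
        return details;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new LinkedHashMap<>();
        site.put("/list?page=1", Arrays.asList("/item/1", "/item/2", "/list?page=2"));
        site.put("/list?page=2", Arrays.asList("/item/2", "/item/3"));
        System.out.println(crawl(site, "/list?page=1")); // [/item/1, /item/2, /item/3]
    }
}
```

Note that `/item/2` is linked from both list pages but is visited only once, which is the job the duplicate remover does in the real spider.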

Special case 1

Downloading images or files from a link

E.g. the detail page mentioned above contains an iframe.

1. First get the iframe's src

// Get the iframe's src (check whether it is an absolute or a relative path; a relative path must be joined with the main site URL)
String src = html.css("css selector", "src").toString();
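
For the relative-path case mentioned in the comment above, one option is to let `java.net.URI` do the joining rather than concatenating strings by hand. The sketch below assumes the URL of the page that contains the iframe is known; the helper name `toAbsolute` and the example URLs are hypothetical.

```java
import java.net.URI;

class IframeUrlResolver {

    /** Resolves an iframe src against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String src) {
        // An already absolute src is returned unchanged; a relative one is
        // resolved against the page URL.
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/detail/42.html";
        System.out.println(toAbsolute(page, "/player/frame.html"));
        System.out.println(toAbsolute(page, "frame.html"));
        System.out.println(toAbsolute(page, "https://cdn.example.com/frame.html"));
    }
}
```

`URI.create` throws `IllegalArgumentException` for strings that are not valid URIs (for example, ones containing spaces), so real-world src values may need encoding first.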

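// Parse with jsoup (the second argument is the timeout in milliseconds)
Document document = Jsoup.parse(new URL(src),1000);
// Select the required element
Element ele = document.select("css selector").last();
// Get the link of the file to download
String downUrl = ele.attr("href");
// Download the file from the link; the file name is returned
String fileName = downloadFile(downUrl);
// Download a file from a URL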
public String downloadFile(String fileUrl) throws FileNotFoundException{
 try{
   URL httpUrl = new URL(fileUrl);
   // Random file name; the .mp3 extension is hard-coded in this example
   String fileName = UUID.randomUUID().toString() + ".mp3";
   File file = new File(this.STATIC_FILEPATH + fileName);
   System.out.println("============ file save method called ===============");
   FileUtils.copyURLToFile(httpUrl,file);
   return fileName;
  }catch (Exception e){
   e.printStackTrace();
   return null;
  }
}
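
The method above relies on Commons IO and always saves with an .mp3 extension. As an alternative sketch using only the JDK, the version below takes the extension from the URL path and sets explicit timeouts. The class name `FileDownloadSketch`, the timeout values, and the `targetDir` parameter are assumptions for illustration, not part of the original project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

class FileDownloadSketch {

    /** Returns the extension of the URL path including the dot, or "" if there is none. */
    static String extensionOf(URL url) {
        String path = url.getPath();          // the query string is not part of the path
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        return dot > slash ? path.substring(dot) : "";
    }

    /** Downloads the URL into targetDir under a random name and returns the saved path. */
    static Path download(URL url, Path targetDir) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(10_000);       // milliseconds
        conn.setReadTimeout(60_000);
        Path target = targetDir.resolve(UUID.randomUUID() + extensionOf(url));
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Unlike the original method, this sketch lets the `IOException` propagate instead of returning null, so the caller can decide whether to retry or skip the file.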

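Special case 2

Some https sites cannot be fetched with WebMagic's default downloader. In that case the downloader can be modified to match the SSL/TLS protocol the site uses.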
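
Before changing the downloader, it can help to check which protocol versions the running JVM supports and which it enables by default, since a handshake failure is often a mismatch between those and what the site accepts. The following is a small diagnostic sketch using only the JDK; the class name `TlsProtocolCheck` is made up for this example.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

class TlsProtocolCheck {

    /** Protocol names the default SSLContext can support (not all are necessarily enabled). */
    static List<String> supportedProtocols() throws NoSuchAlgorithmException {
        return Arrays.asList(SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
    }

    /** Protocol names the default SSLContext enables without further configuration. */
    static List<String> enabledByDefault() throws NoSuchAlgorithmException {
        return Arrays.asList(SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println("supported: " + supportedProtocols());
        System.out.println("enabled by default: " + enabledByDefault());
    }
}
```

The exact lists depend on the JDK version and its security configuration, so the output is best read on the machine that runs the crawler.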

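Create a package in the project to hold the customized downloader classes. The customized downloader can then be passed to the spider with Spider.setDownloader(...).

(Taken from HttpClientDownloader in the WebMagic framework and modified from that class.)

/*
A custom generator (HttpClientGenerator) must be supplied here.
*/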

package com.crawler.project.spider_download;

import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.AbstractDownloader;
import us.codecraft.webmagic.downloader.HttpClientRequestContext;
import us.codecraft.webmagic.downloader.HttpUriRequestConverter;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.ProxyProvider;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.CharsetUtils;
import us.codecraft.webmagic.utils.HttpClientUtils;

import java.io.IOException;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

/**
 * The http downloader based on HttpClient.
 *
 * @author code4crafter@gmail.com <br>
 * @since 0.1.0
 */
public class HttpClientDownloader extends AbstractDownloader {

 private Logger logger = LoggerFactory.getLogger(getClass());

 private final Map<String, CloseableHttpClient> httpClients = new HashMap<String, CloseableHttpClient>();

 // Custom generator: make sure the HttpClientGenerator used here is the custom class in this package, not the HttpClientGenerator class from the WebMagic dependency.
 private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();

 private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();

 private ProxyProvider proxyProvider;

 private boolean responseHeader = true;

 public void setHttpUriRequestConverter(HttpUriRequestConverter httpUriRequestConverter) {
  this.httpUriRequestConverter = httpUriRequestConverter;
 }

 public void setProxyProvider(ProxyProvider proxyProvider) {
  this.proxyProvider = proxyProvider;
 }

 private CloseableHttpClient getHttpClient(Site site) {
  if (site == null) {
   return httpClientGenerator.getClient(null);
  }
  String domain = site.getDomain();
  CloseableHttpClient httpClient = httpClients.get(domain);
  if (httpClient == null) {
   synchronized (this) {
    httpClient = httpClients.get(domain);
    if (httpClient == null) {
     httpClient = httpClientGenerator.getClient(site);
     httpClients.put(domain, httpClient);
    }
   }
  }
  return httpClient;
 }

 @Override
 public Page download(Request request, Task task) {
  if (task == null || task.getSite() == null) {
   throw new NullPointerException("task or site can not be null");
  }
  CloseableHttpResponse httpResponse = null;
  CloseableHttpClient httpClient = getHttpClient(task.getSite());
  Proxy proxy = proxyProvider != null ? proxyProvider.getProxy(task) : null;
  HttpClientRequestContext requestContext = httpUriRequestConverter.convert(request, task.getSite(), proxy);
  Page page = Page.fail();
  try {
   httpResponse = httpClient.execute(requestContext.getHttpUriRequest(), requestContext.getHttpClientContext());
   page = handleResponse(request, request.getCharset() != null ? request.getCharset() : task.getSite().getCharset(), httpResponse, task);
   onSuccess(request);
   logger.info("downloading page success {}", request.getUrl());
   return page;
  } catch (IOException e) {
   logger.warn("download page {} error", request.getUrl(), e);
   onError(request);
   return page;
  } finally {
   if (httpResponse != null) {
    //ensure the connection is released back to pool
    EntityUtils.consumeQuietly(httpResponse.getEntity());
   }
   if (proxyProvider != null && proxy != null) {
    proxyProvider.returnProxy(proxy, page, task);
   }
  }
 }

 @Override
 public void setThread(int thread) {
  httpClientGenerator.setPoolSize(thread);
 }

 protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
  byte[] bytes = IOUtils.toByteArray(httpResponse.getEntity().getContent());
  String contentType = httpResponse.getEntity().getContentType() == null ? "" : httpResponse.getEntity().getContentType().getValue();
  Page page = new Page();
  page.setBytes(bytes);
  if (!request.isBinaryContent()){
   if (charset == null) {
    charset = getHtmlCharset(contentType, bytes);
   }
   page.setCharset(charset);
   page.setRawText(new String(bytes, charset));
  }
  page.setUrl(new PlainText(request.getUrl()));
  page.setRequest(request);
  page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
  page.setDownloadSuccess(true);
  if (responseHeader) {
   page.setHeaders(HttpClientUtils.convertHeaders(httpResponse.getAllHeaders()));
  }
  return page;
 }

 private String getHtmlCharset(String contentType, byte[] contentBytes) throws IOException {
  String charset = CharsetUtils.detectCharset(contentType, contentBytes);
  if (charset == null) {
   charset = Charset.defaultCharset().name();
   logger.warn("Charset autodetect failed, use {} as charset. Please specify charset in Site.setCharset()", Charset.defaultCharset());
  }
  return charset;
 }
}

Then change the SSL-related parameters in the custom HttpClientGenerator class.

(Taken from HttpClientGenerator in the WebMagic framework and modified from that class.)

/*
Custom HttpClientGenerator
*/

package com.crawler.project.spider_download;

import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.client.CookieStore;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.config.SocketConfig;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.DefaultHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.*;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.protocol.HttpContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.downloader.CustomRedirectStrategy;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.Map;

/**
 * @author code4crafter@gmail.com <br>
 * @since 0.4.0
 */
public class HttpClientGenerator {

 private transient Logger logger = LoggerFactory.getLogger(getClass());

 private PoolingHttpClientConnectionManager connectionManager;

 public HttpClientGenerator() {
  Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
    .register("http", PlainConnectionSocketFactory.INSTANCE)
    .register("https", buildSSLConnectionSocketFactory())
    .build();
  connectionManager = new PoolingHttpClientConnectionManager(reg);
  connectionManager.setDefaultMaxPerRoute(100);
 }

 /*
 SSL-related parameters are set in this method.
 */
 private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
  try {
   return new SSLConnectionSocketFactory(createIgnoreVerifySSL(), new String[]{"SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2"},
     null,
     new DefaultHostnameVerifier()); // certificate checks are bypassed by the trust-all SSLContext; the host name is still verified
  } catch (KeyManagementException e) {
   logger.error("ssl connection fail", e);
  } catch (NoSuchAlgorithmException e) {
   logger.error("ssl connection fail", e);
  }
  return SSLConnectionSocketFactory.getSocketFactory();
 }

 private SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
 // Implement the X509TrustManager interface with empty methods so that certificate validation is skipped; the methods need no changes
 X509TrustManager trustManager = new X509TrustManager() {

 @Override
 public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
 }

 @Override
 public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
 }

 @Override
 public X509Certificate[] getAcceptedIssuers() {
 return null;
 }

 };

 /*
 The framework's default is:
 SSLContext sc = SSLContext.getInstance("SSLv3");
 Change the argument to the SSL/TLS protocol you need.
 */
 SSLContext sc = SSLContext.getInstance("TLS");
 sc.init(null, new TrustManager[] { trustManager }, null);
 return sc;
 }

 public HttpClientGenerator setPoolSize(int poolSize) {
  connectionManager.setMaxTotal(poolSize);
  return this;
 }

 public CloseableHttpClient getClient(Site site) {
  return generateClient(site);
 }

 private CloseableHttpClient generateClient(Site site) {
  HttpClientBuilder httpClientBuilder = HttpClients.custom();

  httpClientBuilder.setConnectionManager(connectionManager);
  if (site.getUserAgent() != null) {
   httpClientBuilder.setUserAgent(site.getUserAgent());
  } else {
   httpClientBuilder.setUserAgent("");
  }
  if (site.isUseGzip()) {
   httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {

    public void process(
      final HttpRequest request,
      final HttpContext context) throws HttpException, IOException {
     if (!request.containsHeader("Accept-Encoding")) {
      request.addHeader("Accept-Encoding", "gzip");
     }
    }
   });
  }
  // Work around the POST/redirect/POST 302 redirect problem
  httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());

  SocketConfig.Builder socketConfigBuilder = SocketConfig.custom();
  socketConfigBuilder.setSoKeepAlive(true).setTcpNoDelay(true);
  socketConfigBuilder.setSoTimeout(site.getTimeOut());
  SocketConfig socketConfig = socketConfigBuilder.build();
  httpClientBuilder.setDefaultSocketConfig(socketConfig);
  connectionManager.setDefaultSocketConfig(socketConfig);
  httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
  generateCookie(httpClientBuilder, site);
  return httpClientBuilder.build();
 }

 private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
  if (site.isDisableCookieManagement()) {
   httpClientBuilder.disableCookieManagement();
   return;
  }
  CookieStore cookieStore = new BasicCookieStore();
  for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
   BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
   cookie.setDomain(site.getDomain());
   cookieStore.addCookie(cookie);
  }
  for (Map.Entry<String, Map<String, String>> domainEntry : site.getAllCookies().entrySet()) {
   for (Map.Entry<String, String> cookieEntry : domainEntry.getValue().entrySet()) {
    BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
    cookie.setDomain(domainEntry.getKey());
    cookieStore.addCookie(cookie);
   }
  }
  httpClientBuilder.setDefaultCookieStore(cookieStore);
 }
}

That wraps up this summary of building a crawler on the WebMagic framework, including the use of jsoup.
