Example code: integrating ES with Spring Boot for full-text search of disk files

2025-04-01 17:18:40

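A friend recently asked me how to search a large volume of on-disk material by directory, file name and file body, with the requirements that the solution be simple, efficient, easy to maintain and cheap. After some thought I concluded that using ES to index and search the documents was a suitable choice, so I wrote some code to implement it. The design and implementation are described below.

Overall architecture

Because the disk files are spread across different machines, the system is built around disk-scanning agents: the scanning service is deployed as an agent on each server that hosts a target disk and runs there as a scheduled task, while all indexes are written to a central ES, which is of course deployed as a distributed, highly available cluster. The search service is deployed together with the scanning agent, which simplifies the architecture and keeps the system distributed.

[Figure: architecture for fast disk-file retrieval]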

Deploying ES

ES (Elasticsearch) is the only third-party software this project depends on. ES can be deployed with Docker; the steps are as follows:

docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2

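Once deployed, open http://localhost:9200 in a browser. If the page loads and shows a response like the one below, ES has been deployed successfully.

[Figure: ES response page]

Project structure

[Figure: project structure]

Dependencies

Besides the basic Spring Boot starters, the project needs the ES-related packages. The listings further down also import Lombok, fastjson, Apache POI and Swagger, which are not included in this excerpt of the pom.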

  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>
    <dependency>
      <groupId>io.searchbox</groupId>
      <artifactId>jest</artifactId>
      <version>5.3.3</version>
    </dependency>
    <dependency>
      <groupId>net.sf.jmimemagic</groupId>
      <artifactId>jmimemagic</artifactId>
      <version>0.1.4</version>
    </dependency>
  </dependencies>

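Configuration file

The ES address must be configured in application.yml. To keep the program simple, the root directory of the disk to scan (index-root) is configured there as well; the scan task then recursively traverses every indexable file under that directory.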

server:
 port: @elasticsearch.port@
spring:
 application:
  name: @project.artifactId@
 profiles:
  active: dev
 elasticsearch:
  jest:
   uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace

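Index data structure

Because the directory a file lives in, the file name and the file body all have to be searchable, each of them is defined as an index field. The id is annotated with the JestId annotation that the ES client requires.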

package com.crazyice.lee.accumulation.search.data;

import io.searchbox.annotations.JestId;
import lombok.Data;

@Data
public class Article {
  @JestId
  private Integer id;
  private String author;
  private String title;
  private String path;
  private String content;
  private String fileFingerprint;
}

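Scanning the disk and creating the index

Because every file under the given directory has to be scanned, the directory is traversed recursively, and files that have already been processed are marked to improve efficiency. Two methods are available for file type detection: a more precise check based on the file content (Magic), and a rough check based on the file extension. This part is the core component of the whole system.

A small trick

Compute the MD5 of each target file's content and store it in an ES index field as the file fingerprint. Each time the index is rebuilt, check whether that MD5 already exists; if it does, the file does not need to be indexed again. This avoids duplicate index entries and also avoids re-processing files after the system restarts.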
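
The Md5CaculateUtil class imported by the listing below is not shown in this article. As a rough idea of what such a helper could look like, here is a minimal sketch. The class name Md5Sketch and the md5Hex helper are placeholders of mine; only the getMD5(File) signature mirrors the way the listing calls it. The file is streamed through java.security.MessageDigest so that large files are never loaded into memory in one piece.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the article's Md5CaculateUtil, whose source is not shown.
class Md5Sketch {

  // Streams the file through MD5 in 8 KB chunks and returns the lowercase hex digest.
  static String getMD5(File file) {
    try (InputStream in = new FileInputStream(file)) {
      MessageDigest md5 = newMd5();
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        md5.update(buffer, 0, read);
      }
      return toHex(md5.digest());
    } catch (IOException e) {
      // Unchecked, because the listing calls getMD5 outside of any try block
      throw new UncheckedIOException(e);
    }
  }

  // Same digest for data that is already in memory.
  static String md5Hex(byte[] data) {
    return toHex(newMd5().digest(data));
  }

  private static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      // Every Java SE implementation is required to provide MD5
      throw new IllegalStateException(e);
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }
}
```

One consequence of fingerprinting by content alone: two files with identical bytes get the same fingerprint, so copies of the same document stored under different paths are indexed only once.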

package com.crazyice.lee.accumulation.search.service;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;
import io.searchbox.client.JestClient;
import io.searchbox.core.Index;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import net.sf.jmimemagic.*;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

@Component
@Slf4j
public class DirectoryRecurse {

  @Autowired
  private JestClient jestClient;

  // Read the file content and convert it to a string
  private String readToString(File file, String fileType) {
    StringBuffer result = new StringBuffer();
    switch (fileType) {
      case "text/plain":
      case "java":
      case "c":
      case "cpp":
      case "txt":
        try {
          // Read the whole file; a single InputStream.read() call may return fewer bytes than requested
          byte[] fileContent = java.nio.file.Files.readAllBytes(file.toPath());
          result.append(new String(fileContent, java.nio.charset.StandardCharsets.UTF_8));
        } catch (IOException e) {
          log.error("{}", e.getLocalizedMessage());
        }
        break;
      case "doc":
        // Use WordExtractor from the HWPF component to extract the text of a Word document
        try (FileInputStream in = new FileInputStream(file)) {
          WordExtractor extractor = new WordExtractor(in);
          result.append(extractor.getText());
        } catch (Exception e) {
          log.error("{}", e.getLocalizedMessage());
        }
        break;
      case "docx":
        try (FileInputStream in = new FileInputStream(file); XWPFDocument doc = new XWPFDocument(in)) {
          XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
          result.append(extractor.getText());
        } catch (Exception e) {
          log.error("{}", e.getLocalizedMessage());
        }
        break;
    }
    return result.toString();
  }

  // Check whether the file has already been indexed
  private JSONObject isIndex(File file) {
    JSONObject result = new JSONObject();
    // Use the MD5 as the file fingerprint and search for it in the index
    String fileFingerprint = Md5CaculateUtil.getMD5(file);
    result.put("fileFingerprint", fileFingerprint);
    // Default to "not indexed" so callers never read a missing key when the search fails
    result.put("isIndex", false);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
    Search search = new Search.Builder(searchSourceBuilder.toString()).addIndex("diskfile").addType("files").build();
    try {
      // Execute the search
      SearchResult searchResult = jestClient.execute(search);
      // getTotal() is null when the search did not succeed, e.g. the index does not exist yet
      if (searchResult.isSucceeded() && searchResult.getTotal() != null && searchResult.getTotal() > 0) {
        result.put("isIndex", true);
      }
    } catch (IOException e) {
      log.error("{}", e.getLocalizedMessage());
    }
    return result;
  }

  // Index the file's directory, name and content
  private void createIndex(File file, String method) {
    // Skip temporary files, i.e. file names starting with ~$
    if (file.getName().startsWith("~$")) return;

    String fileType = null;
    switch (method) {
      case "magic":
        Magic parser = new Magic();
        try {
          MagicMatch match = parser.getMagicMatch(file, false);
          fileType = match.getMimeType();
        } catch (MagicParseException e) {
          //log.error("{}",e.getLocalizedMessage());
        } catch (MagicMatchNotFoundException e) {
          //log.error("{}",e.getLocalizedMessage());
        } catch (MagicException e) {
          //log.error("{}",e.getLocalizedMessage());
        }
        break;
      case "ext":
        String filename = file.getName();
        String[] strArray = filename.split("\\.");
        int suffixIndex = strArray.length - 1;
        fileType = strArray[suffixIndex];
    }

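    // Detection may fail (for example when Magic finds no match); skip files whose type is unknown
    if (fileType == null) return;

    switch (fileType) {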
      case "text/plain":
      case "java":
      case "c":
      case "cpp":
      case "txt":
      case "doc":
      case "docx":
        JSONObject isIndexResult = isIndex(file);
        log.info("File: {}, type: {}, MD5: {}, already indexed: {}", file.getPath(), fileType, isIndexResult.getString("fileFingerprint"), isIndexResult.getBoolean("isIndex"));

        if (isIndexResult.getBoolean("isIndex")) break;
        // 1. Build the document to index (save) in ES
        Article article = new Article();
        article.setTitle(file.getName());
        article.setAuthor(file.getParent());
        article.setPath(file.getPath());
        article.setContent(readToString(file, fileType));
        article.setFileFingerprint(isIndexResult.getString("fileFingerprint"));
        // 2. Build the index request
        Index index = new Index.Builder(article).index("diskfile").type("files").build();
        try {
          // 3. Execute; isSucceeded() avoids a NullPointerException when a failed request returns no id
          if (jestClient.execute(index).isSucceeded()) {
            log.info("Index built successfully!");
          }
        } catch (IOException e) {
          log.error("{}", e.getLocalizedMessage());
        }
        break;
    }
  }

  public void find(String pathName) throws IOException {
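    // Get the File object for pathName
    File dirFile = new File(pathName);

    // If the file or directory does not exist, log a notice and stop
    if (!dirFile.exists()) {
      log.info("{} does not exist", pathName);
      return;
    }

    // If it is not a directory, index it when it is a regular file
    if (!dirFile.isDirectory()) {
      if (dirFile.isFile()) {
        createIndex(dirFile, "ext");
      }
      return;
    }

    // List the names of all files and directories in this directory
    String[] fileList = dirFile.list();
    // list() returns null when the directory cannot be read (e.g. permission denied)
    if (fileList == null) return;

    for (int i = 0; i < fileList.length; i++) {
      // Walk the directory entries
      String string = fileList[i];
      File file = new File(dirFile.getPath(), string);
      // If it is a directory, recurse into it
      if (file.isDirectory()) {
        // Recurse
        find(file.getCanonicalPath());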
      } else {
        createIndex(file, "ext");
      }
    }
  }
}
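
As an alternative to the hand-written recursion above, the same traversal can be sketched with java.nio.file.Files.walk. This is a sketch under assumptions, not the code used in the project: the class and method names are hypothetical, the extension list simply mirrors the switch in createIndex, and, unlike the listing, extensions are lower-cased so that a file such as Report.DOCX is also picked up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects the files the scanner would consider indexable.
class IndexableFileWalker {

  // Same extensions as the switch in createIndex (assumption: "ext" detection only)
  private static final Set<String> INDEXABLE =
      new HashSet<>(Arrays.asList("java", "c", "cpp", "txt", "doc", "docx"));

  // Returns the lower-cased text after the last dot, or "" when there is no dot.
  static String extensionOf(String fileName) {
    int dot = fileName.lastIndexOf('.');
    return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
  }

  // Skips Office temp files (~$...) and anything with an unsupported extension.
  static boolean isIndexable(Path path) {
    String name = path.getFileName().toString();
    return !name.startsWith("~$") && INDEXABLE.contains(extensionOf(name));
  }

  // Walks the tree below root and returns the indexable regular files, sorted by path.
  static List<Path> findIndexable(Path root) {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(IndexableFileWalker::isIndexable)
          .sorted()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Each returned path could then be handed to the indexing step; the try-with-resources block matters because the stream returned by Files.walk holds open directory handles until it is closed.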

Scan task

A scheduled task scans the configured directory so that the index is built incrementally as files change.

package com.crazyice.lee.accumulation.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.io.IOException;

@Configuration
@Component
@Slf4j
public class CreateIndexTask {
  @Autowired
  private DirectoryRecurse directoryRecurse;

  @Value("${index-root}")
  private String indexRoot;

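  // Spring cron expressions have six fields: second minute hour day-of-month month day-of-week.
  // "* 0/5 * * * ?" would fire every second during each fifth minute; "0 0/5 * * * ?" fires once every five minutes.
  @Scheduled(cron = "0 0/5 * * * ?")
  private void addIndex(){
    try {
      directoryRecurse.find(indexRoot);
      // writeIndexStatus() is not defined in the DirectoryRecurse listing above, so the call is left commented out
      // directoryRecurse.writeIndexStatus();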
    } catch (IOException e) {
      log.error("{}",e.getLocalizedMessage());
    }
  }
}

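Search service

The search service is exposed as a RESTful endpoint. Matching keywords are returned to the front-end UI in highlighted form, and the browser can render the results from the returned JSON.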

package com.crazyice.lee.accumulation.search.web;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import io.swagger.annotations.ApiImplicitParam;
import io.swagger.annotations.ApiImplicitParams;
import io.swagger.annotations.ApiOperation;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.lang.NonNull;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@RestController
@Slf4j
public class Controller {
  @Autowired
  private JestClient jestClient;

  @RequestMapping(value = "/search/{keyword}",method = RequestMethod.GET)
  @ApiOperation(value = "Search for a keyword across all fields",notes = "ES verification")
  @ApiImplicitParams(
      @ApiImplicitParam(name = "keyword",value = "Full-text search keyword",required = true,paramType = "path",dataType = "String")
  )
  public List search(@PathVariable String keyword){
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));

    HighlightBuilder highlightBuilder = new HighlightBuilder();
    // Highlight the path field
    HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
    highlightPath.highlighterType("unified");
    highlightBuilder.field(highlightPath);
    // Highlight the title field
    HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
    highlightTitle.highlighterType("unified");
    highlightBuilder.field(highlightTitle);
    // Highlight the content field
    HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
    highlightContent.highlighterType("unified");
    highlightBuilder.field(highlightContent);

    // Apply the highlight configuration
    searchSourceBuilder.highlighter(highlightBuilder);

    log.info("Search request: {}",searchSourceBuilder.toString());

    // Build the search against the same index and type that DirectoryRecurse writes to
    Search search = new Search.Builder(searchSourceBuilder.toString()).addIndex( "diskfile" ).addType( "files" ).build();
    try {
      // Execute
      SearchResult result = jestClient.execute( search );
      return result.getHits(Article.class);
    } catch (IOException e) {
      log.error("{}",e.getLocalizedMessage());
    }
    return null;
  }
}
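
To make it clearer what the controller sends to ES, the sketch below hand-builds a request body that is roughly equivalent to the query and highlight settings configured above. SearchBodySketch and its methods are hypothetical helpers written for illustration; the real SearchSourceBuilder.toString() output contains additional default parameters, so treat this as an approximation of the shape only.

```java
// Hypothetical helper: builds the approximate JSON body of the search request by hand.
class SearchBodySketch {

  // Escapes quotes, backslashes and control characters so the value is a valid JSON string.
  static String escapeJson(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      if (ch == '"' || ch == '\\') {
        sb.append('\\').append(ch);
      } else if (ch < 0x20) {
        sb.append(String.format("\\u%04x", (int) ch));
      } else {
        sb.append(ch);
      }
    }
    return sb.toString();
  }

  // query_string query on the keyword, plus one "unified" highlighter entry per field.
  static String buildBody(String keyword, String... highlightFields) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"query\":{\"query_string\":{\"query\":\"")
      .append(escapeJson(keyword))
      .append("\"}},\"highlight\":{\"fields\":{");
    for (int i = 0; i < highlightFields.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append('"').append(escapeJson(highlightFields[i])).append("\":{\"type\":\"unified\"}");
    }
    sb.append("}}}");
    return sb.toString();
  }
}
```

Calling SearchBodySketch.buildBody("spring", "path", "title", "content") yields a body with a query_string clause and three highlight fields. Note that query_string parses the keyword with Lucene query syntax, so characters such as : or / in user input are interpreted as syntax rather than as literal text.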

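Testing the RESTful search results

The API is tested through Swagger. The keyword parameter is the term to look for in the full-text search.

[Figure: search results]

Generating the UI with Thymeleaf

The Thymeleaf template engine is integrated to present the search results directly as web pages. The templates consist of a main search page and a search results page, implemented with the @Controller annotation and the Model object.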

<body>
  <div class="container">
    <div class="header">
      <form action="./search" class="parent">
        <input type="text" name="keyword" th:value="${keyword}">
        <input type="submit" value="Search">
      </form>
    </div>

    <div class="content" th:each="article,memberStat:${articles}">
      <div class="c_left">
        <p class="con-title" th:text="${article.title}"/>
        <p class="con-path" th:text="${article.path}"/>
        <p class="con-preview" th:utext="${article.highlightContent}"/>
        <a class="con-more">More</a>
      </div>
      <div class="c_right">
        <p class="con-all" th:utext="${article.content}"/>
      </div>
    </div>

    <script language="JavaScript">
      document.querySelectorAll('.con-more').forEach(item => {
        item.onclick = () => {
        item.style.cssText = 'display: none';
        item.parentNode.querySelector('.con-preview').style.cssText = 'max-height: none;';
      }});
    </script>
  </div>
</body>
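
The template above reads article.highlightContent, which is not a field of the Article class shown earlier, and the @Controller that fills it is not shown either. One possible way to derive such a preview is sketched below, assuming the highlight fragments of a hit are available as a Map<String, List<String>> keyed by field name (the shape of the highlight object in an ES search response). HighlightSketch, previewFor and escapeHtml are hypothetical names.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the highlight fragments of one hit into preview HTML.
class HighlightSketch {

  // Minimal HTML escaping for the fallback text, because th:utext renders its value unescaped.
  static String escapeHtml(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
  }

  // Joins the fragments for the given field; falls back to an escaped prefix of the plain text.
  static String previewFor(Map<String, List<String>> highlight, String field,
                           String fallback, int maxLength) {
    List<String> fragments = highlight == null ? null : highlight.get(field);
    if (fragments != null && !fragments.isEmpty()) {
      // Fragments already contain the highlight tags added by ES, so they are passed through as-is
      return String.join(" ... ", fragments);
    }
    if (fallback == null) {
      return "";
    }
    String cut = fallback.length() <= maxLength ? fallback : fallback.substring(0, maxLength);
    return escapeHtml(cut);
  }
}
```

By default ES wraps matched terms in em tags, which is why the preview has to be rendered with th:utext rather than th:text. The fallback branch escapes its text for the same reason: anything passed to th:utext reaches the browser as raw HTML.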

That is all for this article; I hope it is helpful for your study.
