Java实现AC自动机全文检索示例

第一步,构建Trie树,定义Node类型:

/**
 * Created by zhaoyy on 2017/2/7.
 */
interface Node {

  char value();

  boolean exists();

  boolean isRoot();

  Node parent();

  Node childOf(char c);

  Node fail();

  void setFail(Node node);

  void setExists(boolean exists);

  void add(Node child);

  List<Node> children();
}

第二步,实现两种Node,如果词汇全是可打印的ASCII字符,就采用AsciiNode,否则(比如包含汉字),使用基于hash表的MapNode;这两种Node均集成自AbstractNode:

/**
 * Created by zhaoyy on 2017/2/8.
 */
abstract class AbstractNode implements Node {

  private static final char EMPTY = '\0';
  private final char value;
  private final Node parent;
  private boolean exists;
  private Node fail;

  public AbstractNode(Node parent, char value) {
    this.parent = parent;
    this.value = value;
    this.exists = false;
    this.fail = null;
  }

  public AbstractNode() {
    this(null, EMPTY);
  }

  private static String tab(int n) {
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < n; i++) {
      builder.append('\t');
    }
    return builder.toString();
  }

  private static String toString(Node node, int depth) {
    StringBuilder builder = new StringBuilder();
    String tab = tab(depth);
    Node fail = node.fail();
    Node parent = node.parent();
    builder
        .append(tab)
        .append('<')
        .append(node.value())
        .append(" exists=\"")
        .append(node.exists())
        .append('"')
        .append(" parent=\"")
        .append(parent == null ? "null" : parent.value())
        .append('"')
        .append(" fail=\"")
        .append(fail == null ? "null" : fail.value())
        .append("\">\r\n");
    for (Node child : node.children())
      builder.append(toString(child, depth + 1));
    builder
        .append(tab)
        .append("</")
        .append(node.value())
        .append(">\r\n");

    return builder.toString();
  }

  @Override
  public char value() {
    return value;
  }

  @Override
  public boolean exists() {
    return exists;
  }

  @Override
  public boolean isRoot() {
    return value == EMPTY;
  }

  @Override
  public Node parent() {
    return parent;
  }

  @Override
  public Node fail() {
    return fail;
  }

  @Override
  public void setFail(Node node) {
    this.fail = node;
  }

  @Override
  public void setExists(boolean exists) {
    this.exists = exists;
  }

  @Override
  public String toString() {
    return toString(this, 0);
  }
}

/////////////////////////////////////////////////////////////////////////////////////////////

/**
 * Created by zhaoyy on 2017/2/8.
 */
final class AsciiNode extends AbstractNode implements Node {

  private static final char FROM = 32;
  private static final char TO = 126;
  private final Node[] children;

  public AsciiNode() {
    super();
    this.children = new Node[TO - FROM + 1];
  }

  public AsciiNode(Node parent, char value) {
    super(parent, value);
    this.children = new Node[TO - FROM + 1];
  }

  @Override
  public Node childOf(char c) {
    if (c >= FROM && c <= TO)
      return children[(int) c - FROM];
    else return null;
  }

  @Override
  public void add(Node child) {
    int index = (int) child.value();
    children[index - FROM] = child;
  }

  @Override
  public List<Node> children() {
    List<Node> nodes = new ArrayList<Node>();
    for (Node child : children)
      if (child != null)
        nodes.add(child);
    return nodes;
  }
}

//////////////////////////////////////////////////////////////////////////////////////////////

/**
 * Created by zhaoyy on 2017/2/8.
 */
final class MapNode extends AbstractNode implements Node {

  private final Map<Character, Node> children;

  public MapNode() {
    super();
    this.children = new HashMap<Character, Node>();
  }

  public MapNode(Node parent, char value) {
    super(parent, value);
    this.children = new HashMap<Character, Node>();
  }

  @Override
  public Node childOf(char c) {
    return children.get(c);
  }

  @Override
  public void add(Node child) {
    children.put(child.value(), child);
  }

  @Override
  public List<Node> children() {
    List<Node> nodes = new ArrayList<Node>();
    nodes.addAll(children.values());
    return nodes;
  }
}

第三步,

首先定义一个Node构造器:

/**
 * Created by zhaoyy on 2017/2/8.
 */
public interface NodeMaker {

  Node make(Node parent, char value);

  Node makeRoot();
}

然后构建AC自动机,实现创建及查找方法:

/**
 * Created by zhaoyy on 2017/2/7.
 */
public final class WordTable {

  private final Node root;

  private WordTable(Collection<? extends CharSequence> words, NodeMaker maker) {
    Node root = buildTrie(words, maker);
    setFailNode(root);
    this.root = root;
  }

  public static WordTable compile(Collection<? extends CharSequence> words) {
    if (words == null || words.isEmpty())
      throw new IllegalArgumentException();
    final NodeMaker maker;
    if (isAscii(words))
      maker = new NodeMaker() {
        @Override
        public Node make(Node parent, char value) {
          return new AsciiNode(parent, value);
        }

        @Override
        public Node makeRoot() {
          return new AsciiNode();
        }
      };
    else maker = new NodeMaker() {
      @Override
      public Node make(Node parent, char value) {
        return new MapNode(parent, value);
      }

      @Override
      public Node makeRoot() {
        return new MapNode();
      }
    };
    return new WordTable(words, maker);
  }

  private static boolean isAscii(Collection<? extends CharSequence> words) {
    for (CharSequence word : words) {
      int len = word.length();
      for (int i = 0; i < len; i++) {
        int c = (int) word.charAt(i);
        if (c < 32 || c > 126)
          return false;
      }
    }
    return true;
  }

  private static Node buildTrie(Collection<? extends CharSequence> sequences, NodeMaker maker) {
    Node root = maker.makeRoot();
    for (CharSequence sequence : sequences) {
      int len = sequence.length();
      Node current = root;
      for (int i = 0; i < len; i++) {
        char c = sequence.charAt(i);
        Node node = current.childOf(c);
        if (node == null) {
          node = maker.make(current, c);
          current.add(node);
        }
        current = node;
        if (i == len - 1)
          node.setExists(true);
      }
    }
    return root;
  }

  private static void setFailNode(final Node root) {
    root.setFail(null);
    Queue<Node> queue = new LinkedList<Node>();
    queue.add(root);
    while (!queue.isEmpty()) {
      Node parent = queue.poll();
      Node temp;
      for (Node child : parent.children()) {
        if (parent.isRoot())
          child.setFail(root);
        else {
          temp = parent.fail();
          while (temp != null) {
            Node node = temp.childOf(child.value());
            if (node != null) {
              child.setFail(node);
              break;
            }
            temp = temp.fail();
          }
          if (temp == null)
            child.setFail(root);
        }
        queue.add(child);
      }
    }
  }

  public boolean findAnyIn(CharSequence cs) {
    int len = cs.length();
    Node node = root;
    for (int i = 0; i < len; i++) {
      Node next = node.childOf(cs.charAt(i));
      if (next == null) {
        next = node.fail();
        if (next == null) {
          node = root;
          continue;
        }
      }
      if (next.exists())
        return true;
    }

    return false;
  }

  public List<MatchInfo> search(CharSequence cs) {
    if (cs == null || cs.length() == 0)
      return Collections.emptyList();
    List<MatchInfo> result = new ArrayList<MatchInfo>();
    int len = cs.length();
    Node node = root;
    for (int i = 0; i < len; i++) {
      Node next = node.childOf(cs.charAt(i));
      if (next == null) {
        next = node.fail();
        if (next == null) {
          node = root;
          continue;
        }
      }
      if (next.exists()) {
        MatchInfo info = new MatchInfo(i, next);
        result.add(info);
        node = root;
        continue;
      }
      node = next;
    }
    return result;
  }

  @Override
  public String toString() {
    return root.toString();
  }
}

定义一个保存查找结果的实体:

/**
 * Created by zhaoyy on 2017/2/7.
 */
public final class MatchInfo {

  private final int index;
  private final String word;

  public MatchInfo(int index, String word) {
    this.index = index;
    this.word = word;
  }

  public MatchInfo(int index, Node node) {
    StringBuilder builder = new StringBuilder();
    while (node != null) {
      if (!node.isRoot())
        builder.append(node.value());
      node = node.parent();
    }
    String word = builder.reverse().toString();
    this.index = index + 1 - word.length();
    this.word = word;
  }

  public int getIndex() {
    return index;
  }

  public String getWord() {
    return word;
  }

  @Override
  public String toString() {
    return index + ":" + word;
  }
}

第四步,调用Demo:

public static void main(String[] args) {
    List<String> list = Arrays.asList("say", "her", "he", "she", "shr", "alone");
    WordTable table = WordTable.compile(list);
    System.out.println(table);
    System.out.println(table.search("1shesaynothingabouthislivinghimalone"));
  }

以下是输出结果:

< exists="false" parent="null" fail="null">
 <s exists="false" parent=" " fail=" ">
 <a exists="false" parent="s" fail="a">
  <y exists="true" parent="a" fail=" ">
  </y>
 </a>
 <h exists="false" parent="s" fail="h">
  <e exists="true" parent="h" fail="e">
  </e>
  <r exists="true" parent="h" fail=" ">
  </r>
 </h>
 </s>
 <h exists="false" parent=" " fail=" ">
 <e exists="true" parent="h" fail=" ">
  <r exists="true" parent="e" fail=" ">
  </r>
 </e>
 </h>
 <a exists="false" parent=" " fail=" ">
 <l exists="false" parent="a" fail=" ">
  <o exists="false" parent="l" fail=" ">
  <n exists="false" parent="o" fail=" ">
   <e exists="true" parent="n" fail=" ">
   </e>
  </n>
  </o>
 </l>
 </a>
</ >

[1:she, 4:say, 31:alone]

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持我们。

(0)

相关推荐

  • java servlet手机app访问接口(三)高德地图云存储及检索

    这篇关于高德地图的随笔内容会多一点, 一.业务说明 对应APP业务中的成员有两类,一是服务人员,二是被服务人员, 主要实现功能, 对APP中的服务人员位置进行时时定位, 然后通过被服务人员登录APP时提供的一个经纬度来计算服务人员与被服务人员之间的距离 单位m. 下面是整个详细流程,从创建高德对应应用(这里注册我就不说了)------最后完成此功能. 二.创建servlet对应的高德地图应用,创建自己的云图数据库表 注册帐号后登录点击右上角的控制台,会出现下面这个界面,我截图 这里当然是我已经注

  • 使用Java的Lucene搜索工具对检索结果进行分组和分页

    使用GroupingSearch对搜索结果进行分组 Package org.apache.lucene.search.grouping Description 这个模块可以对Lucene的搜索结果进行分组,指定的单值域被聚集到一起.比如,根据"author"域进行分组,"author"域值相同的的文档分成一个组. 进行分组的时候需要输入一些必要的信息: 1.groupField:根据这个域进行分组.比如,如果你使用"author"域进行分组,那么

  • Java采用setAsciiStream方法检索数据库指定内容实例解析

    本文实例展示了Java采用setAsciiStream()方法检索数据库的实例代码.使用参数查询必须在SQL 语句执行之前对参数进行赋值,赋值是使用PreparedStatement 对象的SetBoolean().SetInt().SetString().SetObject().SetNull()等方法来实现.这些方法建立了Java数据类型和SQL 数据类型的映射.JDBC 可以使用输入流作为SQL 语句的输入参数,设置输入流的方法有三个:setAsciiStream().setUnicode

  • Java实现AC自动机全文检索示例

    第一步,构建Trie树,定义Node类型: /** * Created by zhaoyy on 2017/2/7. */ interface Node { char value(); boolean exists(); boolean isRoot(); Node parent(); Node childOf(char c); Node fail(); void setFail(Node node); void setExists(boolean exists); void add(Node

  • 详解Java中AC自动机的原理与实现

    目录 简介 工作过程 数据结构 初始化 构建字典树 构建失败指针 匹配 执行结果 简介 AC自动机是一个多模式匹配算法,在模式匹配领域被广泛应用,举一个经典的例子,违禁词查找并替换为***.AC自动机其实是Trie树和KMP 算法的结合,首先将多模式串建立一个Tire树,然后结合KMP算法前缀与后缀匹配可以减少不必要比较的思想达到高效找到字符串中出现的匹配串. 如果不知道什么是Tire树,可以先查看:详解Java中字典树(Trie树)的图解与实现 如果不知道KMP算法,可以先查看:详解Java中

  • java编程之AC自动机工作原理与实现代码

    在阅读本文之前,大家可以先参考下<多模字符串匹配算法原理及Java实现代码> 简介: 本文是博主自身对AC自动机的原理的一些理解和看法,主要以举例的方式讲解,同时又配以相应的图片.代码实现部分也予以明确的注释,希望给大家不一样的感受.AC自动机主要用于多模式字符串的匹配,本质上是KMP算法的树形扩展.这篇文章主要介绍AC自动机的工作原理,并在此基础上用Java代码实现一个简易的AC自动机. 1.应用场景-多模字符串匹配 我们现在考虑这样一个问题,在一个文本串text中,我们想找出多个目标字符串

  • Java数据结构之AC自动机算法的实现

    目录 1 概念和原理 2 节点定义 3 构建Trie前缀树 4 构建fail失配指针 5 匹配文本 6 案例演示 7 总结 1 概念和原理 一般的字符串匹配算法都是匹配一个子串,例如KMP.Trie,那么如果同时匹配多个子串呢?此时就需要用到AC自动机了. AC自动机算法是一个多模式字符串匹配算法,在模式匹配领域被广泛应用,例如违禁词查找替换.搜索关键词查找等等. 关于Trie树和KMP算法,我们此前已经讲解过了: 前缀树Trie的实现原理以及Java代码的实现 KMP算法详解以及Java代码实

  • Java设计模式之适配器模式的示例详解

    目录 定义 分类 案例 需求 方案一:类适配器 方案二:对象适配器 方案三:接口适配器 对比分析 方案一:类适配器 方案二:对象适配器 方案三:接口适配器 总结 定义 适配器模式,即将某个类的接口转换成客户端期望的另一个接口的表示,主要目的是实现兼容性,让原本因为接口不匹配,没办法一起工作的两个类,可以协同工作. 分类 类适配器 对象适配器 接口适配器 案例 需求 手机充电,通过手机充电器将220V电压适配为5V 方案一:类适配器 定义220V交流电(被适配者的角色) /** * 220V交流电

  • Java算法之堆排序代码示例

    堆是一种特殊的完全二叉树,其特点是所有父节点都比子节点要小,或者所有父节点都比字节点要大.前一种称为最小堆,后一种称为最大堆. 比如下面这两个: 那么这个特性有什么作用?既然题目是堆排序,那么肯定能用来排序.想要用堆排序首先要创建一个堆,如果对4 3 6 2 7 1 5这七个数字做从小到大排序,需要用这七个数创建一个最大堆,来看代码: public class HeapSort { private int[] numbers; private int length; public HeapSor

  • Java单例模式实现静态内部类方法示例

    Singleton是众多设计模式中最容易理解的一种,也是众多设计模式中较为重要的一种设计模式.接下来我们看看具体介绍. Singleton模式实现的重点在于将构造函数私有化(private),并通过提供静态公有函数(public synchronized static xxx getInstance)来获取定义在类中的静态私有成员(private static xxx instance),通过一个简单的判断静态实例是否为空来控制这个类只能够new一次,即控制了一个类只能有单个实例,一般的实现如下

  • java实现网页爬虫的示例讲解

    这一篇目的就是在于网页爬虫的实现,对数据的获取,以便分析. 目录: 1.爬虫原理 2.本地文件数据提取及分析 3.单网页数据的读取 4.运用正则表达式完成超连接的连接匹配和提取 5.广度优先遍历,多网页的数据爬取 6.多线程的网页爬取 7.总结 爬虫实现原理 网络爬虫基本技术处理 网络爬虫是数据采集的一种方法,实际项目开发中,通过爬虫做数据采集一般只有以下几种情况: 1) 搜索引擎 2) 竞品调研 3) 舆情监控 4) 市场分析 网络爬虫的整体执行流程: 1) 确定一个(多个)种子网页 2) 进

  • java tostring方法重写代码示例

    当需要将一个对象输出到显示器时,通常要调用他的toString()方法,将对象的内容转换为字符串.java中的所有类默认都有一个toString()方法 默认情况下 System.out.println(对象名)或者System.out.println(对象名.toString())输出的是此对象的类名和此对象对应内存的首地址 如果想自定义输出信息必须重写toString()方法 注意事项 1.必须被声明为public 2.返回类型为String 3.方法的名称必须为toString,且无参数

  • Java连接postgresql数据库的示例代码

    本文介绍了Java连接postgresql数据库的示例代码,分享给大家,具体如下: 1.下载驱动jar 下载地址:https://jdbc.postgresql.org/download.html 2.导入jar包 新建lib文件夹,将下载的jar驱动包拖到文件夹中. 将jar驱动包添加到Libraries 3.程序代码如下:HelloWorld.java package test; import java.sql.Connection; import java.sql.DriverManage

随机推荐