Adding Layer-by-Layer Operators in PyTorch-MLU: A Detailed Guide

2025-06-04 17:13:42

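In PyTorch-MLU's layer-by-layer mode, the tensor is the basic unit for passing and storing data between operators. PyTorch-MLU dispatches each operator to a device according to the device attribute of its tensors. Taking the abs() operator as an example, the dispatch stage routes the operator call to a specific device based on the device attribute of input_tensor (the original article illustrates this logic with a figure).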
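The dispatch idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: FakeTensor, register, and dispatch are hypothetical names invented for this sketch, not part of PyTorch or Catch, and the "mlu" kernel simply runs on the host.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: values plus a device tag."""
    data: List[float]
    device: str  # "cpu" or "mlu"


# Maps (operator name, device) to the kernel registered for that pair.
_KERNELS: Dict[Tuple[str, str], Callable[[FakeTensor], FakeTensor]] = {}


def register(op: str, device: str):
    """Decorator that registers a kernel for one operator on one device."""
    def decorator(fn):
        _KERNELS[(op, device)] = fn
        return fn
    return decorator


@register("abs", "cpu")
def abs_cpu(t: FakeTensor) -> FakeTensor:
    return FakeTensor([abs(v) for v in t.data], "cpu")


@register("abs", "mlu")
def abs_mlu(t: FakeTensor) -> FakeTensor:
    # A real backend would launch a device kernel here; this sketch
    # computes on the host and only tags the result as "mlu".
    return FakeTensor([abs(v) for v in t.data], "mlu")


def dispatch(op: str, t: FakeTensor) -> FakeTensor:
    """Pick the kernel from the tensor's device attribute and call it."""
    kernel = _KERNELS.get((op, t.device))
    if kernel is None:
        raise NotImplementedError(f"{op} is not registered for {t.device!r}")
    return kernel(t)


result = dispatch("abs", FakeTensor([-1.0, 2.0, -3.0], "mlu"))
```

Because the kernel table is filled by registration rather than hard-coded in dispatch, a new device backend can be added without touching the dispatcher, which is the same decoupling the article attributes to Catch.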

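Catch decouples itself from the PyTorch source code by adding MLU operators through registration. The steps for adding an MLU operator in Catch are described below.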

1. Register the operator

Register the operator in catch/torch_mlu/csrc/generated/aten_mlu_type_default.cpp:

.op(torch::RegisterOperators::options().schema("aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")  // NOLINT
  .impl_unboxedOnlyKernel<at::Tensor(const at::Tensor &, const at::Tensor &, at::Scalar), &AtenMluType::add>(at::TensorTypeId::MLUTensorId)
  .aliasAnalysis(c10::AliasAnalysisKind::FROM_SCHEMA))

2. Dispatch the operator

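AtenMluType and AtenMluCustomType are the entry points for operators in the Catch module. The AtenMluType class mainly contains the framework's standard operators, while the AtenMluCustomType class contains custom operators. Depending on the nature of the operator, add its declaration and implementation to either AtenMluType or AtenMluCustomType.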

Dispatching standard operators

Add the operator's declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_type.cpp:

aten_mlu_type.h
static at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
aten_mlu_type.cpp
at::Tensor AtenMluType::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  return OP_DISPATCH(add, self, other, alpha);
}

Dispatching custom operators

For MLU-specific operators, add the operator's declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_custom_type.cpp:

aten_mlu_type.h
static at::Tensor linear(const at::Tensor& input,
                         const at::Tensor& weight,
                         const at::Tensor& bias,
                         const at::Tensor& q_scale,
                         const at::Tensor& q_mode);
aten_mlu_custom_type.cpp
at::Tensor AtenMluCustomType::linear(const at::Tensor& input,
                                     const at::Tensor& weight,
                                     const at::Tensor& bias,
                                     const at::Tensor& q_scale,
                                     const at::Tensor& q_mode){
    return OP_DISPATCH(linear, input, weight, bias, q_scale, q_mode);
}

3. Modify the OpMethods base class

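Both AtenMluType and AtenMluCustomType pass operators down through OpMethods to the inference or training operators. Add the operator's declaration and implementation in catch/torch_mlu/csrc/aten/operators/op_methods.h and catch/torch_mlu/csrc/aten/operators/op_methods.cpp. The implementation in OpMethods is the operator's CPU implementation.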

op_methods.h
virtual at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha){
   auto input_cpu = self.cpu();
   auto other_cpu = other.cpu();
   auto output = at::add(input_cpu, other_cpu, alpha);
   return output.to(at::Device(at::Device::Type::MLU));
}
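The role of the base class can be pictured with a small Python sketch: the base class supplies a CPU default for every operator, and a backend subclass overrides only the operators it supports. The class DeviceOps, the sub operator, and the ("cpu", ...) result tuples are hypothetical and exist only to make the control flow visible; they are not Catch code.

```python
class OpMethods:
    """Base class: every operator defaults to a CPU implementation."""

    def add(self, a, b, alpha=1):
        # Same semantics as the operator schema: a + alpha * b.
        return ("cpu", [x + alpha * y for x, y in zip(a, b)])

    def sub(self, a, b, alpha=1):
        return ("cpu", [x - alpha * y for x, y in zip(a, b)])


class DeviceOps(OpMethods):
    """Hypothetical device backend that overrides only what it supports."""

    def add(self, a, b, alpha=1):
        # A real backend would call its device wrapper here.
        return ("device", [x + alpha * y for x, y in zip(a, b)])


ops = DeviceOps()
added = ops.add([1, 2], [3, 4], alpha=2)   # handled by the override
subtracted = ops.sub([5, 5], [1, 2])       # inherited CPU default
```

With this layout, an operator that the device backend has not implemented yet still works, because the call resolves to the inherited CPU version.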

4. Pass down the operator

Add the inference operator's declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml_ops.h and catch/torch_mlu/csrc/aten/operators/cnml_ops.cpp.

cnml_ops.h
at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
cnml_ops.cpp
at::Tensor CnmlOps::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  // CNML_DISPATCH: the first argument is this interface's name, the second is
  // the wrapper's name, and the rest are the interface's arguments.
  CNML_DISPATCH(add, cnml_add, self, other, alpha);
}

5. Add the wrapper

A wrapper encapsulates an operator's kernel, and each operator has one wrapper. Taking the add operator as an example, the wrapper is added as follows:

cnml_kernel.h
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha);
add.cpp
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha_scalar){
  TORCH_CHECK(input.dim() >= 0 || other.dim() >= 0, "dimension not support");
  at::Tensor input_ = input;
  at::Tensor other_ = other;
  auto alpha_data = alpha_scalar.to<scalar_t>();
  if(alpha_data != 1){
    // scale_t
    other_ = cnml::ops::cnml_scale(other_, alpha_data, 0);
  }
  if(other_.dim() < 1 && other_.device().type() == c10::DeviceType::CPU){
    auto other_scalar = other_.item();
    return cnml_add_internal(input_, other_scalar);   // call the kernel
  }
  if(input_.dim() < 1 && input_.device().type() == c10::DeviceType::CPU){
    auto input_scalar = input_.item();
    return cnml_add_internal(other_, input_scalar);   // call the kernel
  }

  bool broadcast = input_.sizes() != other_.sizes();
  if(broadcast){
    auto broadcast_size = at::infer_size(input.sizes(), other.sizes());
    at::Tensor broadcast1 = cnml::ops::cnml_expand(input_, broadcast_size, false);
    at::Tensor broadcast2 = cnml::ops::cnml_expand(other_, broadcast_size, false);
    return cnml_add_internal(broadcast1, broadcast2);  // call the kernel
  }
  return cnml_add_internal(input_, other_);   // call the kernel
}
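When the two shapes differ, the wrapper computes a common broadcast shape with at::infer_size and expands both operands to it before calling the kernel. The shape rule itself is the standard NumPy/PyTorch broadcasting rule; the pure-Python function below is a sketch of that rule for illustration, not the ATen implementation.

```python
from itertools import zip_longest


def infer_size(a, b):
    """Return the broadcast shape of shapes a and b.

    Shapes are aligned from the trailing dimension. Two dimensions are
    compatible when they are equal or when one of them is 1; a missing
    leading dimension is treated as 1.
    """
    result = []
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"shapes {a} and {b} cannot be broadcast")
    return tuple(reversed(result))


image_plus_bias = infer_size((1, 3, 224, 224), (3, 1, 1))
column_plus_row = infer_size((2, 1), (1, 5))
```

Once the common shape is known, each operand only needs an expand to that shape, which is what the two cnml_expand calls in the wrapper do.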

6. Add the kernel

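The wrapper implements the operator by calling a kernel; in this example it calls cnml_add_internal. The operator's actual implementation is done mainly by calling the CNML library's interfaces. The original article illustrates the CNML library's programming logic with a figure.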

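The kernel is implemented by calling the CNML library interfaces according to that programming logic. Add the kernel function's declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml/internal/cnml_internal.h and catch/torch_mlu/csrc/aten/operators/cnml/internal/add_internal.cpp.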

cnml_internal.h
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2);
add_internal.cpp
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2){
  auto output = at::native::empty_like(input1);
  // prepare input cnml tensor
  auto* input1_impl = getMluTensorImpl(input1);  // get the MluTensorImpl
  auto input1_cnml = input1_impl->CreateCnmlTensor(
       CNML_TENSOR, toCnmlDataType(input1.dtype()));  // toCnmlDataType() adapts the data type

  auto* input2_impl = getMluTensorImpl(input2);
  auto input2_cnml = input2_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(input2.dtype()));

  // prepare output cnml tensor
  auto* output_impl = getMluTensorImpl(output);
  auto output_cnml = output_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(output.dtype()));

  // End the execution flow if not MLU device
  CHECK_MLU_DEVICE(output);

  // setup operator
  cnmlBaseOp_t add_op;
  TORCH_CNML_CHECK(cnmlCreateAddOp(&add_op, input1_cnml, input2_cnml, output_cnml));

  // return to JIT if running mode is fuse
  CHECK_RETURN_TO_FUSE(add_op, output);

  // compile op
  TORCH_CNML_CHECK(cnmlCompileBaseOp(add_op, GET_CORE_VERSION, GET_CORE_NUMBER));

  auto queue = getCurQueue();
  TORCH_CNML_CHECK(cnmlComputeAddOpForward_V4(add_op,
                                              NULL,
                                              input1_impl->raw_mutable_data(),
                                              NULL,
                                              input2_impl->raw_mutable_data(),
                                              NULL,
                                              output_impl->raw_mutable_data(),
                                              queue,
                                              NULL));
  syncQueue(queue);
  TORCH_CNML_CHECK(cnmlDestroyBaseOp(&add_op));

  return output;
}
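The kernel above follows a fixed lifecycle: create the op, compile it, run the computation, synchronize the queue, and destroy the op. The sketch below mirrors only that ordering in Python. FakeOp, base_op, and the calls log are hypothetical stand-ins, not CNML bindings. The context manager is a design choice of this sketch rather than something the original code does: it guarantees the destroy step runs even when an earlier step raises.

```python
from contextlib import contextmanager

calls = []  # records the order of lifecycle steps, for illustration only


class FakeOp:
    """Hypothetical stand-in for a CNML base op."""

    def __init__(self, name):
        self.name = name
        calls.append("create")

    def compile(self):
        calls.append("compile")

    def compute(self, a, b):
        calls.append("compute")
        return [x + y for x, y in zip(a, b)]

    def destroy(self):
        calls.append("destroy")


@contextmanager
def base_op(name):
    """Create an op and always destroy it, even if a step fails."""
    op = FakeOp(name)
    try:
        yield op
    finally:
        op.destroy()


def add_internal(a, b):
    with base_op("add") as op:
        op.compile()
        out = op.compute(a, b)
        calls.append("sync")  # stands in for waiting on the device queue
    return out


output = add_internal([1.0, 2.0], [3.0, 4.0])
```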

Handling operators not supported on MLU

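For operations that MLU does not yet support, the input data is copied to the CPU, the corresponding CPU operation runs there, and the output is then copied back to the MLU. For the implementation, see op_methods.cpp in the catch/torch_mlu/csrc/aten/operators/ directory.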

op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha){
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));
}

If a newly added operator throws an exception during execution and no corresponding CPU operator exists, the operation cannot fall back to the CPU.
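The fallback behaviour described here can be sketched as follows. The function run_with_cpu_fallback and the two dictionaries of callables are hypothetical, and catching RuntimeError is an assumption made for the sketch; the actual Catch mechanism lives inside its C++ dispatch macros.

```python
def run_with_cpu_fallback(op_name, device_ops, cpu_ops, *args):
    """Try the device kernel first, then fall back to the CPU kernel.

    device_ops and cpu_ops map operator names to callables. If the device
    kernel is missing or raises, the CPU kernel is used. If there is no
    CPU kernel either, the operator cannot run at all.
    """
    device_fn = device_ops.get(op_name)
    if device_fn is not None:
        try:
            return "device", device_fn(*args)
        except RuntimeError:
            pass  # fall through to the CPU path
    cpu_fn = cpu_ops.get(op_name)
    if cpu_fn is None:
        raise NotImplementedError(f"no CPU implementation for {op_name!r}")
    return "cpu", cpu_fn(*args)


def failing_mul(a, b):
    raise RuntimeError("device kernel failed")


device_ops = {"add": lambda a, b: a + b, "mul": failing_mul}
cpu_ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

on_device = run_with_cpu_fallback("add", device_ops, cpu_ops, 2, 3)
fell_back = run_with_cpu_fallback("mul", device_ops, cpu_ops, 2, 3)
```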

Wrappers are generally named cnml_<operator name>, and kernels are generally named cnml_<operator name>_internal.

7. Test the operator

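Write operator unit tests with Python's unittest module. A test supplies the same parameters and input data to the operator on both the MLU and the CPU, then compares the two outputs. MLU and CPU results may differ; in general, a relative error within 2% is acceptable.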

def test_add(self):
  # "Tensor + Tensor" mode testing
  for shape1, shape2 in [((1,3,224,224),(1,3,224,224)),((2,30,80),(2,30,80)),((3,20),(3,20)),((10,),(10,))]:
    input1_cpu = torch.rand(shape1, dtype=torch.float)
    input2_cpu = torch.rand(shape2, dtype=torch.float)
    input1_mlu = input1_cpu.to(xm.mlu_device())
    input2_mlu = input2_cpu.to(xm.mlu_device())
    # compute on the CPU
    output_cpu = input1_cpu + input2_cpu
    # compute on the MLU
    output_mlu = input1_mlu + input2_mlu
    # measure the MLU error and make sure the relative error is within 2%
    self.assertTensorsEqual(output_cpu, output_mlu.cpu(), 0.02, use_MSE=True)
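The article does not spell out how assertTensorsEqual computes the error when use_MSE=True. One plausible definition, assumed here purely for illustration, is the relative L2 error: the square root of the summed squared differences divided by the summed squares of the reference. The helper relative_l2_error and the sample values are hypothetical.

```python
import math


def relative_l2_error(reference, actual):
    """Return sqrt(sum((r - a)^2) / sum(r^2)) for two equal-length sequences.

    Falls back to the absolute L2 error when the reference is all zeros,
    so the function never divides by zero.
    """
    if len(reference) != len(actual):
        raise ValueError("sequences must have the same length")
    diff_sq = sum((r - a) ** 2 for r, a in zip(reference, actual))
    ref_sq = sum(r ** 2 for r in reference)
    if ref_sq == 0:
        return math.sqrt(diff_sq)
    return math.sqrt(diff_sq / ref_sq)


cpu_out = [1.0, 2.0, 3.0]      # reference result
mlu_out = [1.01, 1.99, 3.02]   # made-up device result with small deviations
err = relative_l2_error(cpu_out, mlu_out)
within_tolerance = err <= 0.02
```

A relative metric like this scales with the magnitude of the reference, so the same 2% threshold can be applied to outputs of very different ranges.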

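This article has shown how to add a layer-by-layer operator to PyTorch-MLU on Cambricon devices, using the add() operator as a worked example. I hope it helps with your learning.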
