Android NEON Optimization: A Practical Example

2025-02-17 07:08:21

Setting up the experiment environment

First, create a new project that includes native code.

Then add NEON support in the Gradle build file:

        externalNativeBuild {
            cmake {
                cppFlags "-std=c++14"
                arguments "-DANDROID_ARM_NEON=TRUE"
            }
        }

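With this in place, the project can use NEON acceleration.

A first try

The simplest NEON program follows roughly this flow: 1. load data into NEON registers; 2. perform the computation; 3. write the result from the NEON registers back to memory.

It is hard to explain this without an example, so here is a very simple one: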

#include <jni.h>
#include <string>
#include <arm_neon.h>
#include <android/log.h>
#define LOG_TAG "TEST_NEON"
#define LOGD(...) __android_log_print(ANDROID_LOG_DEBUG, LOG_TAG, __VA_ARGS__)
#define LOGI(...) __android_log_print(ANDROID_LOG_INFO, LOG_TAG, __VA_ARGS__)
extern "C"{
void test()
{
    int16_t result[8];
    int8x8_t a = vdup_n_s8(121);
    int8x8_t b = vdup_n_s8(2);
    int16x8_t c;
    c = vmull_s8(a,b);
    vst1q_s16(result,c);
    for(int i=0;i<8;i++){
        LOGD("data[%d] is %d ",i,result[i]);
    }
}
JNIEXPORT jstring
JNICALL
Java_com_example_javer_myapplication_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    test();
    return env->NewStringUTF(hello.c_str());
}
}

Output:

09-07 12:03:08.335 11709-11709/? D/TEST_NEON:
data[0] is 242
data[1] is 242
data[2] is 242
data[3] is 242
data[4] is 242
data[5] is 242
data[6] is 242
data[7] is 242

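In this code, the test function multiplies two 64-bit NEON registers.

vdup is the duplicate instruction. Here it copies the 8-bit value 121 into every lane of a 64-bit register. A 64-bit register holds eight 8-bit values, so the NEON register behind a now contains eight copies of 121.

Multiplying two 8-bit numbers can produce a result that needs 16 bits, so the eight products must be kept in a 128-bit register. int16x8_t denotes such a 128-bit register.

vmull_s8 multiplies a and b and stores the result in c. Because c lives in a 128-bit NEON register, the result still has to be written back to memory.

vst1q_s16 writes the data in c back to the memory pointed to by result.

This is a simple test of NEON instructions, and it shows the principle behind NEON acceleration clearly: eight 8-bit values are loaded into a 64-bit register at once, and a single instruction multiplies two such groups of eight values lane by lane.

Doesn't that mean the efficiency improves by nearly 8 times? Of course it is not that ideal, since loading the data and writing it back also take time.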
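The widening step described above can be checked without a device. The following is a minimal scalar sketch, not NEON code: widening_mul_s8 and truncating_mul_s8 are hypothetical helper names introduced here for illustration, and they only model what vmull_s8 does lane by lane. It shows why 121 * 2 needs a 16-bit lane: the product 242 is above 127, the largest value an int8_t can hold.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of vmull_s8 (not part of arm_neon.h):
// multiply lane by lane and widen each product from 8 to 16 bits.
std::array<int16_t, 8> widening_mul_s8(const std::array<int8_t, 8>& a,
                                       const std::array<int8_t, 8>& b) {
    std::array<int16_t, 8> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        // Both operands are promoted to int before the multiplication,
        // so the full product is available to store in the 16-bit lane.
        out[i] = static_cast<int16_t>(a[i] * b[i]);
    }
    return out;
}

// What a non-widening 8-bit multiply would keep: only the low 8 bits.
int8_t truncating_mul_s8(int8_t a, int8_t b) {
    return static_cast<int8_t>(static_cast<uint8_t>(a * b));
}
```

With a filled with 121 and b filled with 2, every lane returned by widening_mul_s8 holds 242, which matches the log output above, while the truncating version would wrap around on common compilers.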

A practical attempt

Next, let's try a relatively simple piece of code that converts RGB to grayscale:

void normal_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    for (i=0; i<n; i++)
    {
        int r = *src++; // load red
        int g = *src++; // load green
        int b = *src++; // load blue
        // build weighted average:
        int y = (r*77)+(g*151)+(b*28);
        // undo the scale by 256 and write to memory:
        *dest++ = (y>>8);
    }
}
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n/=8;
    for (i=0; i<n; i++)
    {
        uint16x8_t  temp;
        uint8x8x3_t rgb  = vld3_u8 (src);
        uint8x8_t result;
        temp = vmull_u8 (rgb.val[0],      rfac);
        temp = vmlal_u8 (temp,rgb.val[1], gfac);
        temp = vmlal_u8 (temp,rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src  += 8*3;
        dest += 8;
    }
}
void test1()
{
    // Prepare an image, generated in software, laid out as rgb rgb ...
    uint32_t const array_size = 2048*2048;
    uint8_t * rgb = new uint8_t[array_size*3];
    for(int i=0;i<array_size;i++){
        rgb[i*3]=234;
        rgb[i*3+1]=94;
        rgb[i*3+2]=23;
    }
    // The grayscale image is 1/3 the size of the RGB data
    uint8_t * gray = new uint8_t[array_size];
    struct timeval tv1,tv2;
    gettimeofday(&tv1,NULL);
    normal_convert(gray,rgb,array_size);
    gettimeofday(&tv2,NULL);
    LOGD("pure cpu cost time:%ld",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec));
    gettimeofday(&tv1,NULL);
    neon_convert(gray,rgb,array_size);
    gettimeofday(&tv2,NULL);
    LOGD("neon cost time:%ld",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec));
    delete[] rgb;
    delete[] gray;
}
JNIEXPORT jstring
JNICALL
Java_com_example_javer_myapplication_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    test1();
    return env->NewStringUTF(hello.c_str());
}

I won't explain every instruction one by one; read the code alongside the NEON instruction set reference.
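As a brief orientation for the intrinsics used above: vld3_u8 loads 24 bytes and de-interleaves them into three registers (8 red, 8 green, 8 blue values), vmull_u8 is a widening multiply, vmlal_u8 is a widening multiply-accumulate, and vshrn_n_u16 shifts each 16-bit lane right and narrows it back to 8 bits. The fixed-point arithmetic behind the weights can be sketched in scalar C++; gray_of and the constant names below are hypothetical, chosen only for this illustration.

```cpp
#include <cstdint>

// Fixed-point weights taken from the article's code.
constexpr int kRedWeight = 77;
constexpr int kGreenWeight = 151;
constexpr int kBlueWeight = 28;

// The weights sum to 256, so ">> 8" undoes the scaling exactly.
static_assert(kRedWeight + kGreenWeight + kBlueWeight == 256,
              "weights must sum to 256");

// Worst case (all channels 255) still fits in an unsigned 16-bit lane,
// which is why a uint16x8_t accumulator is sufficient.
static_assert(255 * (kRedWeight + kGreenWeight + kBlueWeight) <= 65535,
              "accumulator must not overflow 16 bits");

// Scalar model of one pixel of the conversion.
constexpr uint8_t gray_of(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint8_t>(
        (r * kRedWeight + g * kGreenWeight + b * kBlueWeight) >> 8);
}
```

For the test pixel used in test1, (234, 94, 23), the weighted sum is 18018 + 14194 + 644 = 32856, and 32856 >> 8 gives 128.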

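The plain CPU version takes 53 ms and the NEON intrinsics version takes 43 ms. The improvement is very limited and far from the expected speedup of nearly 8 times. This is mainly because the assembly generated from the C code is not concise enough, which greatly reduces efficiency. So next, the code is optimized with hand-written assembly.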
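Timings like these come from a single run each, so they can be affected by cache warm-up and scheduling noise. A minimal sketch of a slightly more robust measurement follows, assuming it is acceptable to run each conversion several times; best_time_us is a hypothetical helper, not part of the article's code, and it uses std::chrono::steady_clock instead of gettimeofday.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: runs fn `repeats` times and returns the fastest
// run in microseconds. Taking the minimum reduces the influence of
// one-off effects such as cold caches.
template <typename F>
int64_t best_time_us(F&& fn, int repeats = 5) {
    int64_t best = std::numeric_limits<int64_t>::max();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const int64_t elapsed =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count();
        best = std::min(best, elapsed);
    }
    return best;
}
```

It could be used as best_time_us([&] { neon_convert(gray, rgb, array_size); }), with the result logged through LOGD in the same way as before.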

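Adding assembly support to CMake

To compile assembly files with CMake, we need to declare support for the assembly language in CMakeLists.txt. Adding ENABLE_LANGUAGE(ASM) is enough to enable it; after that, add the assembly file to the library's sources. The complete CMakeLists.txt is included here for reference: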

# For more information about using CMake with Android Studio, read the
# documentation: https://d.android.com/studio/projects/add-native-code.html
# Sets the minimum version of CMake required to build the native library.
cmake_minimum_required(VERSION 3.4.1)
# Creates and names a library, sets it as either STATIC
# or SHARED, and provides the relative paths to its source code.
# You can define multiple libraries, and CMake builds them for you.
# Gradle automatically packages shared libraries with your APK.
ENABLE_LANGUAGE(ASM)
add_library( # Sets the name of the library.
             native-lib
             # Sets the library as a shared library.
             SHARED
             # Provides a relative path to your source file(s).
             src/main/cpp/Neon.S
             src/main/cpp/native-lib.cpp
             )
# Searches for a specified prebuilt library and stores the path as a
# variable. Because CMake includes system libraries in the search path by
# default, you only need to specify the name of the public NDK library
# you want to add. CMake verifies that the library exists before
# completing its build.
find_library( # Sets the name of the path variable.
              log-lib
              # Specifies the name of the NDK library that
              # you want CMake to locate.
              log )
# Specifies libraries CMake should link to your target library. You
# can link multiple libraries, such as libraries you define in this
# build script, prebuilt third-party libraries, or system libraries.
target_link_libraries( # Specifies the target library.
                       native-lib
                       # Links the target library to the log library
                       # included in the NDK.
                       ${log-lib} )

Implementing the NEON optimization in assembly

Then declare the function in the cpp file:

void neon_asm_convert(uint8_t * dest, uint8_t * src,int n);

Note that this declaration must be placed inside extern "C". Then implement the neon_asm_convert function in Neon.S:

.globl neon_asm_convert
neon_asm_convert:
      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:
      push        {r4-r5,lr}
      lsr         r2, r2, #3
      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5
  .loop:
      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!
      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5
      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!
      subs        r2, r2, #1
      bne         .loop
      pop         { r4-r5, pc }

To check that the results are correct, I wrote a dedicated comparison function:

int compare(uint8_t *a,uint8_t* b,int n)
{
    for(int i=0;i<n;i++){
        if(a[i]!=b[i]){
            return -1;
        }
    }
    return 0;
}

The comparison result is printed after the elapsed time:

LOGD("neon c cost time:%ld,result is %d",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec),result);
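compare only reports whether the two buffers differ. When a conversion does go wrong, it can help to know where the first difference is. The following is a small sketch of that idea using std::mismatch; first_mismatch is a hypothetical helper name, and unlike compare it returns -1 when the buffers are identical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the index of the first differing byte,
// or -1 when the first n bytes of a and b are identical.
std::ptrdiff_t first_mismatch(const uint8_t* a, const uint8_t* b,
                              std::size_t n) {
    const auto pos = std::mismatch(a, a + n, b);
    return pos.first == a + n ? -1 : pos.first - a;
}
```

A call such as first_mismatch(gray_cpu, gray_neon_asm, array_size) would then point directly at the first pixel to inspect.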

Comparison of the three versions:

09-07 17:12:19.946 25861-25861/com.example.javer.myapplication D/TEST_NEON: pure cpu cost time:57073
09-07 17:12:20.012 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon c cost time:45460,result is 0
09-07 17:12:20.034 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon asm cost time:3397,result is 0
09-07 17:12:25.271 25861-25861/com.example.javer.myapplication D/TEST_NEON: pure cpu cost time:57404
09-07 17:12:25.336 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon c cost time:45166,result is 0
09-07 17:12:25.359 25861-25861/com.example.javer.myapplication D/TEST_NEON: neon asm cost time:3493,result is 0

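In the end, the assembly version produces exactly the correct result and is more than 16 times faster! I could hardly believe the improvement was this large, yet the compared outputs are completely identical.

If there is a problem with the program, I would be grateful to anyone who points it out.

Finally, the complete code. native-lib.cpp: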

#include <jni.h>
#include <string>
#include <arm_neon.h>
#include <android/log.h>
#include <sys/time.h>
#define LOG_TAG "TEST_NEON"
#define LOGD(...) __android_log_print(ANDROID_LOG_DEBUG, LOG_TAG, __VA_ARGS__)
#define LOGI(...) __android_log_print(ANDROID_LOG_INFO, LOG_TAG, __VA_ARGS__)
extern "C"{
void neon_asm_convert(uint8_t * dest, uint8_t * src,int n);
void test()
{
    int16_t result[8];
    int8x8_t a = vdup_n_s8(121);
    int8x8_t b = vdup_n_s8(2);
    int16x8_t c;
    c = vmull_s8(a,b);
    vst1q_s16(result,c);
    for(int i=0;i<8;i++){
        LOGD("data[%d] is %d ",i,result[i]);
    }
}
void normal_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    for (i=0; i<n; i++)
    {
        int r = *src++; // load red
        int g = *src++; // load green
        int b = *src++; // load blue
        // build weighted average:
        int y = (r*77)+(g*151)+(b*28);
        // undo the scale by 256 and write to memory:
        *dest++ = (y>>8);
    }
}
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n/=8;
    for (i=0; i<n; i++)
    {
        uint16x8_t  temp;
        uint8x8x3_t rgb  = vld3_u8 (src);
        uint8x8_t result;
        temp = vmull_u8 (rgb.val[0],      rfac);
        temp = vmlal_u8 (temp,rgb.val[1], gfac);
        temp = vmlal_u8 (temp,rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);
        vst1_u8 (dest, result);
        src  += 8*3;
        dest += 8;
    }
}
int compare(uint8_t *a,uint8_t* b,int n)
{
    for(int i=0;i<n;i++){
        if(a[i]!=b[i]){
            return -1;
        }
    }
    return 0;
}
void test1()
{
    // Prepare an image, generated in software, laid out as rgb rgb ...
    uint32_t const array_size = 2048*2048;
    uint8_t * rgb = new uint8_t[array_size*3];
    for(int i=0;i<array_size;i++){
        rgb[i*3]=234;
        rgb[i*3+1]=94;
        rgb[i*3+2]=23;
    }
    // Each grayscale image is 1/3 the size of the RGB data
    uint8_t * gray_cpu = new uint8_t[array_size];
    uint8_t * gray_neon = new uint8_t[array_size];
    uint8_t * gray_neon_asm = new uint8_t[array_size];
    struct timeval tv1,tv2;
    gettimeofday(&tv1,NULL);
    normal_convert(gray_cpu,rgb,array_size);
    gettimeofday(&tv2,NULL);
    LOGD("pure cpu cost time:%ld",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec));
    gettimeofday(&tv1,NULL);
    neon_convert(gray_neon,rgb,array_size);
    gettimeofday(&tv2,NULL);
    int result = compare(gray_cpu,gray_neon,array_size);
    LOGD("neon c cost time:%ld,result is %d",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec),result);
    gettimeofday(&tv1,NULL);
    neon_asm_convert(gray_neon_asm,rgb,array_size);
    gettimeofday(&tv2,NULL);
    result = compare(gray_cpu,gray_neon_asm,array_size);
    LOGD("neon asm cost time:%ld,result is %d",(tv2.tv_sec-tv1.tv_sec)*1000000+(tv2.tv_usec-tv1.tv_usec),result);
    delete[] rgb;
    delete[] gray_cpu;
    delete[] gray_neon;
    delete[] gray_neon_asm;
}
JNIEXPORT jstring
JNICALL
Java_com_example_javer_myapplication_MainActivity_stringFromJNI(
        JNIEnv *env,
        jobject /* this */) {
    std::string hello = "Hello from C++";
    test1();
    return env->NewStringUTF(hello.c_str());
}
}

Neon.S

.globl neon_asm_convert
neon_asm_convert:
      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:
      push        {r4-r5,lr}
      lsr         r2, r2, #3
      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5
  .loop:
      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!
      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5
      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!
      subs        r2, r2, #1
      bne         .loop
      pop         { r4-r5, pc }
