Tracing the Root Cause of Abnormal loadavg Values

2025-04-03 05:29:05

proc

NAME:

proc - process information pseudo-filesystem

DESCRIPTION

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures.
It is commonly mounted at /proc. Most of it is read-only, but some files allow kernel variables to
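be changed.

In other words, proc is a pseudo-filesystem that exposes kernel data structures through a file interface. It is usually mounted at /proc. Most of its files are read-only, but some can be written to in order to change kernel variables.

See man proc for the full details. The excerpts quoted here come from that man page; my paraphrases may not be exact.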

loadavg

cat /proc/loadavg

/proc/loadavg
The first three fields in this file are load average figures giving the number of
jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5,
and 15 minutes.

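That is, the first three numbers are load averages: the number of tasks that are either in the run queue (state R) or waiting for disk I/O (state D), averaged over the last 1, 5, and 15 minutes.

The fourth field consists of two numbers separated by a slash (/). The first of these is the number of currently runnable kernel scheduling entities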
(processes, threads). The value after the slash is the number of kernel scheduling
entities that currently exist on the system.

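So in the fourth field, the number before the slash is the count of runnable kernel scheduling entities (a scheduling entity is a process or a thread), and the number after the slash is the count of all scheduling entities that exist on the system.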

The fifth field is the PID of the process that was most recently created on the system.

The fifth field is simply the PID of the newest process on the system.
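To make the five fields concrete, here is a minimal sketch that splits one line in /proc/loadavg format into its parts. The struct and function names (loadavg_fields, parse_loadavg_line, read_loadavg) are my own illustrative choices, not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the five fields of /proc/loadavg. */
struct loadavg_fields {
    double avg1, avg5, avg15;   /* 1-, 5- and 15-minute load averages */
    long   runnable;            /* scheduling entities currently runnable */
    long   total;               /* scheduling entities that exist */
    long   last_pid;            /* PID of the most recently created process */
};

/* Parse one line in /proc/loadavg format. Returns 0 on success, -1 otherwise. */
int parse_loadavg_line(const char *line, struct loadavg_fields *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, "%lf %lf %lf %ld/%ld %ld",
                   &out->avg1, &out->avg5, &out->avg15,
                   &out->runnable, &out->total, &out->last_pid);
    return n == 6 ? 0 : -1;
}

/* Read and parse the live file. Returns 0 on success, -1 on any failure. */
int read_loadavg(struct loadavg_fields *out)
{
    char buf[128];
    FILE *fp = fopen("/proc/loadavg", "r");
    if (fp == NULL)
        return -1;
    char *line = fgets(buf, sizeof buf, fp);
    fclose(fp);
    return line != NULL ? parse_loadavg_line(line, out) : -1;
}
```

Calling read_loadavg() from a small main() on the device and printing the fields gives the same numbers as cat /proc/loadavg, just already separated.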

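1: Origin of the problem

I work on large-screen devices, and one problem there puzzled me for a long time: the values in loadavg were extremely high. An 8-core phone typically shows 3+ or 4+, yet the device on my desk reached 70+. I recently set aside a solid block of time to study how this value is computed.

The kernel's fs/proc/loadavg.c contains the following function, which produces the final output.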

static int loadavg_proc_show(struct seq_file *m, void *v)
{
   unsigned long avnrun[3];
   get_avenrun(avnrun, FIXED_1/200, 0);
   seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %ld/%d %d\n",
      LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),  // 1-minute average
      LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),  // 5-minute average
      LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),  // 15-minute average
      // nr_running() returns the runnable entities; nr_threads is the number of all existing entities
      nr_running(), nr_threads,
      // PID of the most recently created process
      task_active_pid_ns(current)->last_pid);
   return 0;
}

The code above shows that the load averages themselves come from get_avenrun(), so let's look at its implementation next.

unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun); /* should be removed */
/**
 * get_avenrun - get the load average array
 * @loads: pointer to dest load array
 * @offset:    offset to add
 * @shift: shift count to shift the result left
 *
 * These values are estimates at best, so no need for locking.
 */
void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
{
    // The data comes from the avenrun array
   loads[0] = (avenrun[0] + offset) << shift;
   loads[1] = (avenrun[1] + offset) << shift;
   loads[2] = (avenrun[2] + offset) << shift;
}
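The values in avenrun are fixed-point numbers, and the FIXED_1/200 offset passed by loadavg_proc_show() adds 0.005 so that the two printed decimals are rounded to nearest. The sketch below reproduces that formatting in user space. The macro values are copied from the 4.9 headers as I read them (I use 1UL instead of the kernel's plain int constant), and format_load is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

#define FSHIFT       11                  /* bits of fractional precision */
#define FIXED_1      (1UL << FSHIFT)     /* 1.0 in fixed point (2048) */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* Format a raw avenrun-style value the way loadavg_proc_show() prints it.
 * Returns what snprintf() returns. */
int format_load(unsigned long raw, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    unsigned long v = raw + FIXED_1 / 200;   /* add 0.005: round to nearest 0.01 */
    return snprintf(buf, len, "%lu.%02lu", LOAD_INT(v), LOAD_FRAC(v));
}
```

For example, a raw value of 3 * FIXED_1 + FIXED_1 / 2 formats as 3.50, and a raw value of 2047 (just below 1.0) is rounded up and formats as 1.00 rather than 0.99.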

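2: Where the data comes from

Next we need to find where avenrun[] is assigned. Let's start with where the data comes from.

Kernel version: 4.9. Code paths: kernel/sched/core.c and kernel/sched/loadavg.c.

2.1: scheduler_tick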

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 * As the comment says, the timer code calls this function at a frequency of HZ.
 */
void scheduler_tick(void)
{
   int cpu = smp_processor_id();
   struct rq *rq = cpu_rq(cpu);
   struct task_struct *curr = rq->curr;
   sched_clock_tick();
   raw_spin_lock(&rq->lock);
   walt_set_window_start(rq);
   walt_update_task_ravg(rq->curr, rq, TASK_UPDATE,
         walt_ktime_clock(), 0);
   update_rq_clock(rq);
   curr->sched_class->task_tick(rq, curr, 0);
   cpu_load_update_active(rq);
   calc_global_load_tick(rq); // the load accounting is invoked here
   raw_spin_unlock(&rq->lock);
   perf_event_task_tick();
#ifdef CONFIG_SMP
   rq->idle_balance = idle_cpu(cpu);
   trigger_load_balance(rq);
#endif
   rq_last_tick_reset(rq);
   if (curr->sched_class == &fair_sched_class)
      check_for_migration(rq, curr);
}

2.2: calc_global_load_tick

/*
 * Called from scheduler_tick() to periodically update this CPU's
 * active count.
 */
void calc_global_load_tick(struct rq *this_rq)
{
   long delta;
    // Skip redundant load updates. The filter is based on jiffies, which is explained in the summary below.
   if (time_before(jiffies, this_rq->calc_load_update))
      return;
   // Fold this runqueue's change in active tasks
   delta  = calc_load_fold_active(this_rq, 0);
   if (delta)
       // Add the delta to calc_load_tasks; atomic_long_add is a kernel atomic operation
      atomic_long_add(delta, &calc_load_tasks);
    // Time of the next load update. LOAD_FREQ is defined in include/linux/sched.h:
    //   #define LOAD_FREQ   (5*HZ+1)   /* 5 sec intervals */
   this_rq->calc_load_update += LOAD_FREQ;
}
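The jiffies filter relies on a comparison that stays correct when the tick counter wraps around. Below is a user-space sketch of that idea. The two macros follow the kernel's definitions without their type checks, while load_update_due and advance_deadline are hypothetical helpers that only mirror the early return and the "+= LOAD_FREQ" step above.

```c
#include <assert.h>
#include <limits.h>

/* User-space copies of the kernel's jiffies comparison helpers
 * (the kernel versions also type-check their arguments). */
#define time_after(a, b)   ((long)((b) - (a)) < 0)
#define time_before(a, b)  time_after(b, a)

/* Returns 1 when an update is due, 0 when calc_global_load_tick()
 * would return early. Both arguments are tick counts. */
int load_update_due(unsigned long now, unsigned long next_update)
{
    return !time_before(now, next_update);
}

/* Advance the deadline by one period, as
 * this_rq->calc_load_update += LOAD_FREQ does. */
unsigned long advance_deadline(unsigned long deadline, unsigned long period)
{
    assert(period > 0);
    return deadline + period;   /* unsigned arithmetic simply wraps */
}
```

With a deadline of 10 that lies just past the wrap point and a current tick count of ULONG_MAX - 5, load_update_due() still reports that the update is not due yet, which a plain "now < next_update" comparison would get wrong.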

2.3: calc_load_fold_active

long calc_load_fold_active(struct rq *this_rq, long adjust)
{
   long nr_active, delta = 0;
   nr_active = this_rq->nr_running - adjust; // number of runnable tasks on this runqueue; adjust is passed as 0 here and is not discussed
   nr_active += (long)this_rq->nr_uninterruptible; // plus the number of uninterruptible tasks
    // calc_load_active holds the previous nr_running + nr_uninterruptible total; if the new total differs, compute the difference
   if (nr_active != this_rq->calc_load_active) {
      delta = nr_active - this_rq->calc_load_active;
      this_rq->calc_load_active = nr_active;
   }
    // Done; after the return, the caller adds the delta to calc_load_tasks.
   return delta;
}
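The delta bookkeeping is easier to see with numbers. The following is a simplified user-space model, assuming a plain long in place of the atomic calc_load_tasks counter; fake_rq, fold_active and fold_all are hypothetical names that exist only in this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the three struct rq fields involved. */
struct fake_rq {
    long nr_running;          /* runnable tasks on this CPU */
    long nr_uninterruptible;  /* per-CPU share; only the sum over all CPUs matters */
    long calc_load_active;    /* total reported at the previous fold */
};

/* Same logic as calc_load_fold_active() with adjust == 0. */
long fold_active(struct fake_rq *rq)
{
    assert(rq != NULL);
    long nr_active = rq->nr_running + rq->nr_uninterruptible;
    long delta = nr_active - rq->calc_load_active;
    rq->calc_load_active = nr_active;
    return delta;
}

/* Fold every CPU into a global counter and return its new value. */
long fold_all(struct fake_rq *rqs, size_t ncpu, long global_tasks)
{
    assert(rqs != NULL);
    for (size_t i = 0; i < ncpu; i++)
        global_tasks += fold_active(&rqs[i]);
    return global_tasks;
}
```

Each CPU contributes only the change since its last fold, so the global counter tracks the sum of nr_running and nr_uninterruptible over all CPUs without rescanning every runqueue.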

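3: How the data is computed

With the data source covered, let's walk through how the data is computed.

The first half of this logic involves the low-level high-resolution timer module, which I do not know well. I will only introduce it briefly; if you are interested, study it on your own. (The source file is tick-sched.c; PlantUML does not support a '-' in a class name.)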

3.1: tick_sched_timer

/*
 * High resolution timer specific code
 */
 // Check whether the kernel has high-resolution timers enabled: CONFIG_HIGH_RES_TIMERS=y
#ifdef CONFIG_HIGH_RES_TIMERS
/*
 * We rearm the timer until we get disabled by the idle code.
 * Called with interrupts disabled.
 */
 // tick_sched_timer is the expiry callback of the high-resolution tick timer; it runs at the end of every tick period
static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
{
   struct tick_sched *ts =
      container_of(timer, struct tick_sched, sched_timer);
   struct pt_regs *regs = get_irq_regs();
   ktime_t now = ktime_get();
   tick_sched_do_timer(now);
    ...
   return HRTIMER_RESTART;
}

3.2: calc_global_load

I skip the timer functions in between. They are beyond the scope of this article, and I do not fully understand their logic.

/*
 * calc_load - update the avenrun load estimates 10 ticks after the
 * CPUs have updated calc_load_tasks.
 *
 * Called from the global timer code.
 */
void calc_global_load(unsigned long ticks)
{
   long active, delta;
    // The same kind of timestamp as before, plus 10 ticks, so the total interval is 5 s + 10 ticks
   if (time_before(jiffies, calc_load_update + 10))
      return;
   /*
    * Fold the 'old' idle-delta to include all NO_HZ cpus.
    */
    // Pick up the task counts that CPUs idling in NO_HZ mode missed
   delta = calc_load_fold_idle();
   if (delta)
      atomic_long_add(delta, &calc_load_tasks); // update the global counter
   active = atomic_long_read(&calc_load_tasks); // atomically read the global counter filled in earlier
   active = active > 0 ? active * FIXED_1 : 0; // convert to fixed point by multiplying by FIXED_1
   avenrun[0] = calc_load(avenrun[0], EXP_1, active); // 1-minute load
   avenrun[1] = calc_load(avenrun[1], EXP_5, active); // 5-minute load
   avenrun[2] = calc_load(avenrun[2], EXP_15, active); // 15-minute load
   calc_load_update += LOAD_FREQ; // advance the next update time
   /*
    * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
    */
    // Having folded the NO_HZ task counts, also catch up on the LOAD_FREQ intervals missed while idle; otherwise the averages would be off.
   calc_global_nohz();
}

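A NO_HZ mode shows up here. It is a tickless mode tied to idle CPUs, and I describe it separately later. Next comes the rule used to calculate the load.

3.3: The calculation rule, calc_load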

/*
 * a1 = a0 * e + a * (1 - e)
 */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
   unsigned long newload;
   newload = load * exp + active * (FIXED_1 - exp);
   if (active >= load)
      newload += FIXED_1-1;
   return newload / FIXED_1;
}

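The comment states the calculation rule clearly, and it is not complicated. Overall this matches what man proc says: the load average counts nr_running plus nr_uninterruptible. Both values come from struct rq as used in core.c; rq is one of the key data structures of the per-CPU run queue.

Analyzing the problem

Back to the original question: why does our device show a load of 70+ without grinding to a halt? The code above does not answer that directly, but once the logic is understood the rest is simple.

1: I printed the nr_running and nr_uninterruptible task counts. nr_running was normal; the problem was the nr_uninterruptible count.
2: If nr_uninterruptible is the culprit, are that many tasks really waiting for I/O on our device? If they were, the device would be very sluggish. I captured a systrace and everything looked normal.
3: At this point I had to turn to a search engine. Searching for nr_uninterruptible turned up a few clues.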
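To see how this rule behaves over time, here is a small user-space simulation. calc_load uses the same arithmetic as the kernel function above, the EXP_* constants are the 4.9 header values as I read them, and simulate is a hypothetical helper that feeds in a constant number of active tasks once per 5-second sample.

```c
#include <assert.h>

#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point */
#define EXP_1    1884UL            /* 1/exp(5s/1min) in fixed point */
#define EXP_5    2014UL            /* 1/exp(5s/5min) */
#define EXP_15   2037UL            /* 1/exp(5s/15min) */

/* Same arithmetic as the kernel's calc_load(): a1 = a0 * e + a * (1 - e). */
unsigned long calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    unsigned long newload = load * exp + active * (FIXED_1 - exp);
    if (active >= load)
        newload += FIXED_1 - 1;    /* round up while the load is rising */
    return newload / FIXED_1;
}

/* Apply `samples` updates with a constant number of active tasks
 * and return the resulting fixed-point load. */
unsigned long simulate(unsigned long load, unsigned long exp,
                       unsigned long tasks, unsigned int samples)
{
    assert(exp < FIXED_1);
    unsigned long active = tasks * FIXED_1;
    while (samples-- > 0)
        load = calc_load(load, exp, active);
    return load;
}
```

Starting from zero with one active task, twelve samples (one minute) bring the 1-minute value to roughly 0.63, the usual behaviour of an exponential moving average. With 70 tasks held constantly in the R or D state, the value settles at exactly 70 * FIXED_1, no matter how little CPU time those tasks actually use.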
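One way to cross-check the uninterruptible count from user space is to look at the state letter in /proc/[pid]/stat. This is only a rough sketch and not how I collected the numbers above: stat_state and count_in_state are hypothetical helpers, and the scan visits processes only, while the load average also counts individual threads.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Extract the state letter from a /proc/[pid]/stat line ("pid (comm) state ...").
 * comm may contain spaces or ')', so search from the last ')'. Returns 0 on error. */
char stat_state(const char *line)
{
    assert(line != NULL);
    const char *p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Count processes under /proc whose state letter equals `state`.
 * Returns -1 if /proc cannot be opened. Threads listed under
 * /proc/[pid]/task are not visited in this sketch. */
int count_in_state(char state)
{
    DIR *dir = opendir("/proc");
    if (dir == NULL)
        return -1;
    int count = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;           /* not a PID directory */
        char path[300], buf[512];
        snprintf(path, sizeof path, "/proc/%s/stat", de->d_name);
        FILE *fp = fopen(path, "r");
        if (fp == NULL)
            continue;           /* the process may have exited already */
        if (fgets(buf, sizeof buf, fp) != NULL && stat_state(buf) == state)
            count++;
        fclose(fp);
    }
    closedir(dir);
    return count;
}
```

Calling count_in_state('D') a few times in a row shows whether a large group of tasks stays parked in the D state, which is what would keep the load average high on an otherwise responsive device.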

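Findings in brief

First, traditional UNIX systems did not count uninterruptible tasks in the load average. Linux began counting them after it was argued that a load average which ignores tasks waiting for I/O does not reflect the real load on the system.

Later, in articles by several Linux experts, I read that when an NFS file system runs into trouble, every thread accessing that file system is put into the uninterruptible state and therefore counted in nr_uninterruptible. This area is very close to kernel internals. (PS: if you can recommend a good kernel book, please do.)

Conclusion: because the nr_uninterruptible count is abnormal, the load average does not reflect the real state of the device.

Takeaways and summary

1: The HZ mentioned in the comment of scheduler_tick is the frequency of the periodic timer interrupt (the tick). It is tied to the kernel options CONFIG_HZ_250, CONFIG_HZ_1000 and so on. For example, CONFIG_HZ_1000=y gives CONFIG_HZ=1000, which means the kernel receives 1000 timer interrupts per second, one every 1 s / 1000 = 1 ms. (CONFIG_HZ=250 is a common setting.)
2: jiffies is the number of timer interrupts (ticks) counted so far; one jiffy lasts 1 s / HZ.
3: The rq structure is too long to paste in full. It is defined in kernel/sched/sched.h; have a look if you are interested.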

   struct rq *rq = cpu_rq(cpu);
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
   /* runqueue lock: */
   raw_spinlock_t lock;
   /*
    * nr_running and cpu_load should be in the same cacheline because
    * remote CPUs use both these fields when doing load calculation.
    */
   unsigned int nr_running; // counted in the load average
#ifdef CONFIG_NUMA_BALANCING
   unsigned int nr_numa_running;
   unsigned int nr_preferred_running;
#endif
   #define CPU_LOAD_IDX_MAX 5
   unsigned long cpu_load[CPU_LOAD_IDX_MAX];
   unsigned int misfit_task;
#ifdef CONFIG_NO_HZ_COMMON
#ifdef CONFIG_SMP
   unsigned long last_load_update_tick;
#endif /* CONFIG_SMP */
   unsigned long nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_NO_HZ_FULL
   unsigned long last_sched_tick;
#endif
#ifdef CONFIG_CPU_QUIET
   /* time-based average load */
   u64 nr_last_stamp;
   u64 nr_running_integral;
   seqcount_t ave_seqcnt;
#endif
   /* capture load from *all* tasks on this cpu: */
   struct load_weight load;
   unsigned long nr_load_updates;
   u64 nr_switches;
   struct cfs_rq cfs;
   struct rt_rq rt;
   struct dl_rq dl;
#ifdef CONFIG_FAIR_GROUP_SCHED
   /* list of leaf cfs_rq on this cpu: */
   struct list_head leaf_cfs_rq_list;
   struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */
   /*
    * This is part of a global counter where only the total sum
    * over all CPUs matters. A task can increase this counter on
    * one CPU and if it got migrated afterwards it may decrease
    * it on another CPU. Always updated under the runqueue lock:
    */
   unsigned long nr_uninterruptible; // counted in the load average
   struct task_struct *curr, *idle, *stop;
   unsigned long next_balance;
   struct mm_struct *prev_mm;
   unsigned int clock_skip_update;
   u64 clock;
   u64 clock_task;
   atomic_t nr_iowait;
#ifdef CONFIG_SMP
   struct root_domain *rd;
   struct sched_domain *sd;
   unsigned long cpu_capacity;
   unsigned long cpu_capacity_orig;
   struct callback_head *balance_callback;
   unsigned char idle_balance;
   /* For active balancing */
   int active_balance;
   int push_cpu;
   struct task_struct *push_task;
   struct cpu_stop_work active_balance_work;
   /* cpu of this runqueue: */
   int cpu;
   int online;
    ...
};

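4: High-resolution timers give the kernel timers with nanosecond-level granularity (the precision actually achieved depends on the clock hardware). Kernel config: CONFIG_HIGH_RES_TIMERS=y
5: NO_HZ means the periodic timer tick is no longer delivered while a CPU is idle, which lowers the device's power consumption. Kernel config: CONFIG_NO_HZ=y and CONFIG_NO_HZ_IDLE=y. Conversely, if the device is not sensitive to power consumption and runs on external power, this mode can be turned off to improve performance.
6: Extracting the kernel config on Android: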

adb pull /proc/config.gz .
