  1. Oct 06, 2021
    • bpf: Exempt CAP_BPF from checks against bpf_jit_limit · 59efda50
      Lorenz Bauer authored
      
      [ Upstream commit 8a98ae12 ]
      
      When introducing CAP_BPF, bpf_jit_charge_modmem() was not changed to
      treat programs with CAP_BPF as privileged for the purpose of JIT
      memory allocation. This means that a process without CAP_BPF can
      exhaust the JIT memory limit and thereby block a process with CAP_BPF
      from loading a program.
      
      Fix this by checking bpf_capable() in bpf_jit_charge_modmem().
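
      For reference, a minimal sketch of the change; the shape of
      bpf_jit_charge_modmem() below follows the upstream sources of that
      era, and the exact 5.10 context may differ slightly:

        /* kernel/bpf/core.c */
        int bpf_jit_charge_modmem(u32 pages)
        {
                if (atomic_long_add_return(pages, &bpf_jit_current) >
                    (bpf_jit_limit >> PAGE_SHIFT)) {
                        /* Was !capable(CAP_SYS_ADMIN): bpf_capable() also
                         * accepts CAP_BPF, exempting it from the limit. */
                        if (!bpf_capable()) {
                                atomic_long_sub(pages, &bpf_jit_current);
                                return -EPERM;
                        }
                }
                return 0;
        }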
      
      Fixes: 2c78ee89 ("bpf: Implement CAP_BPF")
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922111153.19843-1-lmb@cloudflare.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      59efda50
    • bpf: Handle return value of BPF_PROG_TYPE_STRUCT_OPS prog · d93f6558
      Hou Tao authored
      
      [ Upstream commit 356ed649 ]
      
      Currently, if a function pointer in struct_ops has a return value, its
      caller will get a random return value from it, because the return
      value of the related BPF_PROG_TYPE_STRUCT_OPS prog is simply dropped.

      So add a new flag, BPF_TRAMP_F_RET_FENTRY_RET, to tell the bpf
      trampoline to save and return the return value of the struct_ops prog
      if the ret_size of the function pointer is greater than 0. Also
      restrict the flag to be used alone.
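
      A sketch of how the flag is plumbed through, following the upstream
      patch (names as upstream; the backport context may differ):

        /* kernel/bpf/bpf_struct_ops.c: ask the trampoline to save and
         * return the prog's return value when the func ptr returns one. */
        flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0;
        return arch_prepare_bpf_trampoline(NULL, image, image_end,
                                           model, flags, tprogs, NULL);

        /* arch/x86/net/bpf_jit_comp.c: enforce that the flag is used alone. */
        if (WARN_ON_ONCE((flags & BPF_TRAMP_F_RET_FENTRY_RET) &&
                         (flags & ~BPF_TRAMP_F_RET_FENTRY_RET)))
                return -EINVAL;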
      
      Fixes: 85d33df3 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210914023351.3664499-1-houtao1@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d93f6558
    • KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest · 249e5e5a
      Sean Christopherson authored
      
      commit 8646e536 upstream.
      
      Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
      transferring to a KVM guest, which is roughly equivalent to an exit to
      userspace and processes many of the same pending actions.  While the
      task cannot be in an rseq critical section, as the KVM path is
      reachable only via ioctl(KVM_RUN), the side effects that apply to rseq
      outside of a critical section still apply, e.g. the current CPU needs
      to be updated if the task is migrated.
      
      Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
      and other badness in userspace VMMs that use rseq in combination with KVM,
      e.g. due to the CPU ID being stale after task migration.
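
      A sketch of the fix, assuming the 5.10 layout of
      xfer_to_guest_mode_work() in kernel/entry/kvm.c:

        if (ti_work & _TIF_NOTIFY_RESUME) {
                clear_thread_flag(TIF_NOTIFY_RESUME);
                tracehook_notify_resume(NULL);
                /* Added: also run rseq's handler so e.g. the CPU ID
                 * seen by userspace is refreshed after migration. */
                rseq_handle_notify_resume(NULL, NULL);
        }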
      
      Fixes: 72c3c0fe ("x86/kvm: Use generic xfer to guest work function")
      Reported-by: Peter Foley <pefoley@google.com>
      Bisected-by: Doug Evans <dje@google.com>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901203030.1292304-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [sean: Resolve benign conflict due to unrelated access_ok() check in 5.10]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      249e5e5a
    • cpufreq: schedutil: Use kobject release() method to free sugov_tunables · a7d4fc84
      Kevin Hao authored
      
      [ Upstream commit e5c6b312 ]
      
      The struct sugov_tunables is protected by the kobject, so we can't free
      it directly. Otherwise we would get a call trace like this:
        ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x30
        WARNING: CPU: 3 PID: 720 at lib/debugobjects.c:505 debug_print_object+0xb8/0x100
        Modules linked in:
        CPU: 3 PID: 720 Comm: a.sh Tainted: G        W         5.14.0-rc1-next-20210715-yocto-standard+ #507
        Hardware name: Marvell OcteonTX CN96XX board (DT)
        pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
        pc : debug_print_object+0xb8/0x100
        lr : debug_print_object+0xb8/0x100
        sp : ffff80001ecaf910
        x29: ffff80001ecaf910 x28: ffff00011b10b8d0 x27: ffff800011043d80
        x26: ffff00011a8f0000 x25: ffff800013cb3ff0 x24: 0000000000000000
        x23: ffff80001142aa68 x22: ffff800011043d80 x21: ffff00010de46f20
        x20: ffff800013c0c520 x19: ffff800011d8f5b0 x18: 0000000000000010
        x17: 6e6968207473696c x16: 5f72656d6974203a x15: 6570797420746365
        x14: 6a626f2029302065 x13: 303378302f307830 x12: 2b6e665f72656d69
        x11: ffff8000124b1560 x10: ffff800012331520 x9 : ffff8000100ca6b0
        x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 0000000000000001
        x5 : ffff800011d8c000 x4 : ffff800011d8c740 x3 : 0000000000000000
        x2 : ffff0001108301c0 x1 : ab3c90eedf9c0f00 x0 : 0000000000000000
        Call trace:
         debug_print_object+0xb8/0x100
         __debug_check_no_obj_freed+0x1c0/0x230
         debug_check_no_obj_freed+0x20/0x88
         slab_free_freelist_hook+0x154/0x1c8
         kfree+0x114/0x5d0
         sugov_exit+0xbc/0xc0
         cpufreq_exit_governor+0x44/0x90
         cpufreq_set_policy+0x268/0x4a8
         store_scaling_governor+0xe0/0x128
         store+0xc0/0xf0
         sysfs_kf_write+0x54/0x80
         kernfs_fop_write_iter+0x128/0x1c0
         new_sync_write+0xf0/0x190
         vfs_write+0x2d4/0x478
         ksys_write+0x74/0x100
         __arm64_sys_write+0x24/0x30
         invoke_syscall.constprop.0+0x54/0xe0
         do_el0_svc+0x64/0x158
         el0_svc+0x2c/0xb0
         el0t_64_sync_handler+0xb0/0xb8
         el0t_64_sync+0x198/0x19c
        irq event stamp: 5518
        hardirqs last  enabled at (5517): [<ffff8000100cbd7c>] console_unlock+0x554/0x6c8
        hardirqs last disabled at (5518): [<ffff800010fc0638>] el1_dbg+0x28/0xa0
        softirqs last  enabled at (5504): [<ffff8000100106e0>] __do_softirq+0x4d0/0x6c0
        softirqs last disabled at (5483): [<ffff800010049548>] irq_exit+0x1b0/0x1b8
      
      So split the original sugov_tunables_free() into two functions:
      sugov_clear_global_tunables(), which only clears global_tunables, and
      a new sugov_tunables_free(), which is used as the kobj_type::release
      callback to free the sugov_tunables safely.
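
      A sketch of the split, following the upstream patch
      (kernel/sched/cpufreq_schedutil.c):

        static void sugov_clear_global_tunables(void)
        {
                if (!have_governor_per_policy())
                        global_tunables = NULL;
        }

        /* kobj_type::release: called by the kobject core only after the
         * last reference is dropped, so the kfree() is now safe. */
        static void sugov_tunables_free(struct kobject *kobj)
        {
                struct gov_attr_set *attr_set = to_gov_attr_set(kobj);

                kfree(to_sugov_tunables(attr_set));
        }

        static struct kobj_type sugov_tunables_ktype = {
                .default_attrs = sugov_attributes,
                .sysfs_ops = &governor_sysfs_ops,
                .release = &sugov_tunables_free,
        };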
      
      Fixes: 9bdcb44e ("cpufreq: schedutil: New governor based on scheduler utilization data")
      Cc: 4.7+ <stable@vger.kernel.org> # 4.7+
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a7d4fc84
  2. Sep 30, 2021
    • bpf: Add oversize check before call kvcalloc() · 6345a0be
      Bixuan Cui authored
      
      [ Upstream commit 0e6491b5 ]
      
      Commit 7661809d ("mm: don't allow oversized kvmalloc() calls") added
      an oversize check. When the allocation is larger than what kmalloc()
      supports, the following warning is triggered:
      
      WARNING: CPU: 0 PID: 8408 at mm/util.c:597 kvmalloc_node+0x108/0x110 mm/util.c:597
      Modules linked in:
      CPU: 0 PID: 8408 Comm: syz-executor221 Not tainted 5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:kvmalloc_node+0x108/0x110 mm/util.c:597
      Call Trace:
       kvmalloc include/linux/mm.h:806 [inline]
       kvmalloc_array include/linux/mm.h:824 [inline]
       kvcalloc include/linux/mm.h:829 [inline]
       check_btf_line kernel/bpf/verifier.c:9925 [inline]
       check_btf_info kernel/bpf/verifier.c:10049 [inline]
       bpf_check+0xd634/0x150d0 kernel/bpf/verifier.c:13759
       bpf_prog_load kernel/bpf/syscall.c:2301 [inline]
       __sys_bpf+0x11181/0x126e0 kernel/bpf/syscall.c:4587
       __do_sys_bpf kernel/bpf/syscall.c:4691 [inline]
       __se_sys_bpf kernel/bpf/syscall.c:4689 [inline]
       __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:4689
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
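
      The fix adds an explicit bound before the allocation; a sketch,
      assuming the check_btf_line() layout of that era:

        /* kernel/bpf/verifier.c, check_btf_line() */
        nr_linfo = attr->line_info_cnt;
        if (!nr_linfo)
                return 0;
        /* Added: reject counts that kvmalloc() would refuse anyway,
         * before they can trigger the oversize warning. */
        if (nr_linfo > INT_MAX / sizeof(struct bpf_line_info))
                return -EINVAL;

        linfo = kvcalloc(nr_linfo, sizeof(struct bpf_line_info),
                         GFP_KERNEL | __GFP_NOWARN);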
      
      Reported-by: syzbot+f3e749d4c662818ae439@syzkaller.appspotmail.com
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210911005557.45518-1-cuibixuan@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6345a0be
    • blktrace: Fix uaf in blk_trace access after removing by sysfs · 3815fe73
      Zhihao Cheng authored
      
      [ Upstream commit 5afedf67 ]
      
      There is a use-after-free problem triggered by the following process:
      
            P1(sda)				P2(sdb)
      			echo 0 > /sys/block/sdb/trace/enable
      			  blk_trace_remove_queue
      			    synchronize_rcu
      			    blk_trace_free
      			      relay_close
      rcu_read_lock
      __blk_add_trace
        trace_note_tsk
        (Iterate running_trace_list)
      			        relay_close_buf
      				  relay_destroy_buf
      				    kfree(buf)
          trace_note(sdb's bt)
            relay_reserve
        buf->offset <- nullptr dereference (use-after-free) !!!
      rcu_read_unlock
      
      [  502.714379] BUG: kernel NULL pointer dereference, address:
      0000000000000010
      [  502.715260] #PF: supervisor read access in kernel mode
      [  502.715903] #PF: error_code(0x0000) - not-present page
      [  502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0
      [  502.717252] Oops: 0000 [#1] SMP
      [  502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360
      [  502.732872] Call Trace:
      [  502.733193]  __blk_add_trace.cold+0x137/0x1a3
      [  502.733734]  blk_add_trace_rq+0x7b/0xd0
      [  502.734207]  blk_add_trace_rq_issue+0x54/0xa0
      [  502.734755]  blk_mq_start_request+0xde/0x1b0
      [  502.735287]  scsi_queue_rq+0x528/0x1140
      ...
      [  502.742704]  sg_new_write.isra.0+0x16e/0x3e0
      [  502.747501]  sg_ioctl+0x466/0x1100
      
      Reproduce method:
        ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
        ioctl(/dev/sda, BLKTRACESTART)
        ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
        ioctl(/dev/sdb, BLKTRACESTART)
      
        echo 0 > /sys/block/sdb/trace/enable &
        // Add delay(mdelay/msleep) before kernel enters blk_trace_free()
      
        ioctl$SG_IO(/dev/sda, SG_IO, ...)
        // Enters trace_note_tsk() after blk_trace_free() returned
        // Use mdelay in rcu region rather than msleep(which may schedule out)
      
      Fix this by removing blk_trace from the running_list before calling
      blk_trace_free() from sysfs, if blk_trace is in the Blktrace_running
      state.
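
      A sketch of the fix in blk_trace_remove_queue(), following the
      upstream patch (list and lock names as in kernel/trace/blktrace.c):

        /* Added: take a running trace off running_trace_list before it
         * is freed, so trace_note_tsk() iterating the list under
         * rcu_read_lock() can no longer reach the freed buffers. */
        if (bt->trace_state == Blktrace_running) {
                bt->trace_state = Blktrace_stopped;
                spin_lock_irq(&running_trace_lock);
                list_del_init(&bt->running_list);
                spin_unlock_irq(&running_trace_lock);
                relay_flush(bt->rchan);
        }

        put_probe_ref();
        synchronize_rcu();
        blk_trace_free(bt);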
      
      Fixes: c71a8961 ("blktrace: add ftrace plugin")
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3815fe73
  3. Sep 15, 2021
    • bpf: Fix possible out of bound write in narrow load handling · b0491ab7
      Andrey Ignatov authored
      [ Upstream commit d7af7e49 ]
      
      Fix a verifier bug found by smatch static checker in [0].
      
      This problem has never been seen in prod, to the best of my
      knowledge. Fixing it still seems to be a good idea, since it's hard
      to say for sure whether it's possible to have a scenario where a
      combination of convert_ctx_access() and a narrow load would lead to
      an out-of-bounds write.
      
      When a narrow load is handled, one or two new instructions are added
      to the insn_buf array, but previously the only check was that

        cnt >= ARRAY_SIZE(insn_buf)

      which only guarantees that it is safe to add one new instruction to
      insn_buf[cnt++]; a second add can write out of bounds, and this is
      what can happen if `shift` is set.
      
      Fix it by making sure that if the BPF_RSH instruction has to be added in
      addition to BPF_AND then there is enough space for two more instructions
      in insn_buf.
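
      For reference, a sketch of the added check, matching the verifier
      code quoted in the report below:

        if (is_narrower_load && size < target_size) {
                u8 shift = bpf_ctx_narrow_access_offset(
                        off, size, size_default) * 8;
                /* Added: the BPF_RSH + BPF_AND pair needs two slots,
                 * so bail out if insn_buf cannot hold both. */
                if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) {
                        verbose(env, "bpf verifier narrow ctx load misconfigured\n");
                        return -EINVAL;
                }
                /* ... existing BPF_ALU32/BPF_ALU64 emission follows ... */
        }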
      
      The full report [0] is below:
      
      kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
      kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
      
      kernel/bpf/verifier.c
          12282
          12283 			insn->off = off & ~(size_default - 1);
          12284 			insn->code = BPF_LDX | BPF_MEM | size_code;
          12285 		}
          12286
          12287 		target_size = 0;
          12288 		cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
          12289 					 &target_size);
          12290 		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      Bounds check.
      
          12291 		    (ctx_field_size && !target_size)) {
          12292 			verbose(env, "bpf verifier is misconfigured\n");
          12293 			return -EINVAL;
          12294 		}
          12295
          12296 		if (is_narrower_load && size < target_size) {
          12297 			u8 shift = bpf_ctx_narrow_access_offset(
          12298 				off, size, size_default) * 8;
          12299 			if (ctx_field_size <= 4) {
          12300 				if (shift)
          12301 					insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
                                                               ^^^^^
      increment beyond end of array
      
          12302 									insn->dst_reg,
          12303 									shift);
      --> 12304 				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
                                                       ^^^^^
      out of bounds write
      
          12305 								(1 << size * 8) - 1);
          12306 			} else {
          12307 				if (shift)
          12308 					insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
          12309 									insn->dst_reg,
          12310 									shift);
          12311 				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
                                              ^^^^^^^^^^^^^^^
      Same.
      
          12312 								(1ULL << size * 8) - 1);
          12313 			}
          12314 		}
          12315
          12316 		new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
          12317 		if (!new_prog)
          12318 			return -ENOMEM;
          12319
          12320 		delta += cnt - 1;
          12321
          12322 		/* keep walking new program and skip insns we just inserted */
          12323 		env->prog = new_prog;
          12324 		insn      = new_prog->insnsi + i + delta;
          12325 	}
          12326
          12327 	return 0;
          12328 }
      
      [0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/

      v1->v2:
      - clarify that problem was only seen by static checker but not in prod;
      
      Fixes: 46f53a65 ("bpf: Allow narrow loads with offset > 0")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b0491ab7
    • locking/lockdep: Mark local_lock_t · d5462a63
      Peter Zijlstra authored
      
      [ Upstream commit dfd5e3f5 ]
      
      The local_lock_t's are special because they cannot form IRQ
      inversions; make sure we can tell them apart from the rest of the
      locks.
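
      A sketch of the upstream mechanism: lockdep gains a per-map lock
      type, and local_lock_t registers itself with it (identifiers per the
      upstream patch; the backport may differ):

        /* include/linux/lockdep_types.h */
        enum lockdep_lock_type {
                LD_LOCK_NORMAL = 0,     /* normal, catch all */
                LD_LOCK_PERCPU,         /* percpu */
                LD_LOCK_MAX,
        };

        /* local_lock initialization passes the new type: */
        lockdep_init_map_type(&lock->dep_map, name, key, 0,
                              LD_WAIT_CONFIG, LD_WAIT_INV,
                              LD_LOCK_PERCPU);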
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d5462a63
    • PM: cpu: Make notifier chain use a raw_spinlock_t · 4b7874a3
      Valentin Schneider authored
      
      [ Upstream commit b2f6662a ]
      
      Invoking atomic_notifier_chain_notify() requires acquiring a spinlock_t,
      which can block under CONFIG_PREEMPT_RT. Notifications for members of the
      cpu_pm notification chain will be issued by the idle task, which can never
      block.
      
      Making *all* atomic_notifiers use a raw_spinlock is too big of a hammer, as
      only notifications issued by the idle task are problematic.
      
      Special-case cpu_pm_notifier_chain by kludging a raw_notifier and
      raw_spinlock_t together, matching the atomic_notifier behavior with a
      raw_spinlock_t.
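
      A sketch of the kludge, following the upstream patch
      (kernel/cpu_pm.c):

        /* A raw_notifier chain protected by a raw_spinlock_t, so the
         * idle task can publish notifications without sleeping. */
        static struct {
                struct raw_notifier_head chain;
                raw_spinlock_t lock;
        } cpu_pm_notifier = {
                .chain = RAW_NOTIFIER_INIT(cpu_pm_notifier.chain),
                .lock  = __RAW_SPIN_LOCK_UNLOCKED(cpu_pm_notifier.lock),
        };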
      
      Fixes: 70d93298 ("notifier: Fix broken error handling pattern")
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4b7874a3
    • cgroup/cpuset: Fix violation of cpuset locking rule · 10dfcfda
      Waiman Long authored
      
      [ Upstream commit 6ba34d3c ]
      
      The cpuset fields that manage partition root state do not strictly
      follow the cpuset locking rule that an update to a cpuset has to be
      done with both the callback_lock and cpuset_mutex held. This is now
      fixed by making sure that the locking rule is upheld.
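
      For illustration only, the locking rule being enforced looks roughly
      like this (a hypothetical sketch, not the actual diff; in this era
      cpuset_mutex is implemented as the percpu rwsem cpuset_rwsem):

        percpu_down_write(&cpuset_rwsem);
        spin_lock_irq(&callback_lock);
        cs->partition_root_state = new_prs;     /* any partition update */
        spin_unlock_irq(&callback_lock);
        percpu_up_write(&cpuset_rwsem);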
      
      Fixes: 3881b861 ("cpuset: Add an error state to cpuset.sched.partition")
      Fixes: 4b842da2 ("cpuset: Make CPU hotplug work with partition")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      10dfcfda
    • cgroup/cpuset: Miscellaneous code cleanup · cbc97661
      Waiman Long authored
      
      [ Upstream commit 0f3adb8a ]
      
      Use more descriptive variable names for update_prstate(), remove
      unnecessary code and fix some typos. There is no functional change.
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      cbc97661
    • PM: EM: Increase energy calculation precision · d6337dfd
      Lukasz Luba authored
      
      [ Upstream commit 7fcc17d0 ]
      
      The Energy Model (EM) provides useful information about device power
      in each performance state to other subsystems, like the Energy Aware
      Scheduler (EAS). The energy calculation in EAS performs arithmetic
      operations based on the EM's em_cpu_energy(). The current
      implementation of that function uses em_perf_state::cost as a
      pre-computed cost coefficient equal to:
      cost = power * max_frequency / frequency
      The 'power' is expressed in milli-Watts (or in an abstract scale).

      There are corner cases when the EAS energy calculations for two
      Performance Domains (PDs) return the same value. EAS compares these
      values to choose the smaller one, and it might happen that the values
      are equal only due to rounding error. In such a scenario we need
      better resolution, e.g. 1000 times better. To provide this, increase
      the resolution of em_perf_state::cost on 64-bit architectures. The
      cost of increasing the resolution on 32-bit is pretty high (64-bit
      division) and is not justified, since no new 32-bit big.LITTLE EAS
      systems are expected which would benefit from the higher resolution.

      This patch avoids the rounding-to-milli-Watt errors which might occur
      in the EAS energy estimation for each PD. The rounding error is
      common for small tasks which have small utilization values.
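
      A sketch of the resolution bump, following the upstream patch
      (kernel/power/energy_model.c):

        /* On 64-bit, scale the pre-computed cost by 1000 so that EAS
         * comparisons no longer collapse onto the same value. */
        #ifdef CONFIG_64BIT
        #define em_scale_power(p) ((p) * 1000)
        #else
        #define em_scale_power(p) (p)
        #endif

        /* ...and when filling the perf-state table: */
        table[i].cost = div64_u64(fmax * em_scale_power(table[i].power),
                                  table[i].frequency);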
      
      There are two places in the code where it makes a difference:
      1. In find_energy_efficient_cpu(), where we are searching for the
      best_delta. We might suffer there when two PDs return the same
      result, like in the example below.
      
      Scenario:
      A low-utilization system, e.g. ~200 sum_util for PD0 and ~220 for
      PD1. There are quite a few small tasks of ~10-15 util. These tasks
      would suffer from the rounding error. Such utilization values are
      typical when running games on Android. One of our partners has
      reported 5..10mA less battery drain when running with the increased
      resolution.
      
      Some details:
      We have two PDs: PD0 (big) and PD1 (little)
      Let's compare w/o patch set ('old') and w/ patch set ('new')
      We are comparing energy w/ task and w/o task placed in the PDs
      
      a) 'old' w/o patch set, PD0
      task_util = 13
      cost = 480
      sum_util_w/o_task = 215
      sum_util_w_task = 228
      scale_cpu = 1024
      energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100
      energy_w_task = 480 * 228 / 1024 = 106.87 => 106
      energy_diff = 106 - 100 = 6
      (this is equal to 'old' PD1's energy_diff in 'c)')
      
      b) 'new' w/ patch set, PD0
      task_util = 13
      cost = 480 * 1000 = 480000
      sum_util_w/o_task = 215
      sum_util_w_task = 228
      energy_w/o_task = 480000 * 215 / 1024 = 100781
      energy_w_task = 480000 * 228 / 1024  = 106875
      energy_diff = 106875 - 100781 = 6094
      (this is not equal to 'new' PD1's energy_diff in 'd)')
      
      c) 'old' w/o patch set, PD1
      task_util = 13
      cost = 160
      sum_util_w/o_task = 283
      sum_util_w_task = 296
      scale_cpu = 355
      energy_w/o_task = 160 * 283 / 355 = 127.55 => 127
      energy_w_task = 160 * 296 / 355 = 133.41 => 133
      energy_diff = 133 - 127 = 6
      (this is equal to 'old' PD0's energy_diff in 'a)')
      
      d) 'new' w/ patch set, PD1
      task_util = 13
      cost = 160 * 1000 = 160000
      sum_util_w/o_task = 283
      sum_util_w_task = 296
      scale_cpu = 355
      energy_w/o_task = 160000 * 283 / 355 = 127549
      energy_w_task = 160000 * 296 / 355 =   133408
      energy_diff = 133408 - 127549 = 5859
      (this is not equal to 'new' PD0's energy_diff in 'b)')
      
      2. Difference in the 6% energy margin filter at the end of
      find_energy_efficient_cpu(). With this patch the margin comparison also
      has better resolution, so it's possible to have better task placement
      thanks to that.
      
      Fixes: 27871f7a ("PM: Introduce an Energy Model management framework")
      Reported-by: CCJ Yeh <CCj.Yeh@mediatek.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d6337dfd
    • cgroup/cpuset: Fix a partition bug with hotplug · e0f3de15
      Waiman Long authored
      
      [ Upstream commit 15d428e6 ]
      
      In cpuset_hotplug_workfn(), the detection of whether the cpu list
      has been changed is done by comparing the effective cpus of the top
      cpuset with the cpu_active_mask. However, in the rare case that all
      the CPUs in subparts_cpus are offlined, the detection fails and the
      partition states are not updated correctly. Fix this by forcing the
      cpus_updated flag to true in this particular case.
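
      A sketch of the fix in cpuset_hotplug_workfn(), following the
      upstream patch:

        cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
        /*
         * Added: if hotplug just offlined all the CPUs in subparts_cpus,
         * the comparison above cannot see it, so force the update.
         */
        if (!cpus_updated && top_cpuset.nr_subparts_cpus)
                cpus_updated = true;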
      
      Fixes: 4b842da2 ("cpuset: Make CPU hotplug work with partition")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e0f3de15
    • bpf: Fix potential memleak and UAF in the verifier. · 389dfd11
      He Fengqing authored
      
      [ Upstream commit 75f0fc7b ]
      
      In bpf_patch_insn_data(), we first use bpf_patch_insn_single() to
      insert the new instructions, then use adjust_insn_aux_data() to
      adjust insn_aux_data. If the old env->prog does not have enough room
      for the newly inserted instructions, we use bpf_prog_realloc() to
      construct a new_prog and free the old env->prog.

      There are two errors here. First, if adjust_insn_aux_data() returns
      -ENOMEM, we should free the new_prog. Second, if
      adjust_insn_aux_data() returns -ENOMEM, bpf_patch_insn_data() will
      return NULL, and env->prog has already been freed in
      bpf_prog_realloc(), but bpf_check() will still use it.

      So in this patch, we make adjust_insn_aux_data() never fail. In
      bpf_patch_insn_data(), we first pre-allocate memory for the new
      insn_aux_data, then call bpf_patch_insn_single() to insert the new
      instructions, and finally call adjust_insn_aux_data() to adjust the
      insn_aux_data.
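
      A sketch of the reworked bpf_patch_insn_data(), following the
      upstream patch (error reporting trimmed):

        static struct bpf_prog *bpf_patch_insn_data(struct bpf_verifier_env *env,
                                                    u32 off,
                                                    const struct bpf_insn *patch,
                                                    u32 len)
        {
                struct bpf_insn_aux_data *new_data = NULL;
                struct bpf_prog *new_prog;

                /* Pre-allocate so adjust_insn_aux_data() below cannot
                 * fail once env->prog has been reallocated. */
                if (len > 1) {
                        new_data = vzalloc(array_size(env->prog->len + len - 1,
                                                      sizeof(struct bpf_insn_aux_data)));
                        if (!new_data)
                                return NULL;
                }

                new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
                if (IS_ERR(new_prog)) {
                        vfree(new_data);
                        return NULL;
                }
                adjust_insn_aux_data(env, new_data, new_prog, off, len);
                return new_prog;
        }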
      
      Fixes: 8041902d ("bpf: adjust insn_aux_data when patching insns")
      Signed-off-by: He Fengqing <hefengqing@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      389dfd11
    • genirq/timings: Fix error return code in irq_timings_test_irqs() · e9a902f8
      Zhen Lei authored
      
      [ Upstream commit 290fdc4b ]
      
      Return a negative error code from the error handling case instead of 0, as
      done elsewhere in this function.
      
      Fixes: f52da98d ("genirq/timings: Add selftest for irqs circular buffer")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e9a902f8
    • rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock · 497f3d9c
      Yanfei Xu authored
      
      [ Upstream commit dc87740c ]
      
      If rcu_print_task_stall() is invoked on an rcu_node structure that does
      not contain any tasks blocking the current grace period, it takes an
      early exit that fails to release that rcu_node structure's lock.  This
      results in a self-deadlock, which is detected by lockdep.
      
      To reproduce this bug:
      
      tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
      
      This will also result in other complaints, including RCU's scheduler
      hook complaining about blocking rather than preemption and an rcutorture
      writer stall.
      
      Only a partial RCU CPU stall warning message will be printed because of
      the self-deadlock.
      
      This commit therefore releases the lock on the rcu_print_task_stall()
      function's early exit path.
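
      A sketch of the fix (kernel/rcu/tree_stall.h), following the
      upstream patch:

        static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
                __releases(rnp->lock)
        {
                /* ... */
                if (!rcu_preempt_blocked_readers_cgp(rnp)) {
                        /* Added: drop rnp->lock on the early exit, too. */
                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                        return 0;
                }
                /* ... */
        }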
      
      Fixes: c583bcb8 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
      Tested-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      497f3d9c
    • rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees · ea5e5bc8
      Paul E. McKenney authored
      [ Upstream commit a649d25d ]
      
      This commit adds a number of lockdep_assert_irqs_disabled() calls
      to rcu_sched_clock_irq() and a number of the functions that it calls.
      The point of this is to help track down a situation where lockdep appears
      to be insisting that interrupts are enabled within these functions, which
      should only ever be invoked from the scheduling-clock interrupt handler.
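
      The pattern being added is simply (a sketch):

        void rcu_sched_clock_irq(int user)
        {
                /* This function and several callees now assert that
                 * interrupts are off, to pinpoint where lockdep thinks
                 * they get re-enabled. */
                lockdep_assert_irqs_disabled();
                /* ... */
        }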
      
      Link: https://lore.kernel.org/lkml/20201111133813.GA81547@elver.google.com/

      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ea5e5bc8
    • rcu: Fix to include first blocked task in stall warning · 527b56d7
      Yanfei Xu authored
      
      [ Upstream commit e6a901a4 ]
      
      The for loop in rcu_print_task_stall() always omits ts[0], which points
      to the first task blocking the stalled grace period.  This in turn fails
      to count this first task, which means that ndetected will be equal to
      zero when all CPUs have passed through their quiescent states and only
      one task is blocking the stalled grace period.  This zero value for
      ndetected will in turn result in an incorrect "All QSes seen" message:
      
      rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
      rcu:    Tasks blocked on level-1 rcu_node (CPUs 12-23):
              (detected by 15, t=6504 jiffies, g=164777, q=9011209)
      rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
      BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
      in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
      INFO: lockdep is turned off.
      Preemption disabled at:
      [<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
      CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
      5.12.2-yoctodev-standard #1
      Hardware name: Marvell OcteonTX CN96XX board (DT)
      Call trace:
       dump_backtrace+0x0/0x2cc
       show_stack+0x24/0x30
       dump_stack+0x110/0x188
       ___might_sleep+0x214/0x2d0
       __might_sleep+0x7c/0xe0
      
      This commit therefore fixes the loop to include ts[0].
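
      A sketch of the loop fix in rcu_print_task_stall(), following the
      upstream patch:

        /* Before: ts[0], the first blocked task, was never visited. */
        for (i--; i; i--) {
                t = ts[i];
                /* ... */
        }

        /* After: count down through ts[0] as well. */
        while (i) {
                t = ts[--i];
                /* ... */
        }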
      
      Fixes: c583bcb8 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
      Tested-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      527b56d7
    • sched: Fix UCLAMP_FLAG_IDLE setting · e6778e1b
      Quentin Perret authored
      
      [ Upstream commit ca4984a7 ]
      
      The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
      uclamp active task (that is, when buckets.tasks reaches 0 for all
      buckets) to maintain the last uclamp.max and prevent blocked util from
      suddenly becoming visible.
      
      However, there is an asymmetry in how the flag is set and cleared which
      can lead to having the flag set whilst there are active tasks on the rq.
      Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
      called at enqueue time, but set in uclamp_rq_dec_id() which is called
      both when dequeueing a task _and_ in the update_uclamp_active() path. As
      a result, when both uclamp_rq_{inc,dec}_id() are called from
      update_uclamp_active(), the flag ends up being set but not cleared,
      hence leaving the runqueue in a broken state.
      
      Fix this by clearing the flag in update_uclamp_active() as well.
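
      A sketch of the fix, following the upstream patch's
      uclamp_rq_reinc_id() helper:

        static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
                                              enum uclamp_id clamp_id)
        {
                if (!p->uclamp[clamp_id].active)
                        return;

                uclamp_rq_dec_id(rq, p, clamp_id);
                uclamp_rq_inc_id(rq, p, clamp_id);

                /* Added: clear the idle flag if we transiently reached
                 * zero active tasks on this rq. */
                if (clamp_id == UCLAMP_MAX &&
                    (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
                        rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
        }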
      
      Fixes: e496187d ("sched/uclamp: Enforce last task's UCLAMP_MAX")
      Reported-by: Rick Yiu <rickyiu@google.com>
      Signed-off-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Qais Yousef <qais.yousef@arm.com>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e6778e1b
    • sched/numa: Fix is_core_idle() · 718180c2
      Mika Penttilä authored
      
      [ Upstream commit 1c6829cf ]
      
      Use the loop variable instead of the function argument to test the
      other SMT siblings for idle.
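
      A sketch of the corrected function (kernel/sched/fair.c):

        static inline bool is_core_idle(int cpu)
        {
        #ifdef CONFIG_SCHED_SMT
                int sibling;

                for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                        if (cpu == sibling)
                                continue;

                        if (!idle_cpu(sibling)) /* was: idle_cpu(cpu) */
                                return false;
                }
        #endif
                return true;
        }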
      
      Fixes: ff7db0bf ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
      Signed-off-by: Mika Penttilä <mika.penttila@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Link: https://lkml.kernel.org/r/20210722063946.28951-1-mika.penttila@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      718180c2
    • hrtimer: Ensure timerfd notification for HIGHRES=n · 3d12ccec
      Thomas Gleixner authored
      
      [ Upstream commit 8c3b5e6e ]
      
      If high resolution timers are disabled, the timerfd notification
      about a 'clock was set' event does not happen for all cases which use
      clock_was_set_delayed(), because that function is a NOP for
      HIGHRES=n, which is wrong.

      Make clock_was_set_delayed() unconditionally available to fix that.
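
      A sketch of the header change, following the upstream patch
      (include/linux/hrtimer.h):

        /* Before: a stub for HIGHRES=n swallowed the notification. */
        #ifdef CONFIG_HIGH_RES_TIMERS
        extern void clock_was_set_delayed(void);
        #else
        static inline void clock_was_set_delayed(void) { }
        #endif

        /* After: always a real function, so timerfd is notified for
         * HIGHRES=n as well. */
        extern void clock_was_set_delayed(void);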
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3d12ccec
    • hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() · aadfa1d6
      Thomas Gleixner authored
      
      [ Upstream commit 627ef5ae ]
      
      If __hrtimer_start_range_ns() is invoked with an already armed
      hrtimer, the timer has to be canceled first and then added back. If
      the timer is the first expiring timer, then on removal the clockevent
      device is reprogrammed to the next expiring timer, so that the
      pending expiry does not fire needlessly.

      If the new expiry time ends up being the first expiry again, then the
      clock event device has to be reprogrammed again.

      Avoid this by checking whether the timer is the first to expire and,
      in that case, keep the timer on the current CPU and delay the
      reprogramming up to the point where the timer has been enqueued
      again.
      
      Reported-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      aadfa1d6
    • posix-cpu-timers: Force next expiration recalc after itimer reset · 13ccaef7
      Frederic Weisbecker authored
      
      [ Upstream commit 406dd42b ]
      
      When an itimer deactivates a previously armed expiration, it simply doesn't
      do anything. As a result the process wide cputime counter keeps running and
      the tick dependency stays set until it reaches the old ghost expiration
      value.
      
      This can be reproduced with the following snippet:
      
      	void trigger_process_counter(void)
      	{
      		struct itimerval n = {};
      
      		n.it_value.tv_sec = 100;
      		setitimer(ITIMER_VIRTUAL, &n, NULL);
      		n.it_value.tv_sec = 0;
      		setitimer(ITIMER_VIRTUAL, &n, NULL);
      	}
      
      Fix this by resetting the relevant base expiration. This is similar
      to disarming a timer.
      
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      13ccaef7
    • rcu/tree: Handle VM stoppage in stall detection · 4b680b3f
      Sergey Senozhatsky authored
      
      [ Upstream commit ccfc9dd6 ]
      
      The soft watchdog timer function checks if a virtual machine was
      suspended, in which case what looks like a lockup is in fact a false
      positive.

      This is what kvm_check_and_clear_guest_paused() does: it tests the
      guest's PVCLOCK_GUEST_STOPPED flag (which is set by the host), and if
      it's set then we need to touch all watchdogs and bail out.

      The watchdog timer function runs from IRQ, so the
      PVCLOCK_GUEST_STOPPED check works fine there.

      There is, however, one more watchdog that runs from IRQ and races
      with the watchdog timer function, yet is not aware of
      PVCLOCK_GUEST_STOPPED: the RCU stall detector.
      
      apic_timer_interrupt()
       smp_apic_timer_interrupt()
        hrtimer_interrupt()
         __hrtimer_run_queues()
          tick_sched_timer()
           tick_sched_handle()
            update_process_times()
             rcu_sched_clock_irq()
      
      This triggers RCU stalls on our devices during VM resume.
      
      If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU before
      watchdog_timer_fn()->kvm_check_and_clear_guest_paused(), then nothing
      on this VCPU touches the watchdogs, and RCU reads a stale gp stall
      timestamp together with the new jiffies value, which makes it think
      that RCU has stalled.
      
      Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
      don't report RCU stalls when we resume the VM.
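
      A sketch of the fix in check_cpu_stall() (kernel/rcu/tree_stall.h),
      following the upstream patch:

        /*
         * Added before reporting a stall: if the host stopped the VM,
         * this is not a stall, so bail out instead of complaining.
         */
        if (kvm_check_and_clear_guest_paused())
                return;

        /* They had a few time units to dump stack, so complain. */
        print_other_cpu_stall(gs2, gps);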
      
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4b680b3f
    • sched/deadline: Fix missing clock update in migrate_task_rq_dl() · 1cc05d71
      Dietmar Eggemann authored
      
      [ Upstream commit b4da13aa ]
      
      A missing clock update is causing the following warning:
      
      rq->clock_update_flags < RQCF_ACT_SKIP
      WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
      sub_running_bw.isra.0+0x190/0x1a0
      ...
      CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1
      Hardware name: WIWYNN Mt.Jade Server System
      B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
      1.06.20210526) 2021/05/26
      ...
      Call trace:
        sub_running_bw.isra.0+0x190/0x1a0
        migrate_task_rq_dl+0xf8/0x1e0
        set_task_cpu+0xa8/0x1f0
        try_to_wake_up+0x150/0x3d4
        wake_up_q+0x64/0xc0
        __up_write+0xd0/0x1c0
        up_write+0x4c/0x2b0
        cppc_set_perf+0x120/0x2d0
        cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
        __cpufreq_driver_target+0x74/0x140
        sugov_work+0x64/0x80
        kthread_worker_fn+0xe0/0x230
        kthread+0x138/0x140
        ret_from_fork+0x10/0x18
      
      The task causing this is the `cppc_fie` DL task introduced by
      commit 1eb5dde6 ("cpufreq: CPPC: Add support for frequency
      invariance").
      
      With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on
      slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm
      Server):
      
      DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter
      is in `non_contending` state, migrate_task_rq_dl() calls
      
        sub_running_bw()->__sub_running_bw()->cpufreq_update_util()->
        rq_clock()->assert_clock_updated()
      
      on p.
      
      Fix this by updating the clock for a non_contending task in
      migrate_task_rq_dl() before calling sub_running_bw().
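
      A sketch of the fix in migrate_task_rq_dl(), following the upstream
      patch (locking details vary between kernel versions):

        rq = task_rq(p);
        raw_spin_lock(&rq->lock);
        if (p->dl.dl_non_contending) {
                update_rq_clock(rq);    /* added: fresh clock before the
                                         * bandwidth update below */
                sub_running_bw(&p->dl, &rq->dl);
                p->dl.dl_non_contending = 0;
                /* ... */
        }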
      
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1cc05d71
    • sched/deadline: Fix reset_on_fork reporting of DL tasks · 3ebd7b38
      Quentin Perret authored
      
      [ Upstream commit f9509153 ]
      
      It is possible for sched_getattr() to incorrectly report the state of
      the reset_on_fork flag when called on a deadline task.
      
      Indeed, if the flag was set on a deadline task using sched_setattr()
      with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
      p->sched_reset_on_fork will be set, but __setscheduler() will bail out
      early, which means that the dl_se->flags will not get updated by
      __setscheduler_params()->__setparam_dl(). Consequently, if
      sched_getattr() is then called on the task, __getparam_dl() will
      override kattr.sched_flags with the now out-of-date copy in dl_se->flags
      and report the stale value to userspace.
      
      To fix this, make sure to only copy the flags that are relevant to
      sched_deadline to and from the dl_se->flags field.
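
      A sketch of the masking, following the upstream patch:

        /* kernel/sched/sched.h */
        #define SCHED_DL_FLAGS (SCHED_FLAG_RECLAIM | SCHED_FLAG_DL_OVERRUN | \
                                SCHED_FLAG_SUGOV)

        /* __setparam_dl(): only store DL-relevant flags... */
        dl_se->flags = attr->sched_flags & SCHED_DL_FLAGS;

        /* __getparam_dl(): ...and only report those back, preserving the
         * task's other bits (e.g. SCHED_FLAG_RESET_ON_FORK). */
        attr->sched_flags &= ~SCHED_DL_FLAGS;
        attr->sched_flags |= dl_se->flags;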
      
      Signed-off-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3ebd7b38