summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-09-09Merge tag 'for-5.15-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix max_inline mount option limit on 64k page system - lockdep fixes: - update bdev time in a safer way - move bdev put outside of sb write section when removing device - fix possible deadlock when mounting seed/sprout filesystem - zoned mode: fix split extent accounting - minor include fixup * tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix double counting of split ordered extent btrfs: fix lockdep warning while mounting sprout fs btrfs: delay blkdev_put until after the device remove btrfs: update the bdev time directly when closing btrfs: use correct header for div_u64 in misc.h btrfs: fix upper limit for max_inline for page size 64K
2021-09-09cifs: properly invalidate cached root handle when closing itEnzo Matsumiya
Cached root file was not being completely invalidated sometimes. Reproducing: - With a DFS share with 2 targets, one disabled and one enabled - start some I/O on the mount # while true; do ls /mnt/dfs; done - at the same time, disable the enabled target and enable the disabled one - wait for DFS cache to expire - on reconnect, the previous cached root handle should be invalid, but open_cached_dir_by_dentry() will still try to use it, but throws a use-after-free warning (kref_get()) Make smb2_close_cached_fid() invalidate all fields every time, but only send an SMB2_close() when the entry is still valid. Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-09Merge tag 'for-linus-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - Support for VMAP_STACK - Support for splice_write in hostfs - Fixes for virt-pci - Fixes for virtio_uml - Various fixes * tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: fix stub location calculation um: virt-pci: fix uapi documentation um: enable VMAP_STACK um: virt-pci: don't do DMA from stack hostfs: support splice_write um: virtio_uml: fix memory leak on init failures um: virtio_uml: include linux/virtio-uml.h lib/logic_iomem: fix sparse warnings um: make PCI emulation driver init/exit static
2021-09-09Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM development updates from Russell King: - Rename "mod_init" and "mod_exit" so that initcall debug output is actually useful (Randy Dunlap) - Update maintainers entries for linux-arm-kernel to indicate it is moderated for non-subscribers (Randy Dunlap) - Move install rules to arch/arm/Makefile (Masahiro Yamada) - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij) - Don't warn about atags_to_fdt() stack size (David Heidelberg) - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann) - Get rid of set_fs() usage (Arnd Bergmann) - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven) * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK() ARM: 9116/1: unified: Remove check for gcc < 4 ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning ARM: 9113/1: uaccess: remove set_fs() implementation ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault ARM: 9111/1: oabi-compat: rework fcntl64() emulation ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation ARM: 9107/1: syscall: always store thread_info->abi_syscall ARM: 9109/1: oabi-compat: add epoll_pwait handler ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs() ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault ARM: 9105/1: atags_to_fdt: don't warn about stack size ARM: 9103/1: Drop ARCH_NR_GPIOS definition ARM: 9102/1: move theinstall rules to arch/arm/Makefile ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific
2021-09-09Merge tag 's390-5.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Heiko Carstens: "Except for the xpram device driver removal it is all about fixes and cleanups. - Fix topology update on cpu hotplug, so notifiers see expected masks. This bug was uncovered with SCHED_CORE support. - Fix stack unwinding so that the correct number of entries are omitted like expected by common code. This fixes KCSAN selftests. - Add kmemleak annotation to stack_alloc to avoid false positive kmemleak warnings. - Avoid layering violation in common I/O code and don't unregister subchannel from child-drivers. - Remove xpram device driver for which no real use case exists since the kernel is 64 bit only. Also all hypervisors got required support removed in the meantime, which means the xpram device driver is dead code. - Fix -ENODEV handling of clp_get_state in our PCI code. - Enable KFENCE in debug defconfig. - Cleanup hugetlbfs s390 specific Kconfig dependency. - Quite a lot of trivial fixes to get rid of "W=1" warnings, and and other simple cleanups" * tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: hugetlbfs: s390 is always 64bit s390/ftrace: remove incorrect __va usage s390/zcrypt: remove incorrect kernel doc indicators scsi: zfcp: fix kernel doc comments s390/sclp: add __nonstring annotation s390/hmcdrv_ftp: fix kernel doc comment s390: remove xpram device driver s390/pci: read clp_list_pci_req only once s390/pci: fix clp_get_state() handling of -ENODEV s390/cio: fix kernel doc comment s390/ctrlchar: fix kernel doc comment s390/con3270: use proper type for tasklet function s390/cpum_cf: move array from header to C file s390/mm: fix kernel doc comments s390/topology: fix topology information when calling cpu hotplug notifiers s390/unwind: use current_frame_address() to unwind current task s390/configs: enable CONFIG_KFENCE in debug_defconfig s390/entry: make oklabel within CHKSTG macro local s390: add kmemleak annotation in stack_alloc() s390/cio: dont unregister subchannel from child-drivers
2021-09-09Merge branch 'work.gfs2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull gfs2 setattr updates from Al Viro: "Make it possible for filesystems to use a generic 'may_setattr()' and switch gfs2 to using it" * 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: gfs2: Switch to may_setattr in gfs2_setattr fs: Move notify_change permission checks into may_setattr
2021-09-09Merge branch 'work.init' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull root filesystem type handling updates from Al Viro: "Teach init/do_mounts.c to handle non-block filesystems, hopefully preventing even more special-cased kludges (such as root=/dev/nfs, etc)" * 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: simplify get_filesystem_list / get_all_fs_names init: allow mounting arbitrary non-blockdevice filesystems as root init: split get_fs_names
2021-09-09Merge branch 'work.iov_iter' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter fixes from Al Viro: "Fixes for io-uring handling of iov_iter reexpands" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: io_uring: reexpand under-reexpanded iters iov_iter: track truncated size
2021-09-09Merge tag 'libnvdimm-for-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Fix a race condition in the teardown path of raw mode pmem namespaces. - Cleanup the code that filesystems use to detect filesystem-dax capabilities of their underlying block device. * tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: remove bdev_dax_supported xfs: factor out a xfs_buftarg_is_dax helper dax: stub out dax_supported for !CONFIG_FS_DAX dax: remove __generic_fsdax_supported dax: move the dax_read_lock() locking into dax_supported dax: mark dax_get_by_host static dm: use fs_dax_get_by_bdev instead of dax_get_by_host dax: stop using bdevname fsdax: improve the FS_DAX Kconfig description and help text libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind
2021-09-09io_uring: fail links of cancelled timeoutsPavel Begunkov
When we cancel a timeout we should mark it with REQ_F_FAIL, so linked requests are cancelled as well, but not queued for further execution. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-09io-wq: fix memory leak in create_io_worker()Qiang.zhang
BUG: memory leak unreferenced object 0xffff888126fcd6c0 (size 192): comm "syz-executor.1", pid 11934, jiffies 4294983026 (age 15.690s) backtrace: [<ffffffff81632c91>] kmalloc_node include/linux/slab.h:609 [inline] [<ffffffff81632c91>] kzalloc_node include/linux/slab.h:732 [inline] [<ffffffff81632c91>] create_io_worker+0x41/0x1e0 fs/io-wq.c:739 [<ffffffff8163311e>] io_wqe_create_worker fs/io-wq.c:267 [inline] [<ffffffff8163311e>] io_wqe_enqueue+0x1fe/0x330 fs/io-wq.c:866 [<ffffffff81620b64>] io_queue_async_work+0xc4/0x200 fs/io_uring.c:1473 [<ffffffff8162c59c>] __io_queue_sqe+0x34c/0x510 fs/io_uring.c:6933 [<ffffffff8162c7ab>] io_req_task_submit+0x4b/0xa0 fs/io_uring.c:2233 [<ffffffff8162cb48>] io_async_task_func+0x108/0x1c0 fs/io_uring.c:5462 [<ffffffff816259e3>] tctx_task_work+0x1b3/0x3a0 fs/io_uring.c:2158 [<ffffffff81269b43>] task_work_run+0x73/0xb0 kernel/task_work.c:164 [<ffffffff812dcdd1>] tracehook_notify_signal include/linux/tracehook.h:212 [inline] [<ffffffff812dcdd1>] handle_signal_work kernel/entry/common.c:146 [inline] [<ffffffff812dcdd1>] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] [<ffffffff812dcdd1>] exit_to_user_mode_prepare+0x151/0x180 kernel/entry/common.c:209 [<ffffffff843ff25d>] __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] [<ffffffff843ff25d>] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:302 [<ffffffff843fa4a2>] do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae when create_io_thread() return error, and not retry, the worker object need to be freed. Reported-by: syzbot+65454c239241d3d647da@syzkaller.appspotmail.com Signed-off-by: Qiang.zhang <qiang.zhang@windriver.com> Link: https://lore.kernel.org/r/20210909115822.181188-1-qiang.zhang@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-09cifs: move SMB FSCTL definitions to common codeSteve French
The FSCTL definitions are in smbfsctl.h which should be shared by client and server. Move the updated version of smbfsctl.h into smbfs_common and have the client code use it (subsequent patch will change the server to use this common version of the header). Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08cifs: rename cifs_common to smbfs_commonSteve French
As we move to common code between client and server, we have been asked to make the names less confusing, and refer less to "cifs" and more to words which include "smb" instead to e.g. "smbfs" for the client (we already have "ksmbd" for the kernel server, and "smbd" for the user space Samba daemon). So to be more consistent in the naming of common code between client and server and reduce the risk of merge conflicts as more common code is added - rename "cifs_common" to "smbfs_common" (in future releases we also will rename the fs/cifs directory to fs/smbfs) Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08cifs: update FSCTL definitionsSteve French
Add some missing defines used by ksmbd to the client version of smbfsctl.h, and add a missing newer define mentioned in the protocol definitions (MS-FSCC). This will also make it easier to move to common code. Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08io-wq: fix silly logic error in io_task_work_match()Jens Axboe
We check for the func with an OR condition, which means it always ends up being false and we never match the task_work we want to cancel. In the unexpected case that we do exit with that pending, we can trigger a hang waiting for a worker to exit, but it was never created. syzbot reports that as such: INFO: task syz-executor687:8514 blocked for more than 143 seconds. Not tainted 5.14.0-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor687 state:D stack:27296 pid: 8514 ppid: 8479 flags:0x00024004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x176/0x280 kernel/sched/completion.c:138 io_wq_exit_workers fs/io-wq.c:1162 [inline] io_wq_put_and_exit+0x40c/0xc70 fs/io-wq.c:1197 io_uring_clean_tctx fs/io_uring.c:9607 [inline] io_uring_cancel_generic+0x5fe/0x740 fs/io_uring.c:9687 io_uring_files_cancel include/linux/io_uring.h:16 [inline] do_exit+0x265/0x2a30 kernel/exit.c:780 do_group_exit+0x125/0x310 kernel/exit.c:922 get_signal+0x47f/0x2160 kernel/signal.c:2868 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:865 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:209 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x445cd9 RSP: 002b:00007fc657f4b308 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: 0000000000000001 RBX: 00000000004cb448 RCX: 0000000000445cd9 RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004cb44c RBP: 00000000004cb440 R08: 000000000000000e R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049b154 R13: 0000000000000003 R14: 00007fc657f4b400 R15: 0000000000022000 While in there, also decrement accr->nr_workers. This isn't strictly needed as we're exiting, but let's make sure the accounting matches up. Fixes: 3146cba99aa2 ("io-wq: make worker creation resilient against signals") Reported-by: syzbot+f62d3e0a4ea4f38f5326@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08io_uring: drop ctx->uring_lock before acquiring sqd->lockJens Axboe
The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock for all the registration opcodes. We also hold a ref to the ctx, and we do drop the lock for other reasons to quiesce, so it's fine to drop the ctx lock temporarily to grab the sqd->lock. This fixes the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.5/25433 is trying to acquire lock: ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline] ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline] ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792 but task is already holding lock: ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&ctx->uring_lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:596 [inline] __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729 __io_sq_thread fs/io_uring.c:7291 [inline] io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 -> #0 (&sqd->lock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3051 [inline] check_prevs_add kernel/locking/lockdep.c:3174 [inline] validate_chain kernel/locking/lockdep.c:3789 [inline] __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015 lock_acquire kernel/locking/lockdep.c:5625 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590 __mutex_lock_common kernel/locking/mutex.c:596 [inline] __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729 io_register_iowq_max_workers fs/io_uring.c:10551 [inline] __io_uring_register fs/io_uring.c:10757 [inline] __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ctx->uring_lock); lock(&sqd->lock); lock(&ctx->uring_lock); lock(&sqd->lock); *** DEADLOCK *** Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08Merge branch 'for-5.15/fsdax-cleanups' into for-5.15/libnvdimmDan Williams
Include Christoph's rework of the dax_supported() helpers in the v5.15 libnvdimm update. This supports the ongoing dax-reflink enabling effort.
2021-09-08Merge tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: - a set of patches to address fsync stalls caused by depending on periodic rather than triggered MDS journal flushes in some cases (Xiubo Li) - a fix for mtime effectively not getting updated in case of competing writers (Jeff Layton) - a couple of fixes for inode reference leaks and various WARNs after "umount -f" (Xiubo Li) - a new ceph.auth_mds extended attribute (Jeff Layton) - a smattering of fixups and cleanups from Jeff, Xiubo and Colin. * tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client: ceph: fix dereference of null pointer cf ceph: drop the mdsc_get_session/put_session dout messages ceph: lockdep annotations for try_nonblocking_invalidate ceph: don't WARN if we're forcibly removing the session caps ceph: don't WARN if we're force umounting ceph: remove the capsnaps when removing caps ceph: request Fw caps before updating the mtime in ceph_write_iter ceph: reconnect to the export targets on new mdsmaps ceph: print more information when we can't find snaprealm ceph: add ceph_change_snap_realm() helper ceph: remove redundant initializations from mdsc and session ceph: cancel delayed work instead of flushing on mdsc teardown ceph: add a new vxattr to return auth mds for an inode ceph: remove some defunct forward declarations ceph: flush the mdlog before waiting on unsafe reqs ceph: flush mdlog before umounting ceph: make iterate_sessions a global symbol ceph: make ceph_create_session_msg a global symbol ceph: fix comment about short copies in ceph_write_end ceph: fix memory leak on decode error in ceph_handle_caps
2021-09-08ksmbd: fix control flow issues in sid_to_id()Namjae Jeon
Addresses-Coverity reported Control flow issues in sid_to_id() /fs/ksmbd/smbacl.c: 277 in sid_to_id() 271 272 if (sidtype == SIDOWNER) { 273 kuid_t uid; 274 uid_t id; 275 276 id = le32_to_cpu(psid->sub_auth[psid->num_subauth - 1]); >>> CID 1506810: Control flow issues (NO_EFFECT) >>> This greater-than-or-equal-to-zero comparison of an unsigned value >>> is always true. "id >= 0U". 277 if (id >= 0) { 278 /* 279 * Translate raw sid into kuid in the server's user 280 * namespace. 281 */ 282 uid = make_kuid(&init_user_ns, id); Addresses-Coverity: ("Control flow issues") Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08ksmbd: fix read of uninitialized variable ret in set_file_basic_infoNamjae Jeon
Addresses-Coverity reported Uninitialized variables warninig : /fs/ksmbd/smb2pdu.c: 5525 in set_file_basic_info() 5519 if (!rc) { 5520 inode->i_ctime = ctime; 5521 mark_inode_dirty(inode); 5522 } 5523 inode_unlock(inode); 5524 } >>> CID 1506805: Uninitialized variables (UNINIT) >>> Using uninitialized value "rc". 5525 return rc; 5526 } 5527 5528 static int set_file_allocation_info(struct ksmbd_work *work, 5529 struct ksmbd_file *fp, char *buf) 5530 { Addresses-Coverity: ("Uninitialized variable") Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08ksmbd: add missing assignments to ret on ndr_read_int64 read callsColin Ian King
Currently there are two ndr_read_int64 calls where ret is being checked for failure but ret is not being assigned a return value from the call. Static analyis is reporting the checks on ret as dead code. Fix this. Addresses-Coverity: ("Logical dead code") Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-08io_uring: fix missing mb() before waitqueue_activePavel Begunkov
In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a memory barrier required by waitqueue_active(&ctx->poll_wait). There is a wq_has_sleeper(), which does smb_mb() inside, but it's called only for SQPOLL. Fixes: 5fd4617840596 ("io_uring: be smarter about waking multiple CQ ring waiters") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08coredump: fix memleak in dump_vma_snapshot()QiuXi
dump_vma_snapshot() allocs memory for *vma_meta, when dump_vma_snapshot() returns -EFAULT, the memory will be leaked, so we free it correctly. Link: https://lkml.kernel.org/r/20210810020441.62806-1-qiuxi1@huawei.com Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Signed-off-by: QiuXi <qiuxi1@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08fs/coredump.c: log if a core dump is aborted due to changed file permissionsDavid Oberhollenzer
For obvious security reasons, a core dump is aborted if the filesystem cannot preserve ownership or permissions of the dump file. This affects filesystems like e.g. vfat, but also something like a 9pfs share in a Qemu test setup, running as a regular user, depending on the security model used. In those cases, the result is an empty core file and a confused user. To hopefully save other people a lot of time figuring out the cause, this patch adds a simple log message for those specific cases. [akpm@linux-foundation.org: s/|%s/%s/ in printk text] Link: https://lkml.kernel.org/r/20210701233151.102720-1-david.oberhollenzer@sigma-star.at Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: use refcount_dec_and_lock() to fix potential UAFZhen Lei
When the refcount is decreased to 0, the resource reclamation branch is entered. Before CPU0 reaches the race point (1), CPU1 may obtain the spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root(). Although CPU1 will call refcount_inc() to increase the refcount, it is obviously too late. CPU0 will release 'root' directly, CPU1 then accesses 'root' and triggers UAF. Use refcount_dec_and_lock() to ensure that both the operations of decrease refcount to 0 and link deletion are lock protected eliminates this risk. CPU0 CPU1 nilfs_put_root(): <-------- (1) spin_lock(&nilfs->ns_cptree_lock); rb_erase(&root->rb_node, &nilfs->ns_cptree); spin_unlock(&nilfs->ns_cptree_lock); kfree(root); <-------- use-after-free refcount_t: underflow; use-after-free. WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \ refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28 Modules linked in: CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ... RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28 ... ... Call Trace: __refcount_sub_and_test include/linux/refcount.h:283 [inline] __refcount_dec_and_test include/linux/refcount.h:315 [inline] refcount_dec_and_test include/linux/refcount.h:333 [inline] nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795 nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline] nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812 nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467 generic_shutdown_super+0xcd/0x1f0 fs/super.c:464 kill_block_super+0x4a/0x90 fs/super.c:1446 deactivate_locked_super+0x6a/0xb0 fs/super.c:335 deactivate_super+0x85/0x90 fs/super.c:366 cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118 __cleanup_mnt+0x15/0x20 fs/namespace.c:1125 task_work_run+0x8e/0x110 kernel/task_work.c:151 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:164 [inline] exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191 syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266 do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There is no reproduction program, and the above is only theoretical analysis. Link: https://lkml.kernel.org/r/1629859428-5906-1-git-send-email-konishi.ryusuke@gmail.com Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object") Link: https://lkml.kernel.org/r/20210723012317.4146-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_groupNanyong Sun
kobject_put() should be used to cleanup the memory associated with the kobject instead of kobject_del(). See the section "Kobject removal" of "Documentation/core-api/kobject.rst". Link: https://lkml.kernel.org/r/20210629022556.3985106-7-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-7-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_groupNanyong Sun
If kobject_init_and_add returns with error, kobject_put() is needed here to avoid memory leak, because kobject_init_and_add may return error without freeing the memory associated with the kobject it allocated. Link: https://lkml.kernel.org/r/20210629022556.3985106-6-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-6-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_groupNanyong Sun
The kobject_put() should be used to cleanup the memory associated with the kobject instead of kobject_del. See the section "Kobject removal" of "Documentation/core-api/kobject.rst". Link: https://lkml.kernel.org/r/20210629022556.3985106-5-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-5-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_create_##name##_groupNanyong Sun
If kobject_init_and_add return with error, kobject_put() is needed here to avoid memory leak, because kobject_init_and_add may return error without freeing the memory associated with the kobject it allocated. Link: https://lkml.kernel.org/r/20210629022556.3985106-4-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-4-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: fix NULL pointer in nilfs_##name##_attr_releaseNanyong Sun
In nilfs_##name##_attr_release, kobj->parent should not be referenced because it is a NULL pointer. The release() method of kobject is always called in kobject_put(kobj), in the implementation of kobject_put(), the kobj->parent will be assigned as NULL before call the release() method. So just use kobj to get the subgroups, which is more efficient and can fix a NULL pointer reference problem. Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_create_device_groupNanyong Sun
Patch series "nilfs2: fix incorrect usage of kobject". This patchset from Nanyong Sun fixes memory leak issues and a NULL pointer dereference issue caused by incorrect usage of kboject in nilfs2 sysfs implementation. This patch (of 6): Reported by syzkaller: BUG: memory leak unreferenced object 0xffff888100ca8988 (size 8): comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s) hex dump (first 8 bytes): 6c 6f 6f 70 31 00 ff ff loop1... backtrace: kstrdup+0x36/0x70 mm/util.c:60 kstrdup_const+0x35/0x60 mm/util.c:83 kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48 kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xc9/0x150 lib/kobject.c:473 nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986 init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637 nilfs_fill_super fs/nilfs2/super.c:1046 [inline] nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316 legacy_get_tree+0x105/0x210 fs/fs_context.c:592 vfs_get_tree+0x8e/0x2d0 fs/super.c:1498 do_new_mount fs/namespace.c:2905 [inline] path_mount+0xf9b/0x1990 fs/namespace.c:3235 do_mount+0xea/0x100 fs/namespace.c:3248 __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount fs/namespace.c:3433 [inline] __x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae If kobject_init_and_add return with error, then the cleanup of kobject is needed because memory may be allocated in kobject_init_and_add without freeing. And the place of cleanup_dev_kobject should use kobject_put to free the memory associated with the kobject. As the section "Kobject removal" of "Documentation/core-api/kobject.rst" says, kobject_del() just makes the kobject "invisible", but it is not cleaned up. And no more cleanup will do after cleanup_dev_kobject, so kobject_put is needed here. Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com Reported-by: Hulk Robot <hulkci@huawei.com> Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08fs/epoll: use a per-cpu counter for user's watches countNicholas Piggin
This counter tracks the number of watches a user has, to compare against the 'max_user_watches' limit. This causes a scalability bottleneck on SPECjbb2015 on large systems as there is only one user. Changing to a per-cpu counter increases throughput of the benchmark by about 30% on a 16-socket, > 1000 thread system. [rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n] [npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message] Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none [akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter] Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reported-by: Anton Blanchard <anton@ozlabs.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08connector: send event on write to /proc/[pid]/commOhhoon Kwon
While comm change event via prctl has been reported to proc connector by 'commit f786ecba4158 ("connector: add comm change event report to proc connector")', connector listeners were missing comm changes by explicit writes on /proc/[pid]/comm. Let explicit writes on /proc/[pid]/comm report to proc connector. Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6 Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08proc: stop using seq_get_buf in proc_task_nameChristoph Hellwig
Use seq_escape_str and seq_printf instead of poking holes into the seq_file abstraction. Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08hugetlbfs: s390 is always 64bitDavid Hildenbrand
No need to check for 64BIT. While at it, let's just select ARCH_SUPPORTS_HUGETLBFS from arch/s390/Kconfig. Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20210908154506.20764-1-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2021-09-08io-wq: fix cancellation on create-worker failurePavel Begunkov
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test fs/io_uring.c:1151 [inline] WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test fs/io_uring.c:1146 [inline] WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794 Modules linked in: Call Trace: tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158 task_work_run+0xe0/0x1a0 kernel/task_work.c:164 tracehook_notify_signal include/linux/tracehook.h:212 [inline] handle_signal_work kernel/entry/common.c:146 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just call io_run_cancel() to clean up the request, it's already enqueued via io_wqe_insert_work() and will be executed either by some other worker during cancellation (e.g. in io_wq_put_and_exit()). Reported-by: Hao Sun <sunhao.th@gmail.com> Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07putname(): IS_ERR_OR_NULL() is wrong hereAl Viro
Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm). For struct filename the former is wrong - failures are reported as ERR_PTR(...), not as NULL. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07namei: Standardize callers of filename_create()Stephen Brennan
filename_create() has two variants, one which drops the caller's reference to filename (filename_create) and one which does not (__filename_create). This can be confusing as it's unusual to drop a caller's reference. Remove filename_create, rename __filename_create to filename_create, and convert all callers. Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/ Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07namei: Standardize callers of filename_lookup()Stephen Brennan
filename_lookup() has two variants, one which drops the caller's reference to filename (filename_lookup), and one which does not (__filename_lookup). This can be confusing as it's unusual to drop a caller's reference. Remove filename_lookup, rename __filename_lookup to filename_lookup, and convert all callers. The cost is a few slightly longer functions, but the clarity is greater. [AV: consuming a reference is not at all unusual, actually; look at e.g. do_mkdirat(), for example. It's more that we want non-consuming variant for close relative of that function...] Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/ Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07rename __filename_parentat() to filename_parentat()Al Viro
... in separate commit, to avoid noise in previous one Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07namei: Fix use after free in kern_path_lockedStephen Brennan
In 0ee50b47532a ("namei: change filename_parentat() calling conventions"), filename_parentat() was made to always call putname() on the filename before returning, and kern_path_locked() was migrated to this calling convention. However, kern_path_locked() uses the "last" parameter to lookup and potentially create a new dentry. The last parameter contains the last component of the path and points within the filename, which was recently freed at the end of filename_parentat(). Thus, when kern_path_locked() calls __lookup_hash(), it is using the filename after it has already been freed. In other words, these calling conventions had been wrong for the only remaining caller of filename_parentat(). Everything else is using __filename_parentat(), which does not drop the reference; so should kern_path_locked(). Switch kern_path_locked() to use of __filename_parentat() and move getting/dropping struct filename into wrapper. Remove filename_parentat(), now that we have no remaining callers. Fixes: 0ee50b47532a ("namei: change filename_parentat() calling conventions") Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/ Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/ Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07Merge tag 'fuse-update-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Allow mounting an active fuse device. Previously the fuse device would always be mounted during initialization, and sharing a fuse superblock was only possible through mount or namespace cloning - Fix data flushing in syncfs (virtiofs only) - Fix data flushing in copy_file_range() - Fix a possible deadlock in atomic O_TRUNC - Misc fixes and cleanups * tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove unused arg in fuse_write_file_get() fuse: wait for writepages in syncfs fuse: flush extending writes fuse: truncate pagecache on atomic_o_trunc fuse: allow sharing existing sb fuse: move fget() to fuse_get_tree() fuse: move option checking into fuse_fill_super() fuse: name fs_context consistently fuse: fix use after free in fuse_read_interrupt()
2021-09-07Revert "memcg: enable accounting for pollfd and select bits arrays"Linus Torvalds
This reverts commit b655843444152c0a14b749308e4cb35d91cbcf0b. Just like with the memcg lock accounting, the kernel test robot reports a sizeable performance regression for this commit, and while it clearly does the rigth thing in theory, we'll need to look at just how to avoid or minimize the performance overhead of the memcg accounting. People already have suggestions on how to do that, but it's "future work". So revert it for now. [ Note: the first link below is for this same commit but a different commit ID, because it's the kernel test robot ended up noticing it in Andrew Morton's patch queue ] Link: https://lore.kernel.org/lkml/20210905132732.GC15026@xsang-OptiPlex-9020/ Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/ Acked-by: Jens Axboe <axboe@kernel.dk> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-07Revert "memcg: enable accounting for file lock caches"Linus Torvalds
This reverts commit 0f12156dff2862ac54235fc72703f18770769042. The kernel test robot reports a sizeable performance regression for this commit, and while it clearly does the rigth thing in theory, we'll need to look at just how to avoid or minimize the performance overhead of the memcg accounting. People already have suggestions on how to do that, but it's "future work". So revert it for now. Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/ Acked-by: Jens Axboe <axboe@kernel.dk> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-07Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"Linus Torvalds
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e. That commit was completely broken, and I should have caught on to it earlier. But happily, the kernel test robot noticed the breakage fairly quickly. The breakage is because "try_get_page()" is about avoiding the page reference count overflow case, but is otherwise the exact same as a plain "get_page()". In contrast, "try_get_compound_head()" is an entirely different beast, and uses __page_cache_add_speculative() because it's not just about the page reference count, but also about possibly racing with the underlying page going away. So all the commentary about how "try_get_page() has fallen a little behind in terms of maintenance, try_get_compound_head() handles speculative page references more thoroughly" was just completely wrong: yes, try_get_compound_head() handles speculative page references, but the point is that try_get_page() does not, and must not. So there's no lack of maintainance - there are fundamentally different semantics. A speculative page reference would be entirely wrong in "get_page()", and it's entirely wrong in "try_get_page()". It's not about speculation, it's purely about "uhhuh, you can't get this page because you've tried to increment the reference count too much already". The reason the kernel test robot noticed this bug was that it hit the VM_BUG_ON() in __page_cache_add_speculative(), which is all about verifying that the context of any speculative page access is correct. But since that isn't what try_get_page() is all about, the VM_BUG_ON() tests things that are not correct to test for try_get_page(). Reported-by: kernel test robot <oliver.sang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-07block: move fs/block_dev.c to block/bdev.cChristoph Hellwig
Move it together with the rest of the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07block: split out operations on block special filesChristoph Hellwig
Add a new block/fops.c for all the file and address_space operations that provide the block special file support. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de [axboe: correct trailing whitespace while at it] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07btrfs: zoned: fix double counting of split ordered extentNaohiro Aota
btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes. Then, splitting an ordered extent will call btrfs_add_ordered_extent_*() again for split extents, leading to double counting of the region of a split extent. These leaked bytes are finally reported at unmount time as follow: BTRFS info (device dm-1): at unmount dio bytes count 364544 Fix the double counting by subtracting split extent's size from fs_info->ordered_bytes. Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07btrfs: fix lockdep warning while mounting sprout fsAnand Jain
Following test case reproduces lockdep warning. Test case: $ mkfs.btrfs -f <dev1> $ btrfstune -S 1 <dev1> $ mount <dev1> <mnt> $ btrfs device add <dev2> <mnt> -f $ umount <mnt> $ mount <dev2> <mnt> $ umount <mnt> The warning claims a possible ABBA deadlock between the threads initiated by [#1] btrfs device add and [#0] the mount. [ 540.743122] WARNING: possible circular locking dependency detected [ 540.743129] 5.11.0-rc7+ #5 Not tainted [ 540.743135] ------------------------------------------------------ [ 540.743142] mount/2515 is trying to acquire lock: [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs] [ 540.743458] but task is already holding lock: [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.743541] which lock already depends on the new lock. [ 540.743543] the existing dependency chain (in reverse order) is: [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}: [ 540.743566] down_read_nested+0x48/0x2b0 [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs] [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs] [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs] [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs] [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs] [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs] [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs] [ 540.744166] __x64_sys_ioctl+0xe4/0x140 [ 540.744170] do_syscall_64+0x4b/0x80 [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}: [ 540.744184] __lock_acquire+0x155f/0x2360 [ 540.744188] lock_acquire+0x10b/0x5c0 [ 540.744190] __mutex_lock+0xb1/0xf80 [ 540.744193] mutex_lock_nested+0x27/0x30 [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs] [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs] [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs] [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs] [ 540.744472] legacy_get_tree+0x38/0x90 [ 540.744475] vfs_get_tree+0x30/0x140 [ 540.744478] fc_mount+0x16/0x60 [ 540.744482] vfs_kern_mount+0x91/0x100 [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs] [ 540.744536] legacy_get_tree+0x38/0x90 [ 540.744537] vfs_get_tree+0x30/0x140 [ 540.744539] path_mount+0x8d8/0x1070 [ 540.744541] do_mount+0x8d/0xc0 [ 540.744543] __x64_sys_mount+0x125/0x160 [ 540.744545] do_syscall_64+0x4b/0x80 [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 540.744551] other info that might help us debug this: [ 540.744552] Possible unsafe locking scenario: [ 540.744553] CPU0 CPU1 [ 540.744554] ---- ---- [ 540.744555] lock(btrfs-chunk-00); [ 540.744557] lock(&fs_devs->device_list_mutex); [ 540.744560] lock(btrfs-chunk-00); [ 540.744562] lock(&fs_devs->device_list_mutex); [ 540.744564] *** DEADLOCK *** [ 540.744565] 3 locks held by mount/2515: [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450 [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs] [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.744708] stack backtrace: [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5 But the device_list_mutex in clone_fs_devices() is redundant, as explained below. Two threads [1] and [2] (below) could lead to clone_fs_device(). [1] open_ctree <== mount sprout fs btrfs_read_chunk_tree() mutex_lock(&uuid_mutex) <== global lock read_one_dev() open_seed_devices() clone_fs_devices() <== seed fs_devices mutex_lock(&orig->device_list_mutex) <== seed fs_devices [2] btrfs_init_new_device() <== sprouting mutex_lock(&uuid_mutex); <== global lock btrfs_prepare_sprout() lockdep_assert_held(&uuid_mutex) clone_fs_devices(seed_fs_device) <== seed fs_devices Both of these threads hold uuid_mutex which is sufficient to protect getting the seed device(s) freed while we are trying to clone it for sprouting [2] or mounting a sprout [1] (as above). A mounted seed device can not free/write/replace because it is read-only. An unmounted seed device can be freed by btrfs_free_stale_devices(), but it needs uuid_mutex. So this patch removes the unnecessary device_list_mutex in clone_fs_devices(). And adds a lockdep_assert_held(&uuid_mutex) in clone_fs_devices(). Reported-by: Su Yue <l@damenly.su> Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>