linux - Torvalds' mirror for benchmark reasons

Age	Commit message (Collapse)	Author
2020-05-07	exec: Rename flush_old_exec begin_new_exec	Eric W. Biederman
	There is and has been for a very long time been a lot more going on in flush_old_exec than just flushing the old state. After the movement of code from setup_new_exec there is a whole lot more going on than just flushing the old executables state. Rename flush_old_exec to begin_new_exec to more accurately reflect what this function does. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07	exec: Move most of setup_new_exec into flush_old_exec	Eric W. Biederman
	The current idiom for the callers is: flush_old_exec(bprm); set_personality(...); setup_new_exec(bprm); In 2010 Linus split flush_old_exec into flush_old_exec and setup_new_exec. With the intention that setup_new_exec be what is called after the processes new personality is set. Move the code that doesn't depend upon the personality from setup_new_exec into flush_old_exec. This is to facilitate future changes by having as much code together in one function as possible. To see why it is safe to move this code please note that effectively this change moves the personality setting in the binfmt and the following three lines of code after everything except unlocking the mutexes: arch_pick_mmap_layout arch_setup_new_exec mm->task_size = TASK_SIZE The function arch_pick_mmap_layout at most sets: mm->get_unmapped_area mm->mmap_base mm->mmap_legacy_base mm->mmap_compat_base mm->mmap_compat_legacy_base which nothing in flush_old_exec or setup_new_exec depends on. The function arch_setup_new_exec only sets architecture specific state and the rest of the functions only deal in state that applies to all architectures. The last line just sets mm->task_size and again nothing in flush_old_exec or setup_new_exec depend on task_size. Ref: 221af7f87b97 ("Split 'flush_old_exec' into two functions") Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07	exec: In setup_new_exec cache current in the local variable me	Eric W. Biederman
	At least gcc 8.3 when generating code for x86_64 has a hard time consolidating multiple calls to current aka get_current(), and winds up unnecessarily rereading %gs:current_task several times in setup_new_exec. Caching the value of current in the local variable of me generates slightly better and shorter assembly. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07	exec: Merge install_exec_creds into setup_new_exec	Eric W. Biederman
	The two functions are now always called one right after the other so merge them together to make future maintenance easier. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07	exec: Rename the flag called_exec_mmap point_of_no_return	Eric W. Biederman
	Update the comments and make the code easier to understand by renaming this flag. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07	exec: Make unlocking exec_update_mutex explict	Eric W. Biederman
	With install_exec_creds updated to follow immediately after setup_new_exec, the failure of unshare_sighand is the only code path where exec_update_mutex is held but not explicitly unlocked. Update that code path to explicitly unlock exec_update_mutex. Remove the unlocking of exec_update_mutex from free_bprm. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07	binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf	Eric W. Biederman
	In 2016 Linus moved install_exec_creds immediately after setup_new_exec, in binfmt_elf as a cleanup and as part of closing a potential information leak. Perform the same cleanup for the other binary formats. Different binary formats doing the same things the same way makes exec easier to reason about and easier to maintain. Greg Ungerer reports: > I tested the the whole series on non-MMU m68k and non-MMU arm > (exercising binfmt_flat) and it all tested out with no problems, > so for the binfmt_flat changes: Tested-by: Greg Ungerer <gerg@linux-m68k.org> Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm") Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-26	Merge tag '5.7-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6	Linus Torvalds
	Pull cifs fixes from Steve French: "Five cifs/smb3 fixes:two for DFS reconnect failover, one lease fix for stable and the others to fix a missing spinlock during reconnect" * tag '5.7-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix uninitialised lease_key in open_shroot() cifs: ensure correct super block for DFS reconnect cifs: do not share tcons with DFS cifs: minor update to comments around the cifs_tcp_ses_lock mutex cifs: protect updating server->dstaddr with a spinlock
2020-04-26	Merge tag 'driver-core-5.7-rc3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small firmware/driver core/debugfs fixes for 5.7-rc3. The debugfs change is now possible as now the last users of debugfs_create_u32() have been fixed up in the different trees that got merged into 5.7-rc1, and I don't want it creeping back in. The firmware changes did cause a regression in linux-next, so the final patch here reverts part of that, re-exporting the symbol to resolve that issue. All of these patches, with the exception of the final one, have been in linux-next with only that one reported issue" * tag 'driver-core-5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: firmware_loader: revert removal of the fw_fallback_config export debugfs: remove return value of debugfs_create_u32() firmware_loader: remove unused exports firmware: imx: fix compile-testing
2020-04-25	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull pid leak fix from Eric Biederman: "Oleg noticed that put_pid(thread_pid) was not getting called when proc was not compiled in. Let's get that fixed before 5.7 is released and causes problems for anyone" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Put thread_pid in release_task not proc_flush_pid
2020-04-24	proc: Put thread_pid in release_task not proc_flush_pid	Eric W. Biederman
	Oleg pointed out that in the unlikely event the kernel is compiled with CONFIG_PROC_FS unset that release_task will now leak the pid. Move the put_pid out of proc_flush_pid into release_task to fix this and to guarantee I don't make that mistake again. When possible it makes sense to keep get and put in the same function so it can easily been seen how they pair up. Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-24	Merge tag 'io_uring-5.7-2020-04-24' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull io_uring fix from Jens Axboe: "Single fixup for a change that went into -rc2" * tag 'io_uring-5.7-2020-04-24' of git://git.kernel.dk/linux-block: io_uring: only restore req->work for req that needs do completion
2020-04-24	Merge tag 'block-5.7-2020-04-24' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull block fixes from Jens Axboe: "A few fixes/changes that should go into this release: - null_blk zoned fixes (Damien) - blkdev_close() sync improvement (Douglas) - Fix regression in blk-iocost that impacted (at least) systemtap (Waiman) - Comment fix, header removal (Zhiqiang, Jianpeng)" * tag 'block-5.7-2020-04-24' of git://git.kernel.dk/linux-block: null_blk: Cleanup zoned device initialization null_blk: Fix zoned command handling block: remove unused header blk-iocost: Fix error on iocost_ioc_vrate_adj bdev: Reduce time holding bd_mutex in sync in blkdev_close() buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
2020-04-24	Merge tag 'afs-fixes-20200424' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull misc AFS fixes from David Howells: "Three miscellaneous fixes to the afs filesystem: - Remove some struct members that aren't used, aren't set or aren't read, plus a wake up that nothing ever waits for. - Actually set the AFS_SERVER_FL_HAVE_EPOCH flag so that the code that depends on it can work. - Make a couple of waits uninterruptible if they're done for an operation that isn't supposed to be interruptible" * tag 'afs-fixes-20200424' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Make record checking use TASK_UNINTERRUPTIBLE when appropriate afs: Fix to actually set AFS_SERVER_FL_HAVE_EPOCH afs: Remove some unused bits
2020-04-24	afs: Make record checking use TASK_UNINTERRUPTIBLE when appropriate	David Howells
	When an operation is meant to be done uninterruptibly (such as FS.StoreData), we should not be allowing volume and server record checking to be interrupted. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2020-04-24	afs: Fix to actually set AFS_SERVER_FL_HAVE_EPOCH	David Howells
	AFS keeps track of the epoch value from the rxrpc protocol to note (a) when a fileserver appears to have restarted and (b) when different endpoints of a fileserver do not appear to be associated with the same fileserver (ie. all probes back from a fileserver from all of its interfaces should carry the same epoch). However, the AFS_SERVER_FL_HAVE_EPOCH flag that indicates that we've received the server's epoch is never set, though it is used. Fix this to set the flag when we first receive an epoch value from a probe sent to the filesystem client from the fileserver. Fixes: 3bf0fb6f33dd ("afs: Probe multiple fileservers simultaneously") Signed-off-by: David Howells <dhowells@redhat.com>
2020-04-24	afs: Remove some unused bits	David Howells
	Remove three bits: (1) afs_server::no_epoch is neither set nor used. (2) afs_server::have_result is set and a wakeup is applied to it, but nothing looks at it or waits on it. (3) afs_vl_dump_edestaddrreq() prints afs_addr_list::probed, but nothing sets it for VL servers. Signed-off-by: David Howells <dhowells@redhat.com>
2020-04-23	Merge tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6	Linus Torvalds
	Pull nfsd fixes from Chuck Lever: "The first set of 5.7-rc fixes for NFS server issues. These were all unresolved at the time the 5.7 window opened, and needed some additional time to ensure they were correctly addressed. They are ready now. At the moment I know of one more urgent issue regarding the NFS server. A fix has been tested and is under review. I expect to send one more pull request, containing this fix (which now consists of 3 patches). Fixes: - Address several use-after-free and memory leak bugs - Prevent a backchannel livelock" * tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6: svcrdma: Fix leak of svc_rdma_recv_ctxt objects svcrdma: Fix trace point use-after-free race SUNRPC: Fix backchannel RPC soft lockups SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge nfsd: memory corruption in nfsd4_lock()
2020-04-23	Merge tag 'for-5.7-rc3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - several bug fixes(broken mount discard option, remount failure, memory leak) - add missing MODULE_ALIAS_FS for automatically loading exfat module. - set s_time_gran and truncate atime with exfat timestamp granularity. * tag 'for-5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: truncate atimes to 2s granularity exfat: properly set s_time_gran exfat: remove 'bps' mount-option exfat: Unify access to the boot sector exfat: add missing MODULE_ALIAS_FS() exfat: Fix discard support
2020-04-22	cifs: fix uninitialised lease_key in open_shroot()	Paulo Alcantara
	SMB2_open_init() expects a pre-initialised lease_key when opening a file with a lease, so set pfid->lease_key prior to calling it in open_shroot(). This issue was observed when performing some DFS failover tests and the lease key was never randomly generated. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org>
2020-04-22	cifs: ensure correct super block for DFS reconnect	Paulo Alcantara
	This patch is basically fixing the lookup of tcons (DFS specific) during reconnect (smb2pdu.c:__smb2_reconnect) to update their prefix paths. Previously, we relied on the TCP_Server_Info pointer (misc.c:tcp_super_cb) to determine which tcon to update the prefix path We could not rely on TCP server pointer to determine which super block to update the prefix path when reconnecting tcons since it might map to different tcons that share same TCP connection. Instead, walk through all cifs super blocks and compare their DFS full paths with the tcon being updated to. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-04-22	cifs: do not share tcons with DFS	Paulo Alcantara
	This disables tcon re-use for DFS shares. tcon->dfs_path stores the path that the tcon should connect to when doing failing over. If that tcon is used multiple times e.g. 2 mounts using it with different prefixpath, each will need a different dfs_path but there is only one tcon. The other solution would be to split the tcon in 2 tcons during failover but that is much harder. tcons could not be shared with DFS in cifs.ko because in a DFS namespace like: //domain/dfsroot -> /serverA/dfsroot, /serverB/dfsroot //serverA/dfsroot/link -> /serverA/target1/aa/bb //serverA/dfsroot/link2 -> /serverA/target1/cc/dd you can see that link and link2 are two DFS links that both resolve to the same target share (/serverA/target1), so cifs.ko will only contain a single tcon for both link and link2. The problem with that is, if we (auto)mount "link" and "link2", cifs.ko will only contain a single tcon for both DFS links so we couldn't perform failover or refresh the DFS cache for both links because tcon->dfs_path was set to either "link" or "link2", but not both -- which is wrong. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-04-22	exfat: truncate atimes to 2s granularity	Eric Sandeen
	The timestamp for access_time has double seconds granularity(There is no 10msIncrement field for access_time unlike create/modify_time). exfat's atimes are restricted to only 2s granularity so after we set an atime, round it down to the nearest 2s and set the sub-second component of the timestamp to 0. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-04-22	exfat: properly set s_time_gran	Eric Sandeen
	The s_time_gran superblock field indicates the on-disk nanosecond granularity of timestamps, and for exfat that seems to be 10ms, so set s_time_gran to 10000000ns. Without this, in-memory timestamps change when they get re-read from disk. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-04-22	exfat: remove 'bps' mount-option	Tetsuhiro Kohada
	remount fails because exfat_show_options() returns unsupported option 'bps'. > # mount -o ro,remount > exfat: Unknown parameter 'bps' To fix the problem, just remove 'bps' option from exfat_show_options(). Signed-off-by: Tetsuhiro Kohada <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-04-22	exfat: Unify access to the boot sector	Tetsuhiro Kohada
	Unify access to boot sector via 'sbi->pbr_bh'. This fixes vol_flags inconsistency at read failed in fs_set_vol_flags(), and buffer_head leak in __exfat_fill_super(). Signed-off-by: Tetsuhiro Kohada <Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-04-22	exfat: add missing MODULE_ALIAS_FS()	Thomas Backlund
	This adds the necessary MODULE_ALIAS_FS() to exfat so the module gets automatically loaded when an exfat filesystem is mounted. Signed-off-by: Thomas Backlund <tmb@mageia.org> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-04-22	exfat: Fix discard support	Pali Rohár
	Discard support was always unconditionally disabled. Now it is disabled only in the case when blk_queue_discard() returns false. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-04-21	cifs: minor update to comments around the cifs_tcp_ses_lock mutex	Steve French
	Update comment to note that it protects server->dstaddr Signed-off-by: Steve French <stfrench@microsoft.com>
2020-04-21	coredump: fix null pointer dereference on coredump	Sudip Mukherjee
	If the core_pattern is set to "\|" and any process segfaults then we get a null pointer derefernce while trying to coredump. The call stack shows: RIP: do_coredump+0x628/0x11c0 When the core_pattern has only "\|" there is no use of trying the coredump and we can check that while formating the corename and exit with an error. After this change I get: format_corename failed Aborting core Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template") Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Wise <pabs3@bonedaddy.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416194612.21418-1-sudipm.mukherjee@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-21	vmalloc: fix remap_vmalloc_range() bounds checks	Jann Horn
	remap_vmalloc_range() has had various issues with the bounds checks it promises to perform ("This function checks that addr is a valid vmalloc'ed area, and that it is big enough to cover the vma") over time, e.g.: - not detecting pgoff<<PAGE_SHIFT overflow - not detecting (pgoff<<PAGE_SHIFT)+usize overflow - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same vmalloc allocation - comparing a potentially wildly out-of-bounds pointer with the end of the vmalloc region In particular, since commit fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer dereferences by calling mmap() on a BPF map with a size that is bigger than the distance from the start of the BPF map to the end of the address space. This could theoretically be used as a kernel ASLR bypass, by using whether mmap() with a given offset oopses or returns an error code to perform a binary search over the possible address range. To allow remap_vmalloc_range_partial() to verify that addr and addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, pass the offset to remap_vmalloc_range_partial() instead of adding it to the pointer in remap_vmalloc_range(). In remap_vmalloc_range_partial(), fix the check against get_vm_area_size() by using size comparisons instead of pointer comparisons, and add checks for pgoff. Fixes: 833423143c3a ("[PATCH] mm: introduce remap_vmalloc_range()") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-21	block: remove unused header	Ma, Jianpeng
	Dax related code already removed from this file. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-21	cifs: protect updating server->dstaddr with a spinlock	Ronnie Sahlberg
	We use a spinlock while we are reading and accessing the destination address for a server. We need to also use this spinlock to protect when we are modifying this address from reconn_set_ipaddr(). Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-04-20	bdev: Reduce time holding bd_mutex in sync in blkdev_close()	Douglas Anderson
	While trying to "dd" to the block device for a USB stick, I encountered a hung task warning (blocked for > 120 seconds). I managed to come up with an easy way to reproduce this on my system (where /dev/sdb is the block device for my USB stick) with: while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done With my reproduction here are the relevant bits from the hung task detector: INFO: task udevd:294 blocked for more than 122 seconds. ... udevd D 0 294 1 0x00400008 Call trace: ... mutex_lock_nested+0x40/0x50 __blkdev_get+0x7c/0x3d4 blkdev_get+0x118/0x138 blkdev_open+0x94/0xa8 do_dentry_open+0x268/0x3a0 vfs_open+0x34/0x40 path_openat+0x39c/0xdf4 do_filp_open+0x90/0x10c do_sys_open+0x150/0x3c8 ... ... Showing all locks held in the system: ... 1 lock held by dd/2798: #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204 ... dd D 0 2798 2764 0x00400208 Call trace: ... schedule+0x8c/0xbc io_schedule+0x1c/0x40 wait_on_page_bit_common+0x238/0x338 __lock_page+0x5c/0x68 write_cache_pages+0x194/0x500 generic_writepages+0x64/0xa4 blkdev_writepages+0x24/0x30 do_writepages+0x48/0xa8 __filemap_fdatawrite_range+0xac/0xd8 filemap_write_and_wait+0x30/0x84 __blkdev_put+0x88/0x204 blkdev_put+0xc4/0xe4 blkdev_close+0x28/0x38 __fput+0xe0/0x238 ____fput+0x1c/0x28 task_work_run+0xb0/0xe4 do_notify_resume+0xfc0/0x14bc work_pending+0x8/0x14 The problem appears related to the fact that my USB disk is terribly slow and that I have a lot of RAM in my system to cache things. Specifically my writes seem to be happening at ~15 MB/s and I've got ~4 GB of RAM in my system that can be used for buffering. To write 4 GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds. The 267 second number is a problem because in __blkdev_put() we call sync_blockdev() while holding the bd_mutex. Any other callers who want the bd_mutex will be blocked for the whole time. The problem is made worse because I believe blkdev_put() specifically tells other tasks (namely udev) to go try to access the device at right around the same time we're going to hold the mutex for a long time. Putting some traces around this (after disabling the hung task detector), I could confirm: dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb A simple fix for this is to realize that sync_blockdev() works fine if you're not holding the mutex. Also, it's not the end of the world if you sync a little early (though it can have performance impacts). Thus we can make a guess that we're going to need to do the sync and then do it without holding the mutex. We still do one last sync with the mutex but it should be much, much faster. With this, my hung task warnings for my test case are gone. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-19	io_uring: only restore req->work for req that needs do completion	Xiaoguang Wang
	When testing io_uring IORING_FEAT_FAST_POLL feature, I got below panic: BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 5 PID: 2154 Comm: io_uring_echo_s Not tainted 5.6.0+ #359 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:io_wq_submit_work+0xf/0xa0 Code: ff ff ff be 02 00 00 00 e8 ae c9 19 00 e9 58 ff ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 2f <8b> 45 30 48 8d 9d 48 ff ff ff 25 01 01 00 00 83 f8 01 75 07 eb 2a RSP: 0018:ffffbef543e93d58 EFLAGS: 00010286 RAX: ffffffff84364f50 RBX: ffffa3eb50f046b8 RCX: 0000000000000000 RDX: ffffa3eb0efc1840 RSI: 0000000000000006 RDI: ffffa3eb50f046b8 RBP: 0000000000000000 R08: 00000000fffd070d R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3eb50f046b8 R13: ffffa3eb0efc2088 R14: ffffffff85b69be0 R15: ffffa3eb0effa4b8 FS: 00007fe9f69cc4c0(0000) GS:ffffa3eb5ef40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000020410000 CR4: 00000000000006e0 Call Trace: task_work_run+0x6d/0xa0 do_exit+0x39a/0xb80 ? get_signal+0xfe/0xbc0 do_group_exit+0x47/0xb0 get_signal+0x14b/0xbc0 ? __x64_sys_io_uring_enter+0x1b7/0x450 do_signal+0x2c/0x260 ? __x64_sys_io_uring_enter+0x228/0x450 exit_to_usermode_loop+0x87/0xf0 do_syscall_64+0x209/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fe9f64f8df9 Code: Bad RIP value. task_work_run calls io_wq_submit_work unexpectedly, it's obvious that struct callback_head's func member has been changed. After looking into codes, I found this issue is still due to the union definition: union { /* * Only commands that never go async can use the below fields, * obviously. Right now only IORING_OP_POLL_ADD uses them, and * async armed poll handlers for regular commands. The latter * restore the work, if needed. / struct { struct callback_head task_work; struct hlist_node hash_node; struct async_poll apoll; }; struct io_wq_work work; }; When task_work_run has multiple work to execute, the work that calls io_poll_remove_all() will do req->work restore for non-poll request always, but indeed if a non-poll request has been added to a new callback_head, subsequent callback will call io_async_task_func() to handle this request, that means we should not do the restore work for such non-poll request. Meanwhile in io_async_task_func(), we should drop submit ref when req has been canceled. Fix both issues. Fixes: b1f573bd15fd ("io_uring: restore req->work when canceling poll request") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Use io_double_put_req() Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-19	Merge tag 'timers-urgent-2020-04-19' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time namespace fix from Thomas Gleixner: "An update for the proc interface of time namespaces: Use symbolic names instead of clockid numbers. The usability nuisance of numbers was noticed by Michael when polishing the man page" * tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
2020-04-19	Merge tag 'ext4_for_linus_stable' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous bug fixes and cleanups for ext4, including a fix for generic/388 in data=journal mode, removing some BUG_ON's, and cleaning up some compiler warnings" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: convert BUG_ON's to WARN_ON's in mballoc.c ext4: increase wait time needed before reuse of deleted inode numbers ext4: remove set but not used variable 'es' in ext4_jbd2.c ext4: remove set but not used variable 'es' ext4: do not zeroout extents beyond i_disksize ext4: fix return-value types in several function comments ext4: use non-movable memory for superblock readahead ext4: use matching invalidatepage in ext4_writepage
2020-04-19	Merge tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6	Linus Torvalds
	Pull cifs fixes from Steve French: "Three small smb3 fixes: two debug related (helping network tracing for SMB2 mounts, and the other removing an unintended debug line on signing failures), and one fixing a performance problem with 64K pages" * tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: remove overly noisy debug line in signing errors cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+ cifs: dump the session id and keys also for SMB2 sessions
2020-04-18	Merge tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux	Linus Torvalds
	Pull xfs fixes from Darrick Wong: "The three commits here fix some livelocks and other clashes with fsfreeze, a potential corruption problem, and a minor race between processes freeing and allocating space when the filesystem is near ENOSPC. Summary: - Fix a partially uninitialized variable. - Teach the background gc threads to apply for fsfreeze protection. - Fix some scaling problems when multiple threads try to flush the filesystem when we're about to hit ENOSPC" * tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: move inode flush to the sync workqueue xfs: fix partially uninitialized structure in xfs_reflink_remap_extent xfs: acquire superblock freeze protection on eofblocks scans
2020-04-17	buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.	Zhiqiang Liu
	free_more_memory func has been completely removed in commit bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") So comment and `WB_REASON_FREE_MORE_MEM` reason about free_more_memory are no longer needed. Fixes: bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-17	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc fix from Eric Biederman: "While running syzbot happened to spot one more oversight in my rework of proc_flush_task. The fields proc_self and proc_thread_self were not being reinitialized when proc was unmounted, which could cause problems if the mount of proc fails" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Handle umounts cleanly
2020-04-17	Merge tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull io_uring fixes from Jens Axboe: - wrap up the init/setup cleanup (Pavel) - fix some issues around deferral sequences (Pavel) - fix splice punt check using the wrong struct file member - apply poll re-arm logic for pollable retry too - pollable retry should honor cancelation - fix setup time error handling syzbot reported crash - restore work state when poll is canceled * tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block: io_uring: don't count rqs failed after current one io_uring: kill already cached timeout.seq_offset io_uring: fix cached_sq_head in io_timeout() io_uring: only post events in io_poll_remove_all() if we completed some io_uring: io_async_task_func() should check and honor cancelation io_uring: check for need to re-wait in polled async handling io_uring: correct O_NONBLOCK check for splice punt io_uring: restore req->work when canceling poll request io_uring: move all request init code in one place io_uring: keep all sqe->flags in req->flags io_uring: early submission req fail code io_uring: track mm through current->mm io_uring: remove obsolete @mm_fault
2020-04-17	Merge tag 'for-5.7-rc1-tag' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A regression fix for a warning caused by running balance and snapshot creation in parallel" * tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix setting last_trans for reloc roots
2020-04-17	SUNRPC: Fix backchannel RPC soft lockups	Chuck Lever
	Currently, after the forward channel connection goes away, backchannel operations are causing soft lockups on the server because call_transmit_status's SOFTCONN logic ignores ENOTCONN. Such backchannel Calls are aggressively retried until the client reconnects. Backchannel Calls should use RPC_TASK_NOCONNECT rather than RPC_TASK_SOFTCONN. If there is no forward connection, the server is not capable of establishing a connection back to the client, thus that backchannel request should fail before the server attempts to send it. Commit 58255a4e3ce5 ("NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN") was merged several years before RPC_TASK_NOCONNECT was available. Because setup_callback_client() explicitly sets NOPING, the NFSv4.0 callback connection depends on the first callback RPC to initiate a connection to the client. Thus NFSv4.0 needs to continue to use RPC_TASK_SOFTCONN. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # v4.20+
2020-04-17	debugfs: remove return value of debugfs_create_u32()	Greg Kroah-Hartman
	No one checks the return value of debugfs_create_u32(), as it's not needed, so make the return value void, so that no one tries to do so in the future. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20200416145448.GA1380878@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17	btrfs: fix setting last_trans for reloc roots	Josef Bacik
	I made a mistake with my previous fix, I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents. The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again. This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans didn't == the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root. Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel: BTRFS info (device sdc): relocating block group 30408704 flags metadata\|dup ------------[ cut here ]------------ WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs] CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs] RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202 RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48 RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818 RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002 R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818 R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00 FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? create_reloc_root+0x49/0x2b0 [btrfs] ? kmem_cache_alloc_trace+0xe5/0x200 create_reloc_root+0x8b/0x2b0 [btrfs] btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs] create_pending_snapshot+0x610/0x1010 [btrfs] create_pending_snapshots+0xa8/0xd0 [btrfs] btrfs_commit_transaction+0x4c7/0xc50 [btrfs] ? btrfs_mksubvol+0x3cd/0x560 [btrfs] btrfs_mksubvol+0x455/0x560 [btrfs] __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs] btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs] ? mem_cgroup_commit_charge+0x6e/0x540 btrfs_ioctl+0x12d8/0x3760 [btrfs] ? do_raw_spin_unlock+0x49/0xc0 ? _raw_spin_unlock+0x29/0x40 ? __handle_mm_fault+0x11b3/0x14b0 ? ksys_ioctl+0x92/0xb0 ksys_ioctl+0x92/0xb0 ? trace_hardirqs_off_thunk+0x1a/0x1c __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fdeabd3bdd7 Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-16	Merge tag 'nfs-for-5.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs	Linus Torvalds
	Pull NFS client bugfix from Trond Myklebust: "Fix an ABBA spinlock issue in pnfs_update_layout()" * tag 'nfs-for-5.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix an ABBA spinlock issue in pnfs_update_layout()
2020-04-16	Merge tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client	Linus Torvalds
	Pull ceph fixes from Ilya Dryomov: - a set of patches for a deadlock on "rbd map" error path - a fix for invalid pointer dereference and uninitialized variable use on asynchronous create and unlink error paths. * tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client: ceph: fix potential bad pointer deref in async dirops cb's rbd: don't mess with a page vector in rbd_notify_op_lock() rbd: don't test rbd_dev->opts in rbd_dev_image_release() rbd: call rbd_dev_unprobe() after unwatching and flushing notifies rbd: avoid a deadlock on header_rwsem when flushing notifies
2020-04-16	smb3: remove overly noisy debug line in signing errors	Steve French
	A dump_stack call for signature related errors can be too noisy and not of much value in debugging such problems. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
2020-04-16	xfs: move inode flush to the sync workqueue	Darrick J. Wong
	Move the inode dirty data flushing to a workqueue so that multiple threads can take advantage of a single thread's flushing work. The ratelimiting technique used in bdd4ee4 was not successful, because threads that skipped the inode flush scan due to ratelimiting would ENOSPC early, which caused occasional (but noticeable) changes in behavior and sporadic fstest regressions. Therefore, make all the writer threads wait on a single inode flush, which eliminates both the stampeding hordes of flushers and the small window in which a write could fail with ENOSPC because it lost the ratelimit race after even another thread freed space. Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>