summaryrefslogtreecommitdiff
path: root/fs/fuse/inode.c
AgeCommit message (Collapse)Author
2024-11-18fuse: check attributes staleness on fuse_iget()Zhang Tianci
Function fuse_direntplus_link() might call fuse_iget() to initialize a new fuse_inode and change its attributes. If fi->attr_version is always initialized with 0, even if the attributes returned by the FUSE_READDIR request is staled, as the new fi->attr_version is 0, fuse_change_attributes will still set the staled attributes to inode. This wrong behaviour may cause file size inconsistency even when there is no changes from server-side. To reproduce the issue, consider the following 2 programs (A and B) are running concurrently, A B ---------------------------------- -------------------------------- { /fusemnt/dir/f is a file path in a fuse mount, the size of f is 0. } readdir(/fusemnt/dir) start //Daemon set size 0 to f direntry fallocate(f, 1024) stat(f) // B see size 1024 echo 2 > /proc/sys/vm/drop_caches readdir(/fusemnt/dir) reply to kernel Kernel set 0 to the I_NEW inode stat(f) // B see size 0 In the above case, only program B is modifying the file size, however, B observes file size changing between the 2 'readonly' stat() calls. To fix this issue, we should make sure readdirplus still follows the rule of attr_version staleness checking even if the fi->attr_version is lost due to inode eviction. To identify this situation, the new fc->evict_ctr is used to record whether the eviction of inodes occurs during the readdirplus request processing. If it does, the result of readdirplus may be inaccurate; otherwise, the result of readdirplus can be trusted. Although this may still lead to incorrect invalidation, considering the relatively low frequency of evict occurrences, it should be acceptable. Link: https://lore.kernel.org/lkml/20230711043405.66256-2-zhangjiachen.jaycee@bytedance.com/ Link: https://lore.kernel.org/lkml/20241114070905.48901-1-zhangtianci.1997@bytedance.com/ Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-25fuse: enable dynamic configuration of fuse max pages limit (FUSE_MAX_MAX_PAGES)Joanne Koong
Introduce the capability to dynamically configure the max pages limit (FUSE_MAX_MAX_PAGES) through a sysctl. This allows system administrators to dynamically set the maximum number of pages that can be used for servicing requests in fuse. Previously, this is gated by FUSE_MAX_MAX_PAGES which is statically set to 256 pages. One result of this is that the buffer size for a write request is limited to 1 MiB on a 4k-page system. The default value for this sysctl is the original limit (256 pages). $ sysctl -a | grep max_pages_limit fs.fuse.max_pages_limit = 256 $ sysctl -n fs.fuse.max_pages_limit 256 $ echo 1024 | sudo tee /proc/sys/fs/fuse/max_pages_limit 1024 $ sysctl -n fs.fuse.max_pages_limit 1024 $ echo 65536 | sudo tee /proc/sys/fs/fuse/max_pages_limit tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument $ echo 0 | sudo tee /proc/sys/fs/fuse/max_pages_limit tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument $ echo 65535 | sudo tee /proc/sys/fs/fuse/max_pages_limit 65535 $ sysctl -n fs.fuse.max_pages_limit 65535 Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-23fs/fuse: introduce and use fuse_simple_idmap_request() helperAlexander Mikhalitsyn
Let's convert all existing callers properly. No functional changes intended. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-04fuse: allow idmapped mountsAlexander Mikhalitsyn
Now we have everything in place and we can allow idmapped mounts by setting the FS_ALLOW_IDMAP flag. Notice that real availability of idmapped mounts will depend on the fuse daemon. Fuse daemon have to set FUSE_ALLOW_IDMAP flag in the FUSE_INIT reply. To discuss: - we enable idmapped mounts support only if "default_permissions" mode is enabled, because otherwise we would need to deal with UID/GID mappings in the userspace side OR provide the userspace with idmapped req->in.h.uid/req->in.h.gid values which is not something that we probably want to. Idmapped mounts philosophy is not about faking caller uid/gid. Some extra links and examples: - libfuse support https://github.com/mihalicyn/libfuse/commits/idmap_support - fuse-overlayfs support: https://github.com/mihalicyn/fuse-overlayfs/commits/idmap_support - cephfs-fuse conversion example https://github.com/mihalicyn/ceph/commits/fuse_idmap - glusterfs conversion example https://github.com/mihalicyn/glusterfs/commits/fuse_idmap Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-04fuse: add an idmap argument to fuse_simple_requestAlexander Mikhalitsyn
If idmap == NULL *and* filesystem daemon declared idmapped mounts support, then uid/gid values in a fuse header will be -1. No functional changes intended. Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-04fuse: add basic infrastructure to support idmappingsAlexander Mikhalitsyn
Add some preparational changes in fuse_get_req/fuse_force_creds to handle idmappings. Miklos suggested [1], [2] to change the meaning of in.h.uid/in.h.gid fields when daemon declares support for idmapped mounts. In a new semantic, we fill uid/gid values in fuse header with a id-mapped caller uid/gid (for requests which create new inodes), for all the rest cases we just send -1 to userspace. No functional changes intended. Link: https://lore.kernel.org/all/CAJfpegsVY97_5mHSc06mSw79FehFWtoXT=hhTUK_E-Yhr7OAuQ@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAJfpegtHQsEUuFq1k4ZbTD3E1h-GsrN3PWyv7X8cg6sfU_W2Yw@mail.gmail.com/ [2] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-08-29fuse: disable the combination of passthrough and writeback cacheBernd Schubert
Current design and handling of passthrough is without fuse caching and with that FUSE_WRITEBACK_CACHE is conflicting. Fixes: 7dc4e97a4f9a ("fuse: introduce FUSE_PASSTHROUGH capability") Cc: stable@kernel.org # v6.9 Signed-off-by: Bernd Schubert <bschubert@ddn.com> Acked-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-07-03fuse: Convert to new uid/gid option parsing helpersEric Sandeen
Convert to new uid/gid option parsing helpers Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/4e1a4efa-4ca5-4358-acee-40efd07c3c44@redhat.com Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03fuse: verify {g,u}id mount options correctlyEric Sandeen
As was done in 0200679fc795 ("tmpfs: verify {g,u}id mount options correctly") we need to validate that the requested uid and/or gid is representable in the filesystem's idmapping. Cribbing from the above commit log, The contract for {g,u}id mount options and {g,u}id values in general set from userspace has always been that they are translated according to the caller's idmapping. In so far, fuse has been doing the correct thing. But since fuse is mountable in unprivileged contexts it is also necessary to verify that the resulting {k,g}uid is representable in the namespace of the superblock. Fixes: c30da2e981a7 ("fuse: convert to use the new mount API") Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/8f07d45d-c806-484d-a2e3-7a2199df1cd2@redhat.com Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-15fuse: fix wrong ff->iomode state changes from parallel dio writeAmir Goldstein
There is a confusion with fuse_file_uncached_io_{start,end} interface. These helpers do two things when called from passthrough open()/release(): 1. Take/drop negative refcount of fi->iocachectr (inode uncached io mode) 2. State change ff->iomode IOM_NONE <-> IOM_UNCACHED (file uncached open) The calls from parallel dio write path need to take a reference on fi->iocachectr, but they should not be changing ff->iomode state, because in this case, the fi->iocachectr reference does not stick around until file release(). Factor out helpers fuse_inode_uncached_io_{start,end}, to be used from parallel dio write path and rename fuse_file_*cached_io_{start,end} helpers to fuse_file_*cached_io_{open,release} to clarify the difference. Fixes: 205c1d802683 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-15Merge tag 'fuse-update-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add passthrough mode for regular file I/O. This allows performing read and write (also via memory maps) on a backing file without incurring the overhead of roundtrips to userspace. For now this is only allowed to privileged servers, but this limitation will go away in the future (Amir Goldstein) - Fix interaction of direct I/O mode with memory maps (Bernd Schubert) - Export filesystem tags through sysfs for virtiofs (Stefan Hajnoczi) - Allow resending queued requests for server crash recovery (Zhao Chen) - Misc fixes and cleanups * tag 'fuse-update-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (38 commits) fuse: get rid of ff->readdir.lock fuse: remove unneeded lock which protecting update of congestion_threshold fuse: Fix missing FOLL_PIN for direct-io fuse: remove an unnecessary if statement fuse: Track process write operations in both direct and writethrough modes fuse: Use the high bit of request ID for indicating resend requests fuse: Introduce a new notification type for resend pending requests fuse: add support for explicit export disabling fuse: __kuid_val/__kgid_val helpers in fuse_fill_attr_from_inode() fuse: fix typo for fuse_permission comment fuse: Convert fuse_writepage_locked to take a folio fuse: Remove fuse_writepage virtio_fs: remove duplicate check if queue is broken fuse: use FUSE_ROOT_ID in fuse_get_root_inode() fuse: don't unhash root fuse: fix root lookup with nonzero generation fuse: replace remaining make_bad_inode() with fuse_make_bad() virtiofs: drop __exit from virtio_fs_sysfs_exit() fuse: implement passthrough for mmap fuse: implement splice read/write passthrough ...
2024-03-06fuse: Use the high bit of request ID for indicating resend requestsZhao Chen
Some FUSE daemons want to know if the received request is a resend request. The high bit of the fuse request ID is utilized for indicating this, enabling the receiver to perform appropriate handling. The init flag "FUSE_HAS_RESEND" is added to indicate this feature. Signed-off-by: Zhao Chen <winters.zc@antgroup.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: add support for explicit export disablingJingbo Xu
open_by_handle_at(2) can fail with -ESTALE with a valid handle returned by a previous name_to_handle_at(2) for evicted fuse inodes, which is especially common when entry_valid_timeout is 0, e.g. when the fuse daemon is in "cache=none" mode. The time sequence is like: name_to_handle_at(2) # succeed evict fuse inode open_by_handle_at(2) # fail The root cause is that, with 0 entry_valid_timeout, the dput() called in name_to_handle_at(2) will trigger iput -> evict(), which will send FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send a new FUSE_LOOKUP request upon inode cache miss since the previous inode eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with -ENOENT as the cached metadata of the requested inode has already been cleaned up during the previous FUSE_FORGET. The returned -ENOENT is treated as -ESTALE when open_by_handle_at(2) returns. This confuses the application somehow, as open_by_handle_at(2) fails when the previous name_to_handle_at(2) succeeds. The returned errno is also confusing as the requested file is not deleted and already there. It is reasonable to fail name_to_handle_at(2) early in this case, after which the application can fallback to open(2) to access files. Since this issue typically appears when entry_valid_timeout is 0 which is configured by the fuse daemon, the fuse daemon is the right person to explicitly disable the export when required. Also considering FUSE_EXPORT_SUPPORT actually indicates the support for lookups of "." and "..", and there are existing fuse daemons supporting export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new INIT flag for such purpose. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-06fuse: __kuid_val/__kgid_val helpers in fuse_fill_attr_from_inode()Alexander Mikhalitsyn
For the sake of consistency, let's use these helpers to extract {u,g}id_t values from k{u,g}id_t ones. There are no functional changes, just to make code cleaner. Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: use FUSE_ROOT_ID in fuse_get_root_inode()Miklos Szeredi
...when calling fuse_iget(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: don't unhash rootMiklos Szeredi
The root inode is assumed to be always hashed. Do not unhash the root inode even if it is marked BAD. Fixes: 5d069dbe8aaf ("fuse: fix bad inode") Cc: <stable@vger.kernel.org> # v5.11 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-03-05fuse: implement ioctls to manage backing filesAmir Goldstein
FUSE server calls the FUSE_DEV_IOC_BACKING_OPEN ioctl with a backing file descriptor. If the call succeeds, a backing file identifier is returned. A later change will be using this backing file id in a reply to OPEN request with the flag FOPEN_PASSTHROUGH to setup passthrough of file operations on the open FUSE file to the backing file. The FUSE server should call FUSE_DEV_IOC_BACKING_CLOSE ioctl to close the backing file by its id. This can be done at any time, but if an open reply with FOPEN_PASSTHROUGH flag is still in progress, the open may fail if the backing file is closed before the fuse file was opened. Setting up backing files requires a server with CAP_SYS_ADMIN privileges. For the backing file to be successfully setup, the backing file must implement both read_iter and write_iter file operations. The limitation on the level of filesystem stacking allowed for the backing file is enforced before setting up the backing file. Signed-off-by: Alessio Balsini <balsini@android.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-25fuse: fix UAF in rcu pathwalksAl Viro
->permission(), ->get_link() and ->inode_get_acl() might dereference ->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns as well) when called from rcu pathwalk. Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info and dropping ->user_ns rcu-delayed too. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-23fuse: introduce FUSE_PASSTHROUGH capabilityAmir Goldstein
FUSE_PASSTHROUGH capability to passthrough FUSE operations to backing files will be made available with kernel config CONFIG_FUSE_PASSTHROUGH. When requesting FUSE_PASSTHROUGH, userspace needs to specify the max_stack_depth that is allowed for FUSE on top of backing files. Introduce the flag FOPEN_PASSTHROUGH and backing_id to fuse_open_out argument that can be used when replying to OPEN request, to setup passthrough of io operations on the fuse inode to a backing file. Introduce a refcounted fuse_backing object that will be used to associate an open backing file with a fuse inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-12-04fuse: share lookup state between submount and its parentKrister Johansen
Fuse submounts do not perform a lookup for the nodeid that they inherit from their parent. Instead, the code decrements the nlookup on the submount's fuse_inode when it is instantiated, and no forget is performed when a submount root is evicted. Trouble arises when the submount's parent is evicted despite the submount itself being in use. In this author's case, the submount was in a container and deatched from the initial mount namespace via a MNT_DEATCH operation. When memory pressure triggered the shrinker, the inode from the parent was evicted, which triggered enough forgets to render the submount's nodeid invalid. Since submounts should still function, even if their parent goes away, solve this problem by sharing refcounted state between the parent and its submount. When all of the references on this shared state reach zero, it's safe to forget the final lookup of the fuse nodeid. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Cc: stable@vger.kernel.org Fixes: 1866d779d5d2 ("fuse: Allow fuse_fill_super_common() for submounts") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-12-04fuse: Rename DIRECT_IO_RELAX to DIRECT_IO_ALLOW_MMAPTyler Fanelli
Although DIRECT_IO_RELAX's initial usage is to allow shared mmap, its description indicates a purpose of reducing memory footprint. This may imply that it could be further used to relax other DIRECT_IO operations in the future. Replace it with a flag DIRECT_IO_ALLOW_MMAP which does only one thing, allow shared mmap of DIRECT_IO files while still bypassing the cache on regular reads and writes. [Miklos] Also Keep DIRECT_IO_RELAX definition for backward compatibility. Signed-off-by: Tyler Fanelli <tfanelli@redhat.com> Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode") Cc: <stable@vger.kernel.org> # v6.6 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-11-07Merge tag 'vfs-6.7.fsid' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fanotify fsid updates from Christian Brauner: "This work is part of the plan to enable fanotify to serve as a drop-in replacement for inotify. While inotify is availabe on all filesystems, fanotify currently isn't. In order to support fanotify on all filesystems two things are needed: (1) all filesystems need to support AT_HANDLE_FID (2) all filesystems need to report a non-zero f_fsid This contains (1) and allows filesystems to encode non-decodable file handlers for fanotify without implementing any exportfs operations by encoding a file id of type FILEID_INO64_GEN from i_ino and i_generation. Filesystems that want to opt out of encoding non-decodable file ids for fanotify that don't support NFS export can do so by providing an empty export_operations struct. This also partially addresses (2) by generating f_fsid for simple filesystems as well as freevxfs. Remaining filesystems will be dealt with by separate patches. Finally, this contains the patch from the current exportfs maintainers which moves exportfs under vfs with Chuck, Jeff, and Amir as maintainers and vfs.git as tree" * tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: MAINTAINERS: create an entry for exportfs fs: fix build error with CONFIG_EXPORTFS=m or not defined freevxfs: derive f_fsid from bdev->bd_dev fs: report f_fsid from s_dev for "simple" filesystems exportfs: support encoding non-decodeable file handles by default exportfs: define FILEID_INO64_GEN* file handle types exportfs: make ->encode_fh() a mandatory method for NFS export exportfs: add helpers to check if filesystem can encode/decode file handles
2023-10-28exportfs: define FILEID_INO64_GEN* file handle typesAmir Goldstein
Similar to the common FILEID_INO32* file handle types, define common FILEID_INO64* file handle types. The type values of FILEID_INO64_GEN and FILEID_INO64_GEN_PARENT are the values returned by fuse and xfs for 64bit ino encoded file handle types. Note that these type value are filesystem specific and they do not define a universal file handle format, for example: fuse encodes FILEID_INO64_GEN as [ino-hi32,ino-lo32,gen] and xfs encodes FILEID_INO64_GEN as [hostr-order-ino64,gen] (a.k.a xfs_fid64). The FILEID_INO64_GEN fhandle type is going to be used for file ids for fanotify from filesystems that do not support NFS export. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231023180801.2953446-4-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18fuse: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-37-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-05Merge tag 'fuse-update-6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Revert non-waiting FLUSH due to a regression - Fix a lookup counter leak in readdirplus - Add an option to allow shared mmaps in no-cache mode - Add btime support and statx intrastructure to the protocol - Invalidate positive/negative dentry on failed create/delete * tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: conditionally fill kstat in fuse_do_statx() fuse: invalidate dentry on EEXIST creates or ENOENT deletes fuse: cache btime fuse: implement statx fuse: add ATTR_TIMEOUT macro fuse: add STATX request fuse: handle empty request_mask in statx fuse: write back dirty pages before direct write in direct_io_relax mode fuse: add a new fuse init flag to relax restrictions in no cache mode fuse: invalidate page cache pages before direct write fuse: nlookup missing decrement in fuse_direntplus_link Revert "fuse: in fuse_flush only wait if someone wants the return code"
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-21fuse: cache btimeMiklos Szeredi
Not all inode attributes are supported by all filesystems, but for the basic stats (which are returned by stat(2) and friends) all of them will have some value, even if that doesn't reflect a real attribute of the file. Btime is different, in that filesystems are free to report or not report a value in statx. If the value is available, then STATX_BTIME bit is set in stx_mask. When caching the value of btime, remember the availability of the attribute as well as the value (if available). This is done by using the FUSE_I_BTIME bit in fuse_inode->state to indicate availability, while using fuse_inode->inval_mask & STATX_BTIME to indicate the state of the cache itself (i.e. set if cache is invalid, and cleared if cache is valid). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-21fuse: implement statxMiklos Szeredi
Allow querying btime. When btime is requested in mask, then FUSE_STATX request is sent. Otherwise keep using FUSE_GETATTR. The userspace interface for statx matches that of the statx(2) API. However there are limitations on how this interface is used: - returned basic stats and btime are used, stx_attributes, etc. are ignored - always query basic stats and btime, regardless of what was requested - requested sync type is ignored, the default is passed to the server - if server returns with some attributes missing from the result_mask, then no attributes will be cached - btime is not cached yet (next patch will fix that) For new inodes initialize fi->inval_mask to "all invalid", instead of "all valid" as previously. Also only clear basic stats from inval_mask when caching attributes. This will result in the caching logic not thinking that btime is cached. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-16fuse: add a new fuse init flag to relax restrictions in no cache modeHao Xu
FOPEN_DIRECT_IO is usually set by fuse daemon to indicate need of strong coherency, e.g. network filesystems. Thus shared mmap is disabled since it leverages page cache and may write to it, which may cause inconsistence. But FOPEN_DIRECT_IO can be used not for coherency but to reduce memory footprint as well, e.g. reduce guest memory usage with virtiofs. Therefore, add a new fuse init flag FUSE_DIRECT_IO_RELAX to relax restrictions in that mode, currently, it allows shared mmap. One thing to note is to make sure it doesn't break coherency in your use case. Signed-off-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-07-24fuse: convert to ctime accessor functionsJeff Layton
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-44-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-07fuse: Apply flags2 only when userspace set the FUSE_INIT_EXTBernd Schubert
This is just a safety precaution to avoid checking flags on memory that was initialized on the user space side. libfuse zeroes struct fuse_init_out outarg, but this is not guranteed to be done in all implementations. Better is to act on flags and to only apply flags2 when FUSE_INIT_EXT is set. There is a risk with this change, though - it might break existing user space libraries, which are already using flags2 without setting FUSE_INIT_EXT. The corresponding libfuse patch is here https://github.com/libfuse/libfuse/pull/662 Signed-off-by: Bernd Schubert <bschubert@ddn.com> Fixes: 53db28933e95 ("fuse: extend init flags") Cc: <stable@vger.kernel.org> # v5.17 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07fuse: add feature flag for expire-onlyMiklos Szeredi
Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of FUSE_NOTIFY_INVAL_ENTRY is effective. This is needed for backports of this feature, otherwise the server could just check the protocol version. Fixes: 4f8d37020e1f ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY") Cc: <stable@vger.kernel.org> # v6.2 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-02-27Merge tag 'fuse-update-6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix regression in fileattr permission checking - Fix possible hang during PID namespace destruction - Add generic support for request extensions - Add supplementary group list extension - Add limited support for supplying supplementary groups in create requests - Documentation fixes * tag 'fuse-update-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add inode/permission checks to fileattr_get/fileattr_set fuse: fix all W=1 kernel-doc warnings fuse: in fuse_flush only wait if someone wants the return code fuse: optional supplementary group in create requests fuse: add request extension
2023-01-26fuse: optional supplementary group in create requestsMiklos Szeredi
Permission to create an object (create, mkdir, symlink, mknod) needs to take supplementary groups into account. Add a supplementary group request extension. This can contain an arbitrary number of group IDs and can be added to any request. This extension is not added to any request by default. Add FUSE_CREATE_SUPP_GROUP init flag to enable supplementary group info in creation requests. This adds just a single supplementary group that matches the parent group in the case described above. In other cases the extension is not added. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-24fuse: fixes after adapting to new posix acl apiChristian Brauner
This cycle we ported all filesystems to the new posix acl api. While looking at further simplifications in this area to remove the last remnants of the generic dummy posix acl handlers we realized that we regressed fuse daemons that don't set FUSE_POSIX_ACL but still make use of posix acls. With the change to a dedicated posix acl api interacting with posix acls doesn't go through the old xattr codepaths anymore and instead only relies the get acl and set acl inode operations. Before this change fuse daemons that don't set FUSE_POSIX_ACL were able to get and set posix acl albeit with two caveats. First, that posix acls aren't cached. And second, that they aren't used for permission checking in the vfs. We regressed that use-case as we currently refuse to retrieve any posix acls if they aren't enabled via FUSE_POSIX_ACL. So older fuse daemons would see a change in behavior. We can restore the old behavior in multiple ways. We could change the new posix acl api and look for a dedicated xattr handler and if we find one prefer that over the dedicated posix acl api. That would break the consistency of the new posix acl api so we would very much prefer not to do that. We could introduce a new ACL_*_CACHE sentinel that would instruct the vfs permission checking codepath to not call into the filesystem and ignore acls. But a more straightforward fix for v6.2 is to do the same thing that Overlayfs does and give fuse a separate get acl method for permission checking. Overlayfs uses this to express different needs for vfs permission lookup and acl based retrieval via the regular system call path as well. Let fuse do the same for now. This way fuse can continue to refuse to retrieve posix acls for daemons that don't set FUSE_POSXI_ACL for permission checking while allowing a fuse server to retrieve it via the usual system calls. In the future, we could extend the get acl inode operation to not just pass a simple boolean to indicate rcu lookup but instead make it a flag argument. Then in addition to passing the information that this is an rcu lookup to the filesystem we could also introduce a flag that tells the filesystem that this is a request from the vfs to use these acls for permission checking. Then fuse could refuse the get acl request for permission checking when the daemon doesn't have FUSE_POSIX_ACL set in the same get acl method. This would also help Overlayfs and allow us to remove the second method for it as well. But since that change is more invasive as we need to update the get acl inode operation for multiple filesystems we should not do this as a fix for v6.2. Instead we will do this for the v6.3 merge window. Fwiw, since posix acls are now always correctly translated in the new posix acl api we could also allow them to be used for daemons without FUSE_POSIX_ACL that are not mounted on the host. But this is behavioral change and again if dones should be done for v6.3. For now, let's just restore the original behavior. A nice side-effect of this change is that for fuse daemons with and without FUSE_POSIX_ACL the same code is used for posix acls in a backwards compatible way. This also means we can remove the legacy xattr handlers completely. We've also added comments to explain the expected behavior for daemons without FUSE_POSIX_ACL into the code. Fixes: 318e66856dde ("xattr: use posix acl api") Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-27fuse: retire block-device-based superblock on force unmountDaniil Lunev
Force unmount of FUSE severes the connection with the user space, even if there are still open files. Subsequent remount tries to re-use the superblock held by the open files, which is meaningless in the FUSE case after disconnect - reused super block doesn't have userspace counterpart attached to it and is incapable of doing any IO. This patch adds the functionality only for the block-device-based supers, since the primary use case of the feature is to gracefully handle force unmount of external devices, mounted with FUSE. This can be further extended to cover all superblocks, if the need arises. Signed-off-by: Daniil Lunev <dlunev@chromium.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21fuse: limit nsecMiklos Szeredi
Limit nanoseconds to 0..999999999. Fixes: d8a5ba45457e ("[PATCH] FUSE - core") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-21fuse: move FUSE_SUPER_MAGIC definition to magic.hJeff Layton
...to help userland apps that need to identify FUSE mounts. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14fuse: mark inode DONT_CACHE when per inode DAX hint changesJeffle Xu
When the per inode DAX hint changes while the file is still *opened*, it is quite complicated and maybe fragile to dynamically change the DAX state. Hence mark the inode and corresponding dentries as DONE_CACHE once the per inode DAX hint changes, so that the inode instance will be evicted and freed as soon as possible once the file is closed and the last reference to the inode is put. And then when the file gets reopened next time, the new instantiated inode will reflect the new DAX state. In summary, when the per inode DAX hint changes for an *opened* file, the DAX state of the file won't be updated until this file is closed and reopened later. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14fuse: negotiate per inode DAX in FUSE_INITJeffle Xu
Among the FUSE_INIT phase, client shall advertise per inode DAX if it's mounted with "dax=inode". Then server is aware that client is in per inode DAX mode, and will construct per-inode DAX attribute accordingly. Server shall also advertise support for per inode DAX. If server doesn't support it while client is mounted with "dax=inode", client will silently fallback to "dax=never" since "dax=inode" is advisory only. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14fuse: enable per inode DAXJeffle Xu
DAX may be limited in some specific situation. When the number of usable DAX windows is under watermark, the recalim routine will be triggered to reclaim some DAX windows. It may have a negative impact on the performance, since some processes may need to wait for DAX windows to be recalimed and reused then. To mitigate the performance degradation, the overall DAX window need to be expanded larger. However, simply expanding the DAX window may not be a good deal in some scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB (512 * 64 bytes) memory footprint will be consumed for page descriptors inside guest, which is greater than the memory footprint if it uses guest page cache when DAX disabled. Thus it'd better disable DAX for those files smaller than 32KB, to reduce the demand for DAX window and thus avoid the unworthy memory overhead. Per inode DAX feature is introduced to address this issue, by offering a finer grained control for dax to users, trying to achieve a balance between performance and memory overhead. The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether DAX should be enabled or not for corresponding file. Currently the state whether DAX is enabled or not for the file is initialized only when inode is instantiated. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14fuse: make DAX mount option a tri-stateJeffle Xu
We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same which is equivalent to 'always'. The following behavior is consistent with that on ext4/xfs: - The default behavior (when neither '-o dax' nor '-o dax=always|never|inode' option is specified) is equal to 'inode' mode, while 'dax=inode' won't be printed among the mount option list. - The 'inode' mode is only advisory. It will silently fallback to 'never' mode if fuse server doesn't support that. Also noted that by the time of this commit, 'inode' mode is actually equal to 'always' mode, before the per inode DAX flag is introduced in the following patch. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25fuse: send security context of inode on fileVivek Goyal
When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25fuse: extend init flagsMiklos Szeredi
FUSE_INIT flags are close to running out, so add another 32bits worth of space. Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in. If this flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi field is allocated to contain the high 32bits of the flags. A flags_hi field is also added to fuse_init_out, allocated out of the remaining unused fields. Known userspace implementations of the fuse protocol have been checked to accept the extended FUSE_INIT request, but this might cause problems with other implementations. If that happens to be the case, the protocol negotiation will have to be extended with an extra initialization request roundtrip. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: add cache_maskMiklos Szeredi
If writeback_cache is enabled, then the size, mtime and ctime attributes of regular files are always valid in the kernel's cache. They are retrieved from userspace only when the inode is freshly looked up. Add a more generic "cache_mask", that indicates which attributes are currently valid in cache. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: move reverting attributes to fuse_change_attributes()Miklos Szeredi
In case of writeback_cache fuse_fillattr() would revert the queried attributes to the cached version. Move this to fuse_change_attributes() in order to manage the writeback logic in a central helper. This will be necessary for patches that follow. Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling fuse_change_attributes(), so this should not change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: simplify local variables holding writeback cache stateMiklos Szeredi
There are two instances of "bool is_wb = fc->writeback_cache" where the actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)". Clean up these cases by storing the second condition in the local variable. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22fuse: make sure reclaim doesn't write the inodeMiklos Szeredi
In writeback cache mode mtime/ctime updates are cached, and flushed to the server using the ->write_inode() callback. Closing the file will result in a dirty inode being immediately written, but in other cases the inode can remain dirty after all references are dropped. This result in the inode being written back from reclaim, which can deadlock on a regular allocation while the request is being served. The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because serving a request involves unrelated userspace process(es). Instead do the same as for dirty pages: make sure the inode is written before the last reference is gone. - fallocate(2)/copy_file_range(2): these call file_update_time() or file_modified(), so flush the inode before returning from the call - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so flush the ctime directly from this helper Reported-by: chenguanyou <chenguanyou@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: clean up error exits in fuse_fill_super()Miklos Szeredi
Instead of "goto err", return error directly, since there's no error cleanup to do now. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>