summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx5/core
AgeCommit message (Collapse)Author
2024-12-04net/mlx5e: Remove workaround to avoid syndrome for internal portJianbo Liu
Previously a workaround was added to avoid syndrome 0xcdb051. It is triggered when offload a rule with tunnel encapsulation, and forwarding to another table, but not matching on the internal port in firmware steering mode. The original workaround skips internal tunnel port logic, which is not correct as not all cases are considered. As an example, if vlan is configured on the uplink port, traffic can't pass because vlan header is not added with this workaround. Besides, there is no such issue for software steering. So, this patch removes that, and returns error directly if trying to offload such rule for firmware steering. Fixes: 06b4eac9c4be ("net/mlx5e: Don't offload internal port if filter device is out device") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Tested-by: Frode Nordahl <frode.nordahl@canonical.com> Reviewed-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04net/mlx5e: SD, Use correct mdev to build channel paramTariq Toukan
In a multi-PF netdev, each traffic channel creates its own resources against a specific PF. In the cited commit, where this support was added, the channel_param logic was mistakenly kept unchanged, so it always used the primary PF which is found at priv->mdev. In this patch we fix this by moving the logic to be per-channel, and passing the correct mdev instance. This bug happened to be usually harmless, as the resulting cparam structures would be the same for all channels, due to identical FW logic and decisions. However, in some use cases, like fwreset, this gets broken. This could lead to different symptoms. Example: Error cqe on cqn 0x428, ci 0x0, qn 0x10a9, opcode 0xe, syndrome 0x4, vendor syndrome 0x32 Fixes: e4f9686bdee7 ("net/mlx5e: Let channels be SD-aware") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04net/mlx5: E-Switch, Fix switching to switchdev mode in MPVPatrisious Haddad
Fix the mentioned commit change for MPV mode, since in MPV mode the IB device is shared between different core devices, so under this change when moving both devices simultaneously to switchdev mode the IB device removal and re-addition can race with itself causing unexpected behavior. In such case do rescan_drivers() only once in order to add the ethernet representor auxiliary device, and skip adding and removing IB devices. Fixes: ab85ebf43723 ("net/mlx5: E-switch, refactor eswitch mode change") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabledPatrisious Haddad
In case that IB device is already disabled when moving to switchdev mode, which can happen when working with LAG, need to do rescan_drivers() before leaving in order to add ethernet representor auxiliary device. Fixes: ab85ebf43723 ("net/mlx5: E-switch, refactor eswitch mode change") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04net/mlx5: HWS: Properly set bwc queue locks lock classesCosmin Ratiu
The mentioned "Fixes" patch forgot to do that. Fixes: 9addffa34359 ("net/mlx5: HWS, use lock classes for bwc locks") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layoutCosmin Ratiu
It allocates a match template, which creates a compressed definer fc struct, but that is not deallocated. This commit fixes that. Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Seveal fixes scattered across the drivers and a few new features: - Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns - Force disassociate the userspace FD when hns does an async reset - bnxt new features for optimized modify QP to skip certain stayes, CQ coalescing, better debug dumping - mlx5 new data placement ordering feature - Faster destruction of mlx5 devx HW objects - Improvements to RDMA CM mad handling" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits) RDMA/bnxt_re: Correct the sequence of device suspend RDMA/bnxt_re: Use the default mode of congestion control RDMA/bnxt_re: Support different traffic class IB/cm: Rework sending DREQ when destroying a cm_id IB/cm: Do not hold reference on cm_id unless needed IB/cm: Explicitly mark if a response MAD is a retransmission RDMA/mlx5: Move events notifier registration to be after device registration RDMA/bnxt_re: Cache MSIx info to a local structure RDMA/bnxt_re: Refurbish CQ to NQ hash calculation RDMA/bnxt_re: Refactor NQ allocation RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved RDMA/hns: Fix different dgids mapping to the same dip_idx RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design bnxt_en: Add support for RoCE sriov configuration RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg() RDMA/hns: Fix out-of-order issue of requester when setting FENCE RDMA/nldev: Add IB device and net device rename events RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation RDMA/core: Move ib_uverbs_file struct to uverbs_types.h ...
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Rework header allocation loopDragos Tatulea
The current loop code was based on the assumption that there can be page leftovers from previous function calls. This patch changes the allocation loop to make it clearer how pages get allocated every MLX5E_SHAMPO_WQ_HEADER_PER_PAGE headers. This change has no functional implications. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Drop info arrayDragos Tatulea
The info array is used to store a pointer to the dma address of the header and to the frag page. However, this array is not really required: - The frag page can be calculated from the header index frag page index = header index / headers per page. - The dma address can be calculated through a formula: dma page address + header offset. This series gets rid of the info array and uses the above formulas instead. The current_page_index was used in conjunction with the info array to store page fragment indices. This variable is dropped as well. There was no performance regression observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Change frag page setup order during allocationDragos Tatulea
Now that the UMR allocation has been simplified, it is no longer possible to have a leftover page from a previous call to mlx5e_build_shampo_hd_umr(). This patch simplifies the code by switching the order of operations: first take the frag page and then increment the index. This is more straightforward and it also paves the way for dropping the info array. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Fix page_index calculation inconsistencyDragos Tatulea
When calculating the index for the next frag page slot, the divisor is incorrect: it should be the number of pages per queue not the number of headers per queue. This is currently harmless because frag pages are not used directly, but they are intermediated through the info array. But it needs to be fixed as an upcoming patch will get rid of the info array. This patch introduces a new pages per queue variable and plugs it in the formula. Now that this variable exists, additional code can be simplified in the SHAMPO initialization code. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Simplify UMR allocation for headersDragos Tatulea
Allocating page fragments for header data split is currently more complicated than it should be. That's because the number of KSM entries allocated is not aligned to the number of headers per page. This leads to having leftovers in the next allocation which require additional accounting and needlessly complicated code. This patch aligns (down) the number of KSM entries in the UMR WQE to the number of headers per page by: 1) Aligning the max number of entries allocated per UMR WQE (max_ksm_entries) to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE. 2) Aligning the total number of free headers to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE. ... and then it drops the extra accounting code from mlx5e_build_shampo_hd_umr(). Although the number of entries allocated per UMR WQE is slightly smaller due to aligning down, no performance impact was observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Make vport QoS enablement more flexible for future extensionsCarolina Jubran
Refactor esw_qos_vport_enable to support more generic configurations, allowing it to be reused for new vport node types in future patches. This refactor includes a new way to change the vport parent node by disabling the current setup and re-enabling it with the new parent. This change sets the foundation for adapting configuration based on the parent type in future patches. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Integrate esw_qos_vport_enable logic into rate operationsCarolina Jubran
Fold the esw_qos_vport_enable function into operations for configuring maximum and minimum rates, simplifying QoS logic. This change consolidates enabling and updating the scheduling element configuration, streamlining how vport QoS is initialized and adjusted. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Generalize scheduling element operationsCarolina Jubran
Introduce helper functions to create and destroy scheduling elements, allowing flexible configuration for different scheduling element types. The new helper functions streamline the process by centralizing error handling and logging through esw_qos_sched_elem_op_warn, which now accepts the operation type (create, destroy, or modify). The changes also adjust the esw_qos_vport_enable and mlx5_esw_qos_vport_disable functions to leverage the new generalized create/destroy helpers. The destroy functions now log errors with esw_warn without returning them. This prevents unnecessary error handling since the node was already destroyed and no further action is required from callers. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Refactor scheduling element configuration bitmasksCarolina Jubran
Refactor esw_qos_sched_elem_config to set bitmasks only when max_rate or bw_share values change, allowing the function to configure nodes with only one of these parameters. This enables more flexible usage for nodes where only one parameter requires configuration. Remove scattered assignments and checks to centralize them within this function, removing the now redundant esw_qos_set_node_max_rate entirely. With this refactor, also remove the assignment of the vport scheduling node max rate to the parent max rate for unlimited vports (where max rate is set to zero), as firmware already handles this behavior. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Generalize max_rate and min_rate setting for nodesCarolina Jubran
Refactor max_rate and min_rate setting functions to operate on mlx5_esw_sched_node, allowing for generalized handling of both vports and nodes. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Simplify QoS normalization by removing error handlingCarolina Jubran
This change updates esw_qos_normalize_min_rate to not return errors, significantly simplifying the code. Normalization failures are software bugs, and it's unnecessary to handle them with rollback mechanisms. Instead, `esw_qos_update_sched_node_bw_share` and `esw_qos_normalize_min_rate` now return void, with any errors logged as warnings to indicate potential software issues. This approach avoids compensating for hidden bugs and removes error handling from all places that perform normalization, streamlining future patches. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: E-switch, refactor eswitch mode changePatrisious Haddad
The E-switch mode was previously updated before removing and re-adding the IB device, which could cause a temporary mismatch between the E-switch mode and the IB device configuration. To prevent this discrepancy, the IB device is now removed first, then the E-switch mode is updated, and finally, the IB device is re-added. This sequence ensures consistent alignment between the E-switch mode and the IB device whenever the mode changes, regardless of the new mode value. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: Disable loopback self-test on multi-PF netdevCarolina Jubran
In Multi-PF (Socket Direct) configurations, when a loopback packet is sent through one of the secondary devices, it will always be received on the primary device. This causes the loopback layer to fail in identifying the loopback packet as the devices are different. To avoid false test failures, disable the loopback self-test in Multi-PF configurations. Fixes: ed29705e4ed1 ("net/mlx5: Enable SD feature") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: CT: Fix null-ptr-deref in add rule err flowMoshe Shemesh
In error flow of mlx5_tc_ct_entry_add_rule(), in case ct_rule_add() callback returns error, zone_rule->attr is used uninitiated. Fix it to use attr which has the needed pointer value. Kernel log: BUG: kernel NULL pointer dereference, address: 0000000000000110 RIP: 0010:mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] … Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] ? mlx5_tc_ct_entry_add_rule+0x1d5/0x2f0 [mlx5_core] mlx5_tc_ct_block_flow_offload+0xc6a/0xf90 [mlx5_core] ? nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] flow_offload_work_handler+0x142/0x320 [nf_flow_table] ? finish_task_switch.isra.0+0x15b/0x2b0 process_one_work+0x16c/0x320 worker_thread+0x28c/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xb8/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 7fac5c2eced3 ("net/mlx5: CT: Avoid reusing modify header context for natted entries") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: clear xdp features on non-uplink representorsWilliam Tu
Non-uplink representor port does not support XDP. The patch clears the xdp feature by checking the net_device_ops.ndo_bpf is set or not. Verify using the netlink tool: $ tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump dev-get Representor netdev before the patch: {'ifindex': 8, 'xdp-features': {'basic', 'ndo-xmit', 'ndo-xmit-sg', 'redirect', 'rx-sg', 'xsk-zerocopy'}, 'xdp-rx-metadata-features': set(), 'xdp-zc-max-segs': 1, 'xsk-features': set()}, With the patch: {'ifindex': 8, 'xdp-features': set(), 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Signed-off-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: kTLS, Fix incorrect page refcountingDragos Tatulea
The kTLS tx handling code is using a mix of get_page() and page_ref_inc() APIs to increment the page reference. But on the release path (mlx5e_ktls_tx_handle_resync_dump_comp()), only put_page() is used. This is an issue when using pages from large folios: the get_page() references are stored on the folio page while the page_ref_inc() references are stored directly in the given page. On release the folio page will be dereferenced too many times. This was found while doing kTLS testing with sendfile() + ZC when the served file was read from NFS on a kernel with NFS large folios support (commit 49b29a573da8 ("nfs: add support for large folios")). Fixes: 84d1bb2b139e ("net/mlx5e: kTLS, Limit DUMP wqe size") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: fs, lock FTE when checking if activeMark Bloch
The referenced commits introduced a two-step process for deleting FTEs: - Lock the FTE, delete it from hardware, set the hardware deletion function to NULL and unlock the FTE. - Lock the parent flow group, delete the software copy of the FTE, and remove it from the xarray. However, this approach encounters a race condition if a rule with the same match value is added simultaneously. In this scenario, fs_core may set the hardware deletion function to NULL prematurely, causing a panic during subsequent rule deletions. To prevent this, ensure the active flag of the FTE is checked under a lock, which will prevent the fs_core layer from attaching a new steering rule to an FTE that is in the process of deletion. [ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func [ 438.968205] ------------[ cut here ]------------ [ 438.968654] refcount_t: decrement hit 0; leaking memory. [ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110 [ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower] [ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8 [ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110 [ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 [ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286 [ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000 [ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0 [ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0 [ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0 [ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0 [ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 [ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0 [ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.986507] Call Trace: [ 438.986799] <TASK> [ 438.987070] ? __warn+0x7d/0x110 [ 438.987426] ? refcount_warn_saturate+0xfb/0x110 [ 438.987877] ? report_bug+0x17d/0x190 [ 438.988261] ? prb_read_valid+0x17/0x20 [ 438.988659] ? handle_bug+0x53/0x90 [ 438.989054] ? exc_invalid_op+0x14/0x70 [ 438.989458] ? asm_exc_invalid_op+0x16/0x20 [ 438.989883] ? refcount_warn_saturate+0xfb/0x110 [ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core] [ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core] [ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core] [ 438.992054] ? xas_load+0x9/0xb0 [ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core] [ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core] [ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core] [ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core] [ 438.994728] tc_setup_cb_destroy+0xb9/0x190 [ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower] [ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower] [ 438.996105] tc_new_tfilter+0x347/0xbc0 [ 438.996503] ? ___slab_alloc+0x70/0x8c0 [ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0 [ 438.997339] ? __netlink_sendskb+0x4c/0x70 [ 438.997751] ? netlink_unicast+0x286/0x2d0 [ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 438.998625] netlink_rcv_skb+0x54/0x100 [ 438.999020] netlink_unicast+0x203/0x2d0 [ 438.999421] netlink_sendmsg+0x1e4/0x420 [ 438.999820] __sock_sendmsg+0xa1/0xb0 [ 439.000203] ____sys_sendmsg+0x207/0x2a0 [ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0 [ 439.001072] ___sys_sendmsg+0x80/0xc0 [ 439.001459] ? ___sys_recvmsg+0x8b/0xc0 [ 439.001848] ? generic_update_time+0x4d/0x60 [ 439.002282] __sys_sendmsg+0x51/0x90 [ 439.002658] do_syscall_64+0x50/0x110 [ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: Fix msix vectors to respect platform limitParav Pandit
The number of PCI vectors allocated by the platform (which may be fewer than requested) is currently not honored when creating the SF pool; only the PCI MSI-X capability is considered. As a result, when a platform allocates fewer vectors (in non-dynamic mode) than requested, the PF and SF pools end up with an invalid vector range. This causes incorrect SF vector accounting, which leads to the following call trace when an invalid IRQ vector is allocated. This issue is resolved by ensuring that the platform's vector limit is respected for both the SF and PF pools. Workqueue: mlx5_vhca_event0 mlx5_sf_dev_add_active_work [mlx5_core] RIP: 0010:pci_irq_vector+0x23/0x80 RSP: 0018:ffffabd5cebd7248 EFLAGS: 00010246 RAX: ffff980880e7f308 RBX: ffff9808932fb880 RCX: 0000000000000001 RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff980880e7f308 RBP: 0000000000000200 R08: 0000000000000010 R09: ffff97a9116f0860 R10: 0000000000000002 R11: 0000000000000228 R12: ffff980897cd0160 R13: 0000000000000000 R14: ffff97a920fec0c0 R15: ffffabd5cebd72d0 FS: 0000000000000000(0000) GS:ffff97c7ff9c0000(0000) knlGS:0000000000000000 ? rescuer_thread+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core.sf mlx5_core.sf.6: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.7: firmware version: 32.43.356 mlx5_core.sf mlx5_core.sf.6 enpa1s0f0s4: renamed from eth0 mlx5_core.sf mlx5_core.sf.7: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation") Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5: E-switch, unload IB representors when unloading ETH representorsChiara Meiohas
IB representors depend on ETH representors, so the IB representors should not exist without the ETH ones. When unloading the ETH representors, the corresponding IB representors should be also unloaded. The commit 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") introduced the use of the ib_device_set_netdev API in IB repsresentors. ib_device_set_netdev() increments the refcount of the representor's netdev when loading an IB representor and decrements it when unloading. Without the unloading of the IB representor, the refcount of the representor's netdev remains greater than 0, preventing it from being unregistered. The patch uncovered an underlying bug where the eth representor is unloaded, without unloading the IB representor. This issue happened when using multiport E-switch and rebooting, causing the shutdown to hang when unloading the ETH representor because the refcount of the representor's netdevice was greater than 0. Call trace: unregister_netdevice: waiting for eth3 to become free. Usage count = 2 ref_tracker: eth%d@00000000661d60f7 has 1/1 users at ib_device_set_netdev+0x160/0x2d0 [ib_core] mlx5_ib_vport_rep_load+0x104/0x3f0 [mlx5_ib] mlx5_eswitch_reload_ib_reps+0xfc/0x110 [mlx5_core] mlx5_mpesw_work+0x236/0x330 [mlx5_core] process_one_work+0x169/0x320 worker_thread+0x288/0x3a0 kthread+0xb8/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mlx5/core: deduplicate {mlx5_,}eq_update_ci()Caleb Sander Mateos
The logic of eq_update_ci() is duplicated in mlx5_eq_update_ci(). The only additional work done by mlx5_eq_update_ci() is to increment eq->cons_index. Call eq_update_ci() from mlx5_eq_update_ci() to avoid the duplication. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183054.2443218-2-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mlx5/core: relax memory barrier in eq_update_ci()Caleb Sander Mateos
The memory barrier in eq_update_ci() after the doorbell write is a significant hot spot in mlx5_eq_comp_int(). Under heavy TCP load, we see 3% of CPU time spent on the mfence instruction. 98df6d5b877c ("net/mlx5: A write memory barrier is sufficient in EQ ci update") already relaxed the full memory barrier to just a write barrier in mlx5_eq_update_ci(), which duplicates eq_update_ci(). So replace mb() with wmb() in eq_update_ci() too. On strongly ordered architectures, no barrier is actually needed because the MMIO writes to the doorbell register are guaranteed to appear to the device in the order they were made. However, the kernel's ordered MMIO primitive writel() lacks a convenient big-endian interface. Therefore, we opt to stick with __raw_writel() + a barrier. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183054.2443218-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-09mlx5/core: Schedule EQ comp tasklet only if necessaryCaleb Sander Mateos
Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet to call mlx5_cq_tasklet_cb() if it processes any completions. For CQs whose completions don't need to be processed in tasklet context, this adds unnecessary overhead. In a heavy TCP workload, we see 4% of CPU time spent on the tasklet_trylock() in tasklet_action_common(), with a smaller amount spent on the atomic operations in tasklet_schedule(), tasklet_clear_sched(), and locking the spinlock in mlx5_cq_tasklet_cb(). TCP completions are handled by mlx5e_completion_event(), which schedules NAPI to poll the queue, so they don't need tasklet processing. Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs to be processed in tasklet context, so it can schedule the tasklet. CQs that need tasklet processing have their interrupt comp handler set to mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that don't need tasklet processing won't schedule the tasklet. To avoid scheduling the tasklet multiple times during the same interrupt, only schedule the tasklet in mlx5_add_cq_to_tasklet() if the tasklet work queue was empty before the new CQ was pushed to it. The additional branch in mlx5_add_cq_to_tasklet(), called for each EQE, may add a small cost for the userspace Infiniband CQs whose completions are processed in tasklet context. But this seems worth it to avoid the tasklet overhead for CQs that don't need it. Note that the mlx4 driver works the same way: it schedules the tasklet in mlx4_add_cq_to_tasklet() and only if the work queue was empty before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20241105204000.1807095-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-05mlx5_en: use read sequence for gettimex64Vadim Fedorenko
The gettimex64() doesn't modify values in timecounter, that's why there is no need to update sequence counter. Reduce the contention on sequence lock for multi-thread PHC reading use-case. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241014170103.2473580-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-04RDMA/mlx5: Ensure active slave attachment to the bond IB deviceChiara Meiohas
Fix a race condition when creating a lag bond in active backup mode where after the bond creation the backup slave was attached to the IB device, instead of the active slave. This caused stale entries in the GID table, as the gid updating mechanism relies on ib_device_get_netdev(), which would return the backup slave. Send an MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE event when activating the lag, additionally to when modifying the lag. This ensures that eventually the active netdevice is stored in the bond IB device. When handling this event remove the GIDs of the previously attached netdevice in this port and rescan the GIDs of the newly attached netdevice. This ensures that eventually the active slave netdevice is correctly stored in the IB device port. While there might be a brief moment where the backup slave GIDs appear in the GID table, it will eventually stabilize with the correct GIDs (of the bond and the active slave). Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/91fc2cb24f63add266a528c1c702668a80416d9f.1730381292.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-04RDMA/mlx5: Call dev_put() after the blocking notifierChiara Meiohas
Move dev_put() call to occur directly after the blocking notifier, instead of within the event handler. Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/342ff94b3dcbb07da1c7dab862a73933d604b717.1730381292.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-03net/mlx5e: do not create xdp_redirect for non-uplink repWilliam Tu
XDP and XDP socket require extra SQ/RQ/CQs. Most of these resources are dynamically created: no XDP program loaded, no resources are created. One exception is the SQ/CQ created for XDP_REDRIECT, used for other netdev to forward packet to mlx5 for transmit. The patch disables creation of SQ and CQ used for egress XDP_REDIRECT, by checking whether ndo_xdp_xmit is set or not. For netdev without XDP support such as non-uplink representor, this saves around 0.35MB of memory, per representor netdevice per channel. Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5e: move XDP_REDIRECT sq to dynamic allocationWilliam Tu
Dynamically allocating xdpsq, used by egress side XDP_REDIRECT. mlx5 has multiple XDP sqs. Under struct mlx5e_channel: 1. rx_xdpsq: used for XDP_TX, an XDP prog handles the rx packet and transmits using the same queue as rx. 2. xdpsq: used by egress side XDP_REDIRECT. This is for another interface to redirect packet to the mlx5 interface, using ndo_xdp_xmit . 3. xsksq: used by XSK. XSK has its own dedicated channel, and it also has resources of 1 and 2. The patch changes only the 2. xdpsq. Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5: HWS, renamed the files in accordance with naming conventionYevgeny Kliteynik
Removed the 'mlx5hws_' file name prefix from the internal HWS files. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5: DR, moved all the SWS code into a separate directoryYevgeny Kliteynik
After adding HWS support in a separate folder, moving all the SWS code into its own folder as well. Now SWS and HWS implementation are located in their appropriate folders: - steering/sws/ - steering/hws/ Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5: Rework esw qos domain init and cleanupCosmin Ratiu
The first approach was flawed, because there are situations where the esw mode change fails, leaving the qos domain as NULL. Various calls into the QoS infra then trigger a NULL pointer access and unhappiness. Improve that by a combination of: - Allocating the QoS domain on esw init and cleaning it up on teardown. - Refactoring mode change to only call qos domain init but not cleanup. - Making qos domain init idempotent - not change anything if nothing needs changing. Together, these should guarantee that, as long as the memory allocations succeed, there should always be a valid qos domain until the esw cleanup, no matter what mode changes happen (or failures thereof). Fixes: 107a034d5c1e ("net/mlx5: qos: Store rate groups in a qos domain") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03dim: pass dim_sample to net_dim() by referenceCaleb Sander Mateos
net_dim() is currently passed a struct dim_sample argument by value. struct dim_sample is 24 bytes. Since this is greater 16 bytes, x86-64 passes it on the stack. All callers have already initialized dim_sample on the stack, so passing it by value requires pushing a duplicated copy to the stack. Either witing to the stack and immediately reading it, or perhaps dereferencing addresses relative to the stack pointer in a chain of push instructions, seems to perform quite poorly. In a heavy TCP workload, mlx5e_handle_rx_dim() consumes 3% of CPU time, 94% of which is attributed to the first push instruction to copy dim_sample on the stack for the call to net_dim(): // Call ktime_get() 0.26 |4ead2: call 4ead7 <mlx5e_handle_rx_dim+0x47> // Pass the address of struct dim in %rdi |4ead7: lea 0x3d0(%rbx),%rdi // Set dim_sample.pkt_ctr |4eade: mov %r13d,0x8(%rsp) // Set dim_sample.byte_ctr |4eae3: mov %r12d,0xc(%rsp) // Set dim_sample.event_ctr 0.15 |4eae8: mov %bp,0x10(%rsp) // Duplicate dim_sample on the stack 94.16 |4eaed: push 0x10(%rsp) 2.79 |4eaf1: push 0x10(%rsp) 0.07 |4eaf5: push %rax // Call net_dim() 0.21 |4eaf6: call 4eafb <mlx5e_handle_rx_dim+0x6b> To allow the caller to reuse the struct dim_sample already on the stack, pass the struct dim_sample by reference to net_dim(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Link: https://patch.msgid.link/20241031002326.3426181-2-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5: DPLL, Add clock quality level op implementationJiri Pirko
Use MSECQ register to query clock quality from firmware. Implement the dpll op and fill-up the quality level value properly. Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20241030081157.966604-3-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29mlx5: simplify EQ interrupt polling logicCaleb Sander Mateos
Use a while loop in mlx5_eq_comp_int() and mlx5_eq_async_int() to clarify the EQE polling logic. This consolidates the next_eqe_sw() calls for the first and subequent iterations. It also avoids a goto. Turn the num_eqes < MLX5_EQ_POLLING_BUDGET check into a break condition. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241023205113.255866-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29mlx5: fix typo in "mlx5_cqwq_get_cqe_enahnced_comp"Caleb Sander Mateos
"enahnced" looks to be a misspelling of "enhanced". Rename "mlx5_cqwq_get_cqe_enahnced_comp" to "mlx5_cqwq_get_cqe_enhanced_comp". Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20241023164840.140535-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/mlx5e: Update features on ring size changeDragos Tatulea
When the ring size changes successfully, trigger netdev_update_features() to enable features in wanted state if applicable. An example of such scenario: $ ip link set dev eth1 up $ ethtool --set-ring eth1 rx 8192 $ ip link set dev eth1 mtu 9000 $ ethtool --features eth1 rx-gro-hw on --> fails $ ethtool --set-ring eth1 rx 1024 With this patch, HW GRO will be turned on automatically because it is set in the device's wanted_features. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024164134.299646-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/mlx5e: Update features on MTU changeDragos Tatulea
When the MTU changes successfully, trigger netdev_update_features() to enable features in wanted state if applicable. An example of such scenario: $ ip link set dev eth1 up $ ethtool --set-ring eth1 rx 8192 $ ip link set dev eth1 mtu 9000 $ ethtool --features eth1 rx-gro-hw on --> fails $ ip link set dev eth1 mtu 7000 With this patch, HW GRO will be turned on automatically because it is set in the device's wanted_features. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024164134.299646-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-28net/mlx5: unique names for per device cachesSebastian Ott
Add the device name to the per device kmem_cache names to ensure their uniqueness. This fixes warnings like this: "kmem_cache of name 'mlx5_fs_fgs' already exists". Signed-off-by: Sebastian Ott <sebott@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241023134146.28448-1-sebott@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-21net/mlx5: fs, rename modify header struct member actionMoshe Shemesh
As preparation for HW Steering support, rename modify header struct member action to fs_dr_action, to distinguish from fs_hws_action which will be added. Add a pointer where needed to keep code line shorter and more readable. Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-21net/mlx5: fs, rename packet reformat struct member actionMoshe Shemesh
As preparation for HW Steering support, rename packet reformat struct member action to fs_dr_action, to distinguish from fs_hws_action which will be added. Add a pointer where needed to keep code line shorter and more readable. Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-21net/mlx5: Only create VEPA flow table when in VEPA modeBenjamin Poirier
Currently, when VFs are created, two flow tables are added for the eswitch: the "fdb" table, which contains rules for each VF and the "vepa_fdb" table. In the default VEB mode, the vepa_fdb table is empty. When switching to VEPA mode, flow steering rules are added to vepa_fdb. Even though the vepa_fdb table is empty in VEB mode, its presence adds some cost to packet processing. In some workloads, this leads to drops which are reported by the rx_discards_phy ethtool counter. In order to improve performance, only create vepa_fdb when in VEPA mode. Tests were done on a ConnectX-6 Lx adapter forwarding 64B packets between both ports using dpdk-testpmd. Numbers are Rx-pps for each port, as reported by testpmd. Without changes: traffic to unknown mac testpmd on PF numvfs=0,0 35257998,35264499 numvfs=1,1 24590124,24590888 testpmd on VF with numvfs=1,1 20434338,20434887 traffic to VF mac testpmd on VF with numvfs=1,1 30341014,30340749 With changes: traffic to unknown mac testpmd on PF numvfs=0,0 35404361,35383378 numvfs=1,1 29801247,29790757 testpmd on VF with numvfs=1,1 24310435,24309084 traffic to VF mac testpmd on VF with numvfs=1,1 34811436,34781706 Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-21net/mlx5: Add sync reset drop mode supportMoshe Shemesh
On sync reset flow, firmware may request a PF, which already acknowledged the unload event, to move to drop mode. Drop mode means that this PF will reduce polling frequency, as this PF is not going to have another active part in the reset, but only reload back after the reset. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-21net/mlx5: Generalize QoS operations for nodes and vportsCarolina Jubran
Refactor QoS normalization and rate calculation functions to operate on mlx5_esw_sched_node, allowing for generalized handling of both vports and nodes. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>