Skip to content

tg3: add napi_enabled flag to track napi_enable/napi_disable calls#570

Open
yurypm wants to merge 1 commit into
sonic-net:masterfrom
yurypm:master
Open

tg3: add napi_enabled flag to track napi_enable/napi_disable calls#570
yurypm wants to merge 1 commit into
sonic-net:masterfrom
yurypm:master

Conversation

@yurypm
Copy link
Copy Markdown
Contributor

@yurypm yurypm commented May 7, 2026

We need this patch to fix a soft lockup in the Linux kernel on Arista modular chassis in the 202511 branch.
During linecard resets, uncorrectable errors could be reported. As a result, AER recovery for the tg3 device can be initiated by the AER kernel driver. The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume will
be called. But AER error recovery can fail. For example, when one of PCIe
devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. As a result,
tg3_io_slot_reset and tg3_io_resume are not called, PCIe device is
disabled and NAPI is disabled (pci_disable_device and napi_disabled
are called from tg3_io_error_detected). Then we can try to disable PCIe link
and napi_disable will be called again:
napi_disable+0x1b/0x1b0
tg3_napi_disable+0x89/0xa0 [tg3]
tg3_netif_stop+0x37/0xe3 [tg3]
tg3_stop+0x30/0x160 [tg3]
tg3_close+0x2a/0x60 [tg3]
__dev_close_many+0xad/0x130
dev_close_many+0xb2/0x190
unregister_netdevice_many_notify+0x19d/0xa00
? try_to_wake_up+0x302/0x680
unregister_netdevice_queue+0xf8/0x140
unregister_netdev+0x1c/0x30
tg3_remove_one+0xaa/0x150 [tg3]
pci_device_remove+0x42/0xb0
device_release_driver_internal+0x19c/0x200
pci_stop_bus_device+0x85/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_and_remove_bus_device+0x12/0x20
pciehp_unconfigure_device+0x9f/0x160
pciehp_disable_slot+0x67/0x100
pciehp_handle_presence_or_link_change+0x77/0x350
This is not expected by napi_disable and a thread can be locked in
napi_disable forever. We have pcierr_recovery to cover similar issue, but for
fatal errors. We cannot reuse this flag because it is reset in tg3_io_resume,
but it is not called when AER recovery fails.
Added new napi_enable flag in tg3 struct and don't call napi_disable if
napi_enable was not called before.

@yurypm yurypm requested a review from a team as a code owner May 7, 2026 15:46
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 7, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: yurypm / name: Yury Murashka (9564cef)

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch Outdated
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@yurypm yurypm requested a review from paulmenzel May 12, 2026 23:17
Copy link
Copy Markdown
Contributor

@paulmenzel paulmenzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to publish my comments I made some days ago.

Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch Outdated
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@yurypm yurypm requested a review from paulmenzel May 13, 2026 13:19
We need this patch to fix a soft lockup in the Linux kernel on Arista modular chassis in the 202511 branch.
During linecard resets, uncorrectable errors could be reported. As a result, AER recovery for the tg3 device can be initiated by the AER kernel driver. The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume will
be called. But AER error recovery can fail. For example, when one of PCIe
devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. As a result,
tg3_io_slot_reset and tg3_io_resume are not called, PCIe device is
disabled and NAPI is disabled (pci_disable_device and napi_disabled
are called from tg3_io_error_detected). Then we can try to disable PCIe link
and napi_disable will be called again:
napi_disable+0x1b/0x1b0
tg3_napi_disable+0x89/0xa0 [tg3]
tg3_netif_stop+0x37/0xe3 [tg3]
tg3_stop+0x30/0x160 [tg3]
tg3_close+0x2a/0x60 [tg3]
__dev_close_many+0xad/0x130
dev_close_many+0xb2/0x190
unregister_netdevice_many_notify+0x19d/0xa00
? try_to_wake_up+0x302/0x680
unregister_netdevice_queue+0xf8/0x140
unregister_netdev+0x1c/0x30
tg3_remove_one+0xaa/0x150 [tg3]
pci_device_remove+0x42/0xb0
device_release_driver_internal+0x19c/0x200
pci_stop_bus_device+0x85/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_and_remove_bus_device+0x12/0x20
pciehp_unconfigure_device+0x9f/0x160
pciehp_disable_slot+0x67/0x100
pciehp_handle_presence_or_link_change+0x77/0x350
This is not expected by napi_disable and a thread can be locked in
napi_disable forever. We have pcierr_recovery to cover similar issue, but for
fatal errors. We cannot reuse this flag because it is reset in tg3_io_resume,
but it is not called when AER recovery fails.

If an AER error is reported, recovery is started and tg3_io_error_detected is
called. In tg3_io_error_detected, NAPI is disabled and pci_disable_device is
called. Then, if we try to reset the device, pci_disable_device will be called
again for the same device.

Add a napi_enabled flag to struct tg3 to track whether napi_enable has
been called. Guard tg3_napi_disable() against being called before
tg3_napi_enable(), logging an error if that happens. Also guard
pci_disable_device() calls in tg3_remove_one() and tg3_shutdown() with
pci_is_enabled() to avoid disabling an already-disabled device.

Signed-off-by: Yury Murashka <yurypm@arista.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants