8000
Skip to content

Conversation

@sashu2310
Copy link
Contributor

This PR addresses issue #9849 :
BUG:
Workers encounter Out Of Memory (OOM) issues when fetching a large number of ETA/countdown tasks, as there’s no configurable limit on how many ETA tasks a worker can hold in memory. This results in workers crashing and then, those tasks go back to the queue, and get fetched by other workers, which go in their turn out of memory, and so on.

FIX:

  • Introduced worker_eta_task_limit to limit the number of ETA/countdown tasks a worker can hold in memory, preventing memory exhaustion.
  • Updated the task execution strategy to reject new ETA tasks when the limit is reached.
  • Added documentation for the new configuration option.
  • Implemented unit tests to validate the behavior of the ETA task limit.

Fixes #9849

@sashu2310 sashu2310 force-pushed the fix/eta-tasks-prefetch-limit branch from 8f2e60f to e81eb4a Compare August 8, 2025 11:59
@auvipy auvipy requested review from auvipy and Copilot August 8, 2025 12:04

This comment was marked as outdated.

8000
@sashu2310 sashu2310 force-pushed the fix/eta-tasks-prefetch-limit branch from e81eb4a to ba314c9 Compare August 8, 2025 12:12
Copy link
Member
@auvipy auvipy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check the failing unit tests

Copy link
Member
@Nusnus Nusnus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting PR - how did you test it (aside from the unit tests)?

@Nusnus
Copy link
Member
Nusnus commented Aug 8, 2025

please check the failing unit tests

2025-08-08T12:23:06.0103380Z ==================================== ERRORS ====================================
2025-08-08T12:23:06.0103733Z ________________ ERROR collecting t/unit/worker/test_request.py ________________
2025-08-08T12:23:06.0104157Z .tox/3.13-unit/lib/python3.13/site-packages/_pytest/python.py:493: in importtestmodule
2025-08-08T12:23:06.0104553Z     mod = import_path(
2025-08-08T12:23:06.0104827Z .tox/3.13-unit/lib/python3.13/site-packages/_pytest/pathlib.py:587: in import_path
2025-08-08T12:23:06.0105134Z     importlib.import_module(module_name)
2025-08-08T12:23:06.0105509Z /opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/importlib/__init__.py:88: in import_module
2025-08-08T12:23:06.0105912Z     return _bootstrap._gcd_import(name[level:], package, level)
2025-08-08T12:23:06.0106191Z <frozen importlib._bootstrap>:1387: in _gcd_import
2025-08-08T12:23:06.0106467Z     ???
2025-08-08T12:23:06.0106696Z <frozen importlib._bootstrap>:1360: in _find_and_load
2025-08-08T12:23:06.0106908Z     ???
2025-08-08T12:23:06.0107101Z <frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
2025-08-08T12:23:06.0107325Z     ???
2025-08-08T12:23:06.0107496Z <frozen importlib._bootstrap>:935: in _load_unlocked
2025-08-08T12:23:06.0107699Z     ???
2025-08-08T12:23:06.0107961Z .tox/3.13-unit/lib/python3.13/site-packages/_pytest/assertion/rewrite.py:185: in exec_module
2025-08-08T12:23:06.0108278Z     exec(co, module.__dict__)
2025-08-08T12:23:06.0108481Z t/unit/worker/test_request.py:22: in <module>
2025-08-08T12:23:06.0108694Z     from celery.worker import strategy
2025-08-08T12:23:06.0109014Z E     File "/home/runner/_work/celery/celery/celery/worker/strategy.py", line 205
2025-08-08T12:23:06.0109294Z E       nonlocal eta_task_count
2025-08-08T12:23:06.0109472Z E       ^^^^^^^^^^^^^^^^^^^^^^^
2025-08-08T12:23:06.0109710Z E   SyntaxError: name 'eta_task_count' is used prior to nonlocal declaration
2025-08-08T12:23:06.0110059Z _______________ ERROR collecting t/unit/worker/test_strategy.py ________________
2025-08-08T12:23:06.0110435Z .tox/3.13-unit/lib/python3.13/site-packages/_pytest/python.py:493: in importtestmodule
2025-08-08T12:23:06.0110858Z     mod = import_path(
2025-08-08T12:23:06.0111121Z .tox/3.13-unit/lib/python3.13/site-packages/_pytest/pathlib.py:587: in import_path
2025-08-08T12:23:06.0111418Z     importlib.import_module(module_name)
2025-08-08T12:23:06.0111819Z /opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/importlib/__init__.py:88: in import_module
2025-08-08T12:23:06.0112195Z     return _bootstrap._gcd_import(name[level:], package, level)
2025-08-08T12:23:06.0112463Z <frozen importlib._bootstrap>:1387: in _gcd_import
2025-08-08T12:23:06.0112672Z     ???
2025-08-08T12:23:06.0112847Z <frozen importlib._bootstrap>:1360: in _find_and_load
2025-08-08T12:23:06.0113058Z     ???
2025-08-08T12:23:06.0113248Z <frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
2025-08-08T12:23:06.0113466Z     ???
2025-08-08T12:23:06.0113640Z <frozen importlib._bootstrap>:935: in _load_unlocked
2025-08-08T12:23:06.0113902Z     ???
2025-08-08T12:23:06.0114173Z .tox/3.13-unit/lib/python3.13/site-packages/_pytest/assertion/rewrite.py:185: in exec_module
2025-08-08T12:23:06.0114486Z     exec(co, module.__dict__)
2025-08-08T12:23:06.0114750Z t/unit/worker/test_strategy.py:15: in <module>
2025-08-08T12:23:06.0115054Z     from celery.worker.strategy import default as default_strategy
2025-08-08T12:23:06.0115416Z E     File "/home/runner/_work/celery/celery/celery/worker/strategy.py", line 205
2025-08-08T12:23:06.0115784Z E       nonlocal eta_task_count
2025-08-08T12:23:06.0115985Z E       ^^^^^^^^^^^^^^^^^^^^^^^
2025-08-08T12:23:06.0116232Z E   SyntaxError: name 'eta_task_count' is used prior to nonlocal declaration
2025-08-08T12:23:06.0116616Z =============================== warnings summary ===============================

Copy link
Member
@Nusnus Nusnus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use tox -e lint to check lint issues locally

@sashu2310
Copy link
Contributor Author

Interesting PR - how did you test it (aside from the unit tests)?

I tested the changes by mocking a bulk set of ETA tasks with an ETA of around 30 minutes. I then manually set a low memory limit for the worker process. The worker kept fetching the ETA messages from the Redis broker and holding them in memory. Once the memory usage hit the set limit, the worker started crashing, and the remaining ETA tasks were picked up by other workers, which also eventually crashed as they exceeded the memory limit.

With the new configuration in place, the worker rejects new ETA tasks when the limit is hit, preventing memory exhaustion and worker crashes. This behaviour was validated by monitoring that workers no longer crashed under heavy ETA task load, and the rejected tasks were safely re-queued as expected.

@Nusnus Nusnus marked this pull request as draft August 8, 2025 12:42
Nusnus
Nusnus previously requested changes Aug 8, 2025
Copy link
Member
@Nusnus Nusnus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to draft until all of the issues are addressed.

@sashu2310 sashu2310 force-pushed the fix/eta-tasks-prefetch-limit branch from 8f798e7 to 45b4c41 Compare August 8, 2025 13:38
@sashu2310 sashu2310 marked this pull request as ready for review August 8, 2025 13:40
@sashu2310 sashu2310 requested review from Nusnus and auvipy August 8, 2025 13:41
@sashu2310 sashu2310 marked this pull request as draft August 8, 2025 14:16
@sashu2310 sashu2310 marked this pull request as ready for review August 8, 2025 14:19
@sashu2310
Copy link
Contributor Author

Are these tests running on stale code ? I have already fixed the pertaining issue

@sashu2310 sashu2310 force-pushed the fix/eta-tasks-prefetch-limit branch from 45b4c41 to 8c4e5f2 Compare August 9, 2025 07:51
@sashu2310 sashu2310 marked this pull request as draft August 9, 2025 08:08
Copy link
@AyoubOm AyoubOm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments about the implementation approach

@sashu2310 sashu2310 force-pushed the fix/eta-tasks-prefetch-limit branch from acbbe08 to dac304a Compare August 9, 2025 13:14
@codecov
Copy link
codecov bot commented Aug 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.64%. Comparing base (166f705) to head (25f4a38).
⚠️ Report is 66 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9853   +/-   ##
=======================================
  Coverage   78.63%   78.64%           
=======================================
  Files         153      153           
  Lines       19222    19223    +1     
  Branches     2555     2555           
=======================================
+ Hits        15115    15117    +2     
  Misses       3810     3810           
+ Partials      297      296    -1     
Flag Coverage Δ
unittests 78.61% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sashu2310 sashu2310 marked this pull request as ready for review August 9, 2025 14:26
@auvipy auvipy added this to the 5.6.0 milestone Aug 9, 2025
Copy link
Member
@Nusnus Nusnus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing everything.
The failing integration tests are an issue with main.

We're trying to solve it ASAP to get a clean CI on new contributions, FYI

@sashu2310 sashu2310 force-pushed the fix/eta-tasks-prefetch-limit branch from 0d57db9 to 6b61e4d Compare August 11, 2025 08:17
@Nusnus
Copy link
Member
Nusnus commented Aug 11, 2025

OK lets wait for the kombu 5.6b2

I'll release it in the next 24h; OOO for today :)

@Nusnus
Copy link
Member
Nusnus commented Aug 12, 2025

OK lets wait for the kombu 5.6b2

I'll release it in the next 24h; OOO for today :)

Done: https://github.com/celery/kombu/releases/tag/v5.6.0b2

@sashu2310

@Nusnus Nusnus force-pushed the fix/eta-tasks-prefetch-limit branch from 6b61e4d to f9362eb Compare August 12, 2025 11:22
@sashu2310
Copy link
Contributor Author

Hey @auvipy @Nusnus Is this is good to be merged now ? or we're gonna wait for the failing integration tests to be fixed first ?

@auvipy
Copy link
Member
auvipy commented Aug 13, 2025

lets us wait for couple of days for another round of review and the CI fix...

@AyoubOm
Copy link
AyoubOm commented Aug 13, 2025

Thank you for working on this ! I’ll give it a look as well asap, when I have some time

Copy link
@AyoubOm AyoubOm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One minor comment on the docs

@Nusnus
Copy link
Member
Nusnus commented Aug 21, 2025

lets us wait for couple of days for another round of review and the CI fix...

Unfortunately this one is a bit more tricky than expected.

I'm planning to give it another shot tomorrow.

Fyi

@AyoubOm
Copy link
AyoubOm commented Aug 22, 2025

lets us wait for couple of days for another round of review and the CI fix...

Unfortunately this one is a bit more tricky than expected.

I'm planning to give it another shot tomorrow.

Fyi

@Nusnus Do you mean the PR or the CI fix ?

@Nusnus
Copy link
Member
Nusnus commented Aug 22, 2025

lets us wait for couple of days for another round of review and the CI fix...

Unfortunately this one is a bit more tricky than expected.

I'm planning to give it another shot tomorrow.

Fyi

@Nusnus Do you mean the PR or the CI fix ?

CI. There are many contributions that are blocked because we can't confirm the changes are valid.

@auvipy auvipy force-pushed the fix/eta-tasks-prefetch-limit branch from 8acc6ff to fba15f3 Compare August 30, 2025 13:29
sashu2310 and others added 11 commits August 31, 2025 11:03
- Introduced `worker_eta_task_limit` to limit the number of ETA/countdown tasks a worker can hold in memory, preventing memory exhaustion.
- Updated the task execution strategy to reject new ETA tasks when the limit is reached.
- Added documentation for the new configuration option.
- Implemented unit tests to validate the behavior of the ETA task limit.
Co-authored-by: Tomer Nosrati <tomer.nosrati@gmail.com>
- Introduced `worker_eta_task_limit` setting to limit the number of ETA tasks a worker can hold in memory.
- Implemented `ETATaskTracker` class to track and enforce the ETA task limit.
- Updated `default` strategy to reject new ETA tasks when the limit is reached.
- Added unit tests to verify the behavior of the ETA task limit and tracker.
@auvipy auvipy force-pushed the fix/eta-tasks-prefetch-limit branch from fba15f3 to 25f4a38 Compare August 31, 2025 05:03
@auvipy auvipy merged commit 6da3282 into celery:main Aug 31, 2025
196 of 202 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Celery worker OOM due to unbounded prefetch on eta tasks

5 participants

0