feat(control): improve DNS fallback reliability and harden connection lifecycle #936

olicesx wants to merge 525 commits into daeuniverse:main from
Conversation
Pull request overview
This PR hardens the control-plane DNS forwarding and connection lifecycle under concurrency, adding a UDP-first/TCP-fallback path for tcp+udp DNS upstreams, health feedback on forward failures, and multiple regression tests to validate fallback, cleanup, and pool safety.
Changes:
- Add robust DNS forwarding behavior: UDP-first with TCP fallback for tcp+udp upstreams, plus dialer health feedback on forward failures and concurrency limiting/singleflight coalescing.
- Improve UDP/TCP connection lifecycle and concurrency safety (UDP task pool GC, DNS connection pooling/pipelining, routing-result caching for UDP endpoints).
- Add/expand regression tests and update changelog entries.
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| control/utils.go | Adjust routing matcher call signature and MAC formatting to fixed-size 16-byte inputs. |
| control/udp_task_pool_test.go | Replace prior timing-based test with deterministic concurrency/order regression tests. |
| control/udp_task_pool.go | Rework UDP task queue lifecycle/GC for concurrency safety and predictable idle cleanup. |
| control/udp_routing_cache_test.go | Add tests for UDP endpoint routing-result cache hit/expire behavior. |
| control/udp_endpoint_pool.go | Add per-endpoint routing-result cache with TTL and concurrency protection. |
| control/udp.go | Add fast QUIC prefilter to avoid expensive sniffing; handle DNS concurrency-limit refusal explicitly. |
| control/tcp_test.go | Add regression test ensuring RelayTCP cancellation unblocks the opposite direction. |
| control/tcp.go | Use context-driven cancellation to interrupt the other copy direction promptly; switch to consts IPPROTO. |
| control/routing_matcher_userspace.go | Change matcher inputs to fixed-size [16]uint8 and reduce per-call allocations. |
| control/packet_sniffer_pool_test.go | Make packet sniffer tests isolate global pool state and remove sessions explicitly. |
| control/dns_pipelining_bench_test.go | Add benchmarks around pipelined DNS conn, ID allocation, and contention patterns. |
| control/dns_pipelined_conn_test.go | Add tests ensuring pipelined conn cleanup on success/timeout and input ID preservation. |
| control/dns_listener.go | Fix TCP listener start logic and ignore concurrency-limit errors (REFUSED already written). |
| control/dns_id_bitmap_test.go | Add tests for concurrent ID allocation uniqueness and reuse behavior. |
| control/dns_fallback_test.go | Add tests for UDP→TCP fallback and for avoiding dialer poisoning on canceled contexts. |
| control/dns_control.go | Major DNS controller hardening: concurrency limiter, singleflight coalescing, sync.Map caches, forwarder caching, fallback routing, and close lifecycle. |
| control/dns_conn_pool_test.go | Add tests for UDP conn pool close/put races, conn pool dial contention, and responseSlot reuse. |
| control/dns_concurrency_test.go | Add regression test validating concurrency limiter rejection behavior. |
| control/dns_cache.go | Replace reflection-based deep copy with explicit RR copying and add DnsCache.Clone. |
| control/dns.go | Implement pooled/pipelined DNS transports, TCP/TLS connection pooling, UDP conn pool, and improved cleanup paths. |
| control/control_plane_core_test.go | Add race regression test validating atomic core flip behavior. |
| control/control_plane_core.go | Make core flip atomic/CAS-based and apply best-effort qdisc/filter cleanup comments. |
| control/control_plane.go | Wire DNS controller Close into lifecycle, switch to sync.Map cloning, and add UDP routing-result caching in receive path. |
| control/connectivity.go | Use local consts IP protocol numbers instead of unix constants. |
| control/bpf_utils.go | Add documentation notes for auto-generated BPF types; improve batch map error wrapping. |
| control/anyfrom_pool.go | Gate GSO detection behind env var to keep GSO disabled by default but optionally testable. |
| component/sniffing/sniffer.go | Add cheap QUIC header precheck to avoid costly sniffing when not applicable. |
| component/sniffing/quic_test.go | Add unit tests for the QUIC initial-packet precheck. |
| component/sniffing/quic.go | Add IsLikelyQuicInitialPacket helper for fast-path filtering. |
| common/consts/dialer_test.go | Add tests for protocol conversion helpers. |
| common/consts/dialer.go | Replace unix protocol constants with local IANA protocol numbers; add doc comments. |
| CHANGELOGS.md | Add Unreleased section documenting DNS/control robustness improvements and tests. |
```go
if slot == nil {
	continue
}
slot.set(&msg)
```
In pipelinedConn.readLoop, slot.set(&msg) passes a pointer to the loop-local msg variable. Because multiple in-flight requests can be pending concurrently, this can lead to responses sharing/overwriting the same underlying dnsmessage.Msg storage (and potential data races / wrong answers delivered). Allocate a distinct message per response before sending it to the slot (e.g., create a new dnsmessage.Msg and copy msg into it, or copy into a new variable whose address won’t be reused across iterations).
Suggested change:

```go
// Allocate a distinct message instance per response to avoid
// sharing the loop-local msg between concurrent requests.
resp := new(dnsmessage.Msg)
*resp = msg
slot.set(resp)
```
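The underlying hazard is standard Go pointer aliasing: taking the address of a variable that is reused across iterations. A minimal standalone reproduction (not the PR's code, just the pattern):

```go
package main

import "fmt"

func main() {
	// Collect a pointer on each iteration, mimicking slot.set(&msg) in a
	// read loop that decodes into the same msg variable every time.
	var ptrs []*int
	var msg int
	for _, v := range []int{1, 2, 3} {
		msg = v // msg is reused across iterations, like a decoded DNS response
		ptrs = append(ptrs, &msg)
	}
	// All three pointers alias the same storage: every reader observes the
	// last value written.
	for _, p := range ptrs {
		fmt.Print(*p, " ") // prints 3 three times, not 1 2 3
	}
	fmt.Println()
}
```

Copying into a freshly allocated value per iteration, as the suggestion does with `resp := new(dnsmessage.Msg); *resp = msg`, breaks the aliasing.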
```diff
 func (c *DnsCache) FillInto(req *dnsmessage.Msg) {
-	req.Answer = deepcopy.Copy(c.Answer).([]dnsmessage.RR)
+	if c.Answer != nil {
+		req.Answer = make([]dnsmessage.RR, len(c.Answer))
+		for i, rr := range c.Answer {
+			req.Answer[i] = dnsmessage.Copy(rr)
+		}
+	}
```
DnsCache.FillInto only assigns req.Answer when c.Answer != nil. If the cache entry has a nil/empty Answer, this leaves any existing req.Answer intact, which can accidentally leak stale answers if the dnsmessage.Msg is reused. Always set req.Answer explicitly (set to nil/empty when c.Answer is nil) before setting the other response fields.
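A hedged sketch of the suggested fix, using stub types in place of the real `dnsmessage` (miekg/dns) ones: the point is simply to reassign `req.Answer` unconditionally so a reused message cannot carry stale records.

```go
package main

import "fmt"

// Stub stand-ins for dnsmessage.RR / dnsmessage.Msg (the real code uses
// github.com/miekg/dns); only the Answer field matters for this sketch.
type RR interface{ String() string }

type Msg struct{ Answer []RR }

type strRR string

func (s strRR) String() string { return string(s) }

type DnsCache struct{ Answer []RR }

// FillInto always reassigns req.Answer -- nil when the cache entry is
// empty -- so a reused Msg cannot leak a previous response's answers.
func (c *DnsCache) FillInto(req *Msg) {
	req.Answer = nil // clear any stale answers first
	if len(c.Answer) > 0 {
		req.Answer = make([]RR, len(c.Answer))
		copy(req.Answer, c.Answer) // real code deep-copies via dnsmessage.Copy(rr)
	}
}

func main() {
	req := &Msg{Answer: []RR{strRR("stale A record")}} // reused message
	(&DnsCache{}).FillInto(req)                        // empty cache entry
	fmt.Println(len(req.Answer))                       // 0: stale answer cleared
}
```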
Force-pushed: 71cf539 → 8828b24

Force-pushed: 6ecfee3 → 348514f
If the following isn't appropriate to post here, let me know and I'll delete it. feat(dns): optimize DNS caching with async write and lock-free upstream resolver (594f449). This improvement is really impressive~~
@olicesx By the way, this PR's commit count (100+) is getting huge. When do you expect to finish it?

02-26 20:19:55 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback (Total=7200)" dst="192.168.1.15:53"
02-26 20:19:15 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback" dst="198.18.0.2:53" error="reading map: key [192.168.1.171:57322, 17, 198.18.0.2:53]: lookup: key does not exist" src="192.168.1.171:57322"
02-26 20:06:49 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback (Total=6900)" dst="192.168.1.15:53"
@MaurUppi Debug logs exist precisely for debugging, and this PR doesn't change much logging-related content anyway. Let's revisit that later.
Showing this much at debug level is just my personal preference. But if the logging is reworked later, I'd hope for a structured format that is easy to filter and query.
@olicesx udp-taskpool-cpu-regression summary

Confirmed
New findings (not covered in the original write-up)
Fix recommendations

2026-02-27-udp-taskpool-cpu-regression-analysis

# dae CPU regression investigation notes (2026-02-27)

1. Background and symptoms
2. Key evidence (this investigation)
2.1 Online metrics
2.2 CPU pprof (2026-02-27 09:42:34)
2.3 goroutine pprof
3. Root-cause analysis
3.1 Direct defect (root cause)
3.2 Amplifying factors (trigger of this regression)
4. Commit attribution conclusion
5. Suggested fixes
5.1 Required fixes (do these first)
5.2 Recommended fixes (next step)
5.3 Observability additions
6. Post-fix acceptance criteria

Opus-reviewed

# Opus Review: UDP TaskPool CPU Regression Analysis

Reviewed: 2026-02-27

1. Verification of Live Evidence

All key claims independently confirmed via live pprof (captured 2026-02-27 ~10:00 CST):

Key observations:

2. Root Cause Validation
2.1
| Commit | Role | Evidence |
|---|---|---|
| a795323 (2026-02-17) | Introduced defect: LoadAndDelete + pointer compare pattern | git blame on tryDeleteQueue |
| 41d20f2 (2026-02-25) | Triggered regression: 100ms aging + broader QUIC path usage | git blame on UdpTaskPoolAgingTime |
4. Fix Recommendations — Review & Amendments
4.1 MUST FIX: tryDeleteQueue → CompareAndDelete ✅ Agree
```go
func (p *UdpTaskPool) tryDeleteQueue(key netip.AddrPort, expected *UdpTaskQueue) bool {
	return p.queues.CompareAndDelete(key, expected)
}
```

sync.Map.CompareAndDelete is available since Go 1.20; the project uses Go 1.26. Pointer comparison via any interface equality works correctly.
4.2 MUST FIX (MISSING): acquireQueue draining path
The p.queues.Delete(key) at line 246 should also use CompareAndDelete:
```go
if q.draining.Load() {
	p.queues.CompareAndDelete(key, q) // only delete if still the draining queue
	goto createNew
}
```

4.3 SHOULD FIX: Tighten IsLikelyQuicInitialPacket ✅ Agree
Add a version field check (bytes 1-4). All deployed QUIC versions use specific version numbers:

- QUIC v1: 0x00000001
- QUIC v2: 0x6b3343cf
- Version negotiation: 0x00000000

Even checking that buf[1:5] matches a known version set would cut false positives dramatically.
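A minimal sketch of the tightened prefilter. The function name follows the PR's helper, but the exact signature and the long-header check are assumptions; a production version would also check the fixed bit and packet-type bits.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// quicVersions holds the version numbers listed above; anything else in
// bytes 1-4 of a long-header packet is rejected by the prefilter.
var quicVersions = map[uint32]bool{
	0x00000001: true, // QUIC v1 (RFC 9000)
	0x6b3343cf: true, // QUIC v2 (RFC 9369)
	0x00000000: true, // version negotiation
}

// isLikelyQuicInitialPacket sketches the check: long-header form bit set
// (0x80 in byte 0) plus a known version number in bytes 1-4.
func isLikelyQuicInitialPacket(buf []byte) bool {
	if len(buf) < 5 || buf[0]&0x80 == 0 {
		return false // too short, or short-header form
	}
	return quicVersions[binary.BigEndian.Uint32(buf[1:5])]
}

func main() {
	v1Initial := []byte{0xc0, 0x00, 0x00, 0x00, 0x01}
	garbage := []byte{0xc0, 0xde, 0xad, 0xbe, 0xef}
	fmt.Println(isLikelyQuicInitialPacket(v1Initial), isLikelyQuicInitialPacket(garbage)) // true false
}
```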
4.4 SHOULD EVALUATE: agingTime=100ms
100ms is aggressive. Consider:
- 1s default with configurable override
- Or adaptive: start at 100ms for known-QUIC (version-verified), use longer timeout for unverified matches
4.5 OPTIONAL: Convoy goroutine exit safety net
Add a maximum lifetime or iteration count to convoy's cleanup loop as defense-in-depth. If tryDeleteQueue fails N times consecutively with no tasks received, force-exit. This prevents permanent leaks from any future race conditions.
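The defense-in-depth bound could look like this hedged sketch; `convoyLoop`, `maxIdleMisses`, and the loop shape are assumptions for illustration, not the PR's code:

```go
package main

import (
	"fmt"
	"time"
)

// convoyLoop sketches the safety net: if the queue stays idle and the
// delete keeps failing, force-exit after maxIdleMisses consecutive misses
// instead of spinning forever on an orphaned queue.
func convoyLoop(tasks <-chan func(), tryDelete func() bool) (forced bool) {
	const maxIdleMisses = 10
	misses := 0
	for {
		select {
		case task, ok := <-tasks:
			if !ok {
				return false
			}
			task()
			misses = 0 // activity resets the idle counter
		case <-time.After(time.Millisecond): // stand-in for the aging timer
			if tryDelete() {
				return false // cleanly removed from the pool
			}
			if misses++; misses >= maxIdleMisses {
				return true // safety net: assume orphaned and exit anyway
			}
		}
	}
}

func main() {
	tasks := make(chan func())
	// A tryDelete that always fails models the orphaned-queue race.
	fmt.Println(convoyLoop(tasks, func() bool { return false })) // true
}
```

With the CompareAndDelete fixes in place this branch should never fire; it only converts any future race into a bounded leak.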
5. Validation Criteria — Agree with additions
The original acceptance criteria are correct. Add:

- No convoy goroutine should exist without a corresponding entry in the queues sync.Map (can verify via pprof goroutine dump + metrics endpoint udp_task_pool_count)
- After fix deployment, convoy goroutine count should track queue count closely (within single-digit delta at any point)
- CPU should drop to baseline within minutes, not gradually (the fix eliminates the leak; existing orphans exit on restart)
6. Summary Assessment
| Aspect | Rating | Notes |
|---|---|---|
| Problem identification | Excellent | Correct root cause, correct commit attribution |
| Evidence quality | Excellent | pprof CPU + goroutine + metrics all consistent |
| Race analysis | Good | Primary race correct; missed secondary race in acquireQueue |
| Fix proposal | Good | CompareAndDelete is the right fix; missing acquireQueue fix |
| Severity assessment | Accurate | Real regression, accumulates over time, will worsen |
Bottom line: The analysis is sound and actionable. Apply both CompareAndDelete fixes (tryDeleteQueue + acquireQueue draining path), tighten QUIC detection, and the regression will be resolved.
Thanks for the review.
It's me who should thank you for your continued contributions. Test/build details: MaurUppi#26. EDITED: see the PR#26 before/after report <-- appended a same-methodology retest 24h later (two windows + pprof)
udp-taskpool CPU regression before/after monitoring (2026-02-27)

1. Monitoring goals and targets
2. Pre-fix baseline (Before)

Time point:

3. Post-fix monitoring (After)
3.1 First window (early after startup)

Monitoring period:
Summary:
pprof (end of first window,

3.2 Second window (1h+ after startup)

Monitoring period:
Summary:
pprof (end of second window,

4. Before/After conclusions
4.1 The regression fix is effective (core metrics)
4.2 Current state assessment
5. Follow-up recommendations
6. 2026-02-28 same-methodology retest (two windows + pprof)

Retest goal: verify the key judgment from the Section 5 recommendations. Methodology notes (consistent with Section 3):

Raw sample files:

6.1 First window

Monitoring period:
Summary:
pprof (end of first window,

6.2 Second window

Monitoring period:
Summary:
pprof (end of second window,

6.3 Retest conclusion on "stable low double digits"

7. New findings and next steps (Systematic Debugging)

Based on this retest's evidence, the current behavior looks more like a "convoy re-amplified / persistent" failure mode than "stable low-level fluctuation". Suggested next step: continue Phase 1 with the same process (evidence-gathering only, no code changes yet):

8. 2026-02-28 additional evidence: verifying the remaining "resident convoy" paths (no code changes)

8.1 Evidence goals
8.2 Key live evidence (production)

Sampling time:
Single-point sample:
Short time-series samples (8 points total, excerpt):

Observed conclusions:

8.3 Mapping to code paths

Relevant code points:
Remaining paths that can cause residency (based on the code and the metric divergence above):

8.4 Section conclusion (Phase 1)

9. Phase 2/3: single-variable minimized verification (
Locating the commits in PR#936 that still need follow-up fixes

The PR on my own fork: #27
Force-pushed: e9685c7 → 7eafc6e
…bility

- Simplified the dnsForwarderKey structure by removing unnecessary dialArgument.
- Added tests for ResetDnsForwarders to ensure in-flight forwarders are handled correctly.
- Enhanced DNSListener to use atomic pointers for the ControlPlane, improving thread safety.
- Updated dnsHandler to utilize the new Controller method for better error handling.
- Introduced new methods in failedQuicDcidCache for managing shard storage and cleanup.
- Improved routing matcher builder to retain state in snapshots and refactored kernspace building logic.
- Added tests to verify the integrity of the routing kernspace snapshot.
- Enhanced UDP handling with new packet sending functions to support advanced features.
The reload preparation path in cmd/run.go uses a 45-second timeout context that was leaking into ControlPlane lifecycle contexts via context.WithCancel(ctx). When the timeout fired, Serve() would exit and all traffic (both direct and proxy) would die.

- Derive all CP-owned contexts from context.Background() instead of the caller's potentially timed-out ctx
- Add retired atomic.Bool to block stale health-check callback writes during drain
- Add MarkRetired() to both staged and non-staged retirement goroutines
- Add Serve() exit reason logging to distinguish normal vs timeout-driven exits

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
DNS controller workers (bpfUpdateWorker, janitor, evictor) watched baseContext().Done(), which changes across reload generations. When the lifecycle context swapped during reload, workers would exit prematurely.

Remove the baseContext().Done() watches so workers survive across reloads. Workers are stopped via explicit stop channels closed in DnsController.Close().

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
clearRejectedReloadProgress() hardcoded SignalProgressFilePath for reads, but tests override the writer to use temp files. Add a getRunSignalProgress variable so tests can override both read and write paths.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Three DNS router tests use real UDP sockets with SO_MARK, which requires CAP_NET_ADMIN. Add a skipIfNoSocketMark helper and skip these tests in CI containers that lack the capability.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
RestoreHealthSnapshot unconditionally set reloadInheritedHealth, which added a full CheckInterval (~30s) delay before the first health check. When dialers inherited NOT-ALIVE state from the previous generation, they stayed unreachable for 30+ seconds after reload.

Only defer the first health check when ALL inherited collections are ALIVE. NOT-ALIVE dialers need an immediate probe to recover connectivity.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…and limit response reads

- errors: IsUDPEndpointNormalClose(nil) returns false to match the companion function
- netutils: comma-ok type assertion for logger from context value
- subscription: cap io.ReadAll with a 10MB LimitReader
- config_merger: defer f.Close() after os.Open to prevent fd leak
- rawsock_linux: syscall.Close(sock) on bind failure to prevent fd leak

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…t DoH response reads

- daedns: add singleflight.Group to LookupIPAddr for concurrent lookup dedup
- daedns: use sync.Pool for 65535-byte UDP DNS buffers instead of per-query allocation
- daedns: cap DoH response with io.LimitReader(resp.Body, 65535)
- control/dns: cap DoH response with io.LimitReader(resp.Body, 65535)

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
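The lookup dedup uses `golang.org/x/sync/singleflight`; its coalescing behavior can be shown with a minimal stdlib stand-in (`group` below is a toy reimplementation for illustration, not the real API):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// group is a toy stand-in for singleflight.Group: concurrent Do calls with
// the same key share one execution of fn and all receive its result.
type group struct {
	mu      sync.Mutex
	calls   map[string]*call
	waiters atomic.Int32 // instrumentation for this demo only
}

type call struct {
	wg  sync.WaitGroup
	val string
}

func (g *group) Do(key string, fn func() string) string {
	g.mu.Lock()
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		g.waiters.Add(1)
		c.wg.Wait() // another goroutine is already resolving this key
		return c.val
	}
	c := &call{}
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn() // only the first caller actually performs the lookup
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val
}

func main() {
	g := &group{calls: map[string]*call{}}
	var lookups atomic.Int32
	entered := make(chan struct{})
	release := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // leader performs the single real lookup
		defer wg.Done()
		g.Do("example.org", func() string {
			close(entered) // the key is now registered in the group
			<-release
			lookups.Add(1)
			return "93.184.216.34"
		})
	}()
	<-entered
	for i := 0; i < 3; i++ { // three concurrent duplicates coalesce
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("example.org", func() string { lookups.Add(1); return "dup" })
		}()
	}
	for g.waiters.Load() < 3 { // wait until all duplicates are parked
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	fmt.Println(lookups.Load()) // 1: duplicates reused the leader's result
}
```

The real singleflight.Group additionally returns an error and a shared flag, and supports DoChan and Forget; this sketch only shows the coalescing core.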
…e exit, and connectivity check limit

- routing: detect IPv6 with ':' and use /128 instead of /32
- outbound/filter: cache compiled regexp2 patterns in a sync.Map
- sniffing: select on ctx.Done() in readStreamOnceAsync to prevent goroutine leak
- connectivity_check: cap debug body read with io.LimitReader(resp.Body, 4096)

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
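The pattern-cache change follows a common shape: compile once, store in a sync.Map keyed by the pattern string. A hedged sketch using stdlib regexp (the PR caches regexp2 patterns, but the caching pattern is the same):

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// patternCache caches compiled patterns so hot filter paths don't
// recompile the same expression on every call.
var patternCache sync.Map // pattern string -> *regexp.Regexp

func compileCached(pattern string) (*regexp.Regexp, error) {
	if v, ok := patternCache.Load(pattern); ok {
		return v.(*regexp.Regexp), nil // cache hit: no recompilation
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	// LoadOrStore keeps the first winner if two goroutines race to compile.
	v, _ := patternCache.LoadOrStore(pattern, re)
	return v.(*regexp.Regexp), nil
}

func main() {
	a, _ := compileCached(`^geosite:`)
	b, _ := compileCached(`^geosite:`)
	fmt.Println(a == b, a.MatchString("geosite:google")) // true true
}
```

One caveat of this shape: failed compiles are not cached, so a filter list with a persistently bad pattern recompiles it each call; caching the error too is a possible refinement.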
…te audit-fix head

- cmd/run: convert if-else chain to switch for golangci-lint gocritic
- go.mod: replace local outbound with remote olicesx/outbound pseudo-version

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- Introduced a shared timeout for DNS lookups to prevent blocking indefinitely.
- Added functions to interrupt connections on context cancellation for both TCP and QUIC.
- Enhanced the Router to utilize the new timeout and interruption mechanisms.
- Updated tests to verify the behavior of deduplicated lookups and large UDP responses.
- Modified the DNS forwarder to track consecutive errors and retire after a threshold.
- Adjusted the handling of proxy TCP forwarders to retain them on ordinary transport errors.
- Updated go.mod to use the latest outbound dependency version.
Background
This PR improves control-plane robustness with a focus on DNS forwarding reliability and concurrency safety.
Main updates include:
Improve tcp+udp DNS upstream behavior with robust UDP-first and TCP fallback handling.
Feed DNS forward failures into dialer health feedback so failover decisions can react faster.
Harden DNS/UDP connection lifecycle under high concurrency.
Add regression tests for fallback, timeout cleanup, and pool safety.
Checklist
Full Changelogs
feat(dns): add robust DNS forward fallback path for tcp+udp upstream (UDP-first with TCP fallback on request failure).
fix(dns): report DNS forward failures to dialer health feedback path to improve failover quality.
fix(control): harden DNS/UDP connection lifecycle handling in high-concurrency paths.
test(control): add regression tests for DNS fallback, timeout cleanup, and pool concurrency safety.
Issue Reference
Closes #[issue number]
Test Result
Environment: Linux
Passed:
go test ./common/consts ./component/sniffing ./control
go test -race ./control
go build .