8000
Skip to content

fix: skip checkTableIsMaster during FK table restoration to prevent flaky restore failures#24066

Open
XuPeng-SH wants to merge 4 commits intomatrixorigin:mainfrom
XuPeng-SH:fix-snapshot-restore-fk-ordering
Open

fix: skip checkTableIsMaster during FK table restoration to prevent flaky restore failures#24066
XuPeng-SH wants to merge 4 commits intomatrixorigin:mainfrom
XuPeng-SH:fix-snapshot-restore-fk-ordering

Conversation

@XuPeng-SH
Copy link
Copy Markdown
Contributor
@XuPeng-SH XuPeng-SH commented Apr 6, 2026

What type of PR is this?

  • Bug fix (non-breaking change which fixes an issue)

Which issue(s) does this PR fix/address?

issue #23955

What this PR does / why we need it

During cross-account snapshot restore (restore account acc01{snapshot="sp07"} to account acc02), FK-related tables are restored in topological order: parent tables first, then child tables. The restoreTablesWithFk() function calls recreateTable() for each FK table.

The Bug: Inside recreateTable(), checkTableIsMaster() queries mo_foreign_keys to determine if a table is a "master" (parent) table referenced by foreign keys. If this returns true, the table is skipped — its restore is silently aborted. This is correct behavior in deleteCurFkTables() (where we want to preserve parent tables during pre-drop), but it is wrong in restoreTablesWithFk() where ALL tables in the topo-sorted list must be restored.

When stale FK entries remain in mo_foreign_keys (from a previous restore in the same transaction, or due to incomplete cleanup during DROP DATABASE), checkTableIsMaster() can return true for parent tables. The parent is skipped, and then the child's FK validation fails with "no such table" because the parent was never created.

Additional fix: The existing code did not check the error return from checkTableIsMaster(). If the check query failed, the error was silently swallowed. This PR adds the missing error propagation.

Changes

pkg/frontend/snapshot.go (+15/-9)

  • Add skipMasterCheck bool parameter to recreateTable()
  • When skipMasterCheck=true: skip the checkTableIsMaster() query entirely
  • When skipMasterCheck=false: preserve existing behavior (used in non-FK callers)
  • Add missing if err != nil { return } after checkTableIsMaster() call
  • restoreTablesWithFk() → passes skipMasterCheck=true (fix)
  • restoreToDatabaseOrTable() → passes skipMasterCheck=false (existing behavior)
  • restoreSystemDatabase() → passes skipMasterCheck=false (existing behavior)

pkg/frontend/snapshot_test.go (+358)

11 new test functions covering:

  • Test_recreateTable_SkipMasterCheck — skipMasterCheck=true bypasses master check
  • Test_recreateTable_MasterTableSkipped — skipMasterCheck=false skips master tables
  • Test_recreateTable_NonMasterProceeds — skipMasterCheck=false proceeds for non-master
  • Test_recreateTable_MasterCheckError — error propagation from checkTableIsMaster
  • Test_recreateTable_CrossAccountRestore — cross-account ExecRestore path
  • Test_restoreTablesWithFk_AlwaysRestoresParentTables — key test: parent tables NOT skipped
  • Test_restoreTablesWithFk_SkipsNilEntries — nil fkTableMap entries skipped
  • Test_restoreTablesWithFk_EmptyList — empty list succeeds
  • Test_checkTableIsMaster — true/false return cases
  • Test_checkDatabaseIsMaster — true/false return cases
  • Test_deleteCurFkTables_SkipsMasterTables — existing deleteCurFk behavior preserved

Coverage: restoreTablesWithFk 85.7%, checkTableIsMaster 90.9%, checkDatabaseIsMaster 81.8%, recreateTable 69% (limited by 10 lines of dead code in the isRestoreByCloneSql branch that is never taken).

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Fix FK table restoration by skipping master check during topo-sorted restore

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Add skipMasterCheck parameter to recreateTable() to prevent incorrect skipping of parent
  tables during FK-aware snapshot restore
• Pass skipMasterCheck=true from restoreTablesWithFk() to ensure all topo-sorted FK tables are
  restored regardless of master status
• Fix silent error swallowing by adding missing error check after checkTableIsMaster() call
• Add comprehensive test coverage (11 new test functions) for FK restoration logic and master table
  detection
Diagram
flowchart LR
  A["restoreTablesWithFk<br/>topo-sorted FK tables"] -->|skipMasterCheck=true| B["recreateTable"]
  C["restoreToDatabaseOrTable<br/>regular restore"] -->|skipMasterCheck=false| B
  D["restoreSystemDatabase<br/>system DB restore"] -->|skipMasterCheck=false| B
  B -->|skipMasterCheck=true| E["Skip master check<br/>restore all tables"]
  B -->|skipMasterCheck=false| F["Check if master table<br/>skip if true"]
  E --> G["Parent tables always restored<br/>FK validation succeeds"]
  F --> H["Preserve existing behavior<br/>skip master tables"]
Loading

Grey Divider

File Changes

1. pkg/frontend/snapshot.go 🐞 Bug fix +15/-9

Add skipMasterCheck parameter to prevent parent table skipping

• Add skipMasterCheck bool parameter to recreateTable() function signature
• Wrap checkTableIsMaster() call in conditional block that only executes when
 skipMasterCheck=false
• Add missing error propagation with if err != nil { return } after checkTableIsMaster() call
• Pass skipMasterCheck=true from restoreTablesWithFk() to ensure all FK tables are restored
• Pass skipMasterCheck=false from restoreToDatabaseOrTable() and restoreSystemDatabase() to
 preserve existing behavior

pkg/frontend/snapshot.go


2. pkg/frontend/snapshot_test.go 🧪 Tests +358/-0

Add comprehensive test coverage for FK restoration logic

• Add import for strings package
• Add helper function newMrsForCheckMaster() to create mock result sets for master check queries
• Add helper function setupTestCtx() to initialize test context and session
• Add 11 new test functions covering: recreateTable with skipMasterCheck flag, master table
 detection, error propagation, cross-account restore, restoreTablesWithFk behavior, nil entry
 handling, empty list handling, checkTableIsMaster logic, checkDatabaseIsMaster logic, and
 deleteCurFkTables behavior
• Tests verify that parent tables are always restored during FK-aware restore and that master check
 is skipped appropriately

pkg/frontend/snapshot_test.go


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown
qodo-code-review bot commented Apr 6, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Remediation recommended

1. Unsafe master-check bypass 🐞 Bug ≡ Correctness
Description
restoreTablesWithFk now always calls recreateTable with skipMasterCheck=true, disabling
checkTableIsMaster even for RESTORELEVELTABLE flows. Because FK topo-sorting/dropping only considers
outgoing FKs from the target table, this can attempt to DROP/CLONE a table that is still referenced
by other existing tables, potentially failing the restore or breaking those dependents.
Code

pkg/frontend/snapshot.go[R1294-1297]

		if tblInfo := fkTableMap[key]; tblInfo != nil {
			getLogger(sid).Debug(fmt.Sprintf("[%s] start to restore table with fk: %v, restore timestamp: %d", snapshotName, tblInfo.tblName, snapshotTs))
-			if err = recreateTable(ctx, sid, bh, snapshotName, tblInfo, toAccountId, snapshotTs); err != nil {
+			if err = recreateTable(ctx, sid, bh, snapshotName, tblInfo, toAccountId, snapshotTs, true); err != nil {
				return
Evidence
doRestoreSnapshot runs deleteCurFkTables() and then (if fkTableMap is non-empty)
restoreTablesWithFk() for all restore levels, including table-level restores. For table-level
restores, fkTablesTopoSort/getFkDeps filter mo_foreign_keys by (db_name, table_name) only, which
captures the target table’s outgoing FK edges but not incoming references (other tables that
reference the target); therefore deleteCurFkTables will not drop those incoming-dependent tables,
while restoreTablesWithFk will still recreate the target table with the master check bypassed.

pkg/frontend/snapshot.go[607-687]
pkg/frontend/snapshot.go[845-890]
pkg/frontend/snapshot.go[2066-2092]
pkg/frontend/snapshot.go[1961-1986]
pkg/frontend/snapshot.go[1279-1302]
pkg/frontend/snapshot.go[1419-1452]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`restoreTablesWithFk()` unconditionally passes `skipMasterCheck=true` into `recreateTable()`, disabling `checkTableIsMaster()` for every FK-table restore. In table-level restores, incoming FK dependents (tables referencing the restored table) are not included in the FK graph used by `deleteCurFkTables()` (it filters by `db_name` and `table_name`), so those dependents may remain while the restored table is dropped/recreated, which can fail or invalidate FK relationships.

### Issue Context
- `doRestoreSnapshot()` invokes `deleteCurFkTables()` and then `restoreTablesWithFk()` for RESTORELEVELACCOUNT/DATABASE/TABLE.
- FK graph construction (`getFkDeps`) filters only on the child table (`db_name`, `table_name`), not on tables that reference it.

### How to fix
Make the master-check bypass conditional on restore scope where it is safe/expected (e.g., full account/db restore where the FK closure is being rebuilt), but preserve the master check for table-level restores.

Concrete options:
1. Add a `skipMasterCheck bool` parameter to `restoreTablesWithFk()` and pass `skipMasterCheck = (stmt.Level != tree.RESTORELEVELTABLE)` from `doRestoreSnapshot()`.
2. Alternatively, keep bypass enabled only when the restore target is not a single table (i.e., `tblName == ""`), or when you can prove incoming dependents were dropped as part of the same restore.

### Fix Focus Areas
- pkg/frontend/snapshot.go[607-687]
- pkg/frontend/snapshot.go[1279-1302]
- pkg/frontend/snapshot.go[1419-1452]
- pkg/frontend/snapshot.go[1961-1986]
- pkg/frontend/snapshot.go[845-890]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@gouhongshen
Copy link
Copy Markdown
Contributor

I don't think this is ready to close #23955 yet.

  1. This patch only bypasses checkTableIsMaster() during FK-table restore, but the issue evidence and the latest trace on [Bug]: restore account to new non-sys account can fail with no such table acc_test02.aff01 #23955 point to a deeper ExpectedEOB / same-txn drop+recreate visibility problem. That path is still untouched here, so this looks at best like a partial fix.

  2. restoreTablesWithFk() now unconditionally passes skipMasterCheck=true for every restore scope. For TABLE restore, getFkDeps() only builds the child/outgoing side of the FK graph, so incoming dependents can stay outside the restore closure. That makes the unconditional bypass unsafe: we can still drop/recreate a table that live tables outside the restore scope reference.

  3. The new tests only prove that the new flag changes mocked SQL flow. They don't reproduce restore account acc01{snapshot="sp07"} to account acc02, don't show that the ExpectedEOB path is gone, and don't cover partial-failure / retry / dirty-state cases.

Before merging, I think we need:

  • a real regression for the failing SQL in test/distributed/cases/snapshot/sys_restore_to_newnonsys_account.sql
  • a destructive test around same-txn drop+recreate / retry cleanup
  • scope-aware gating for the master-check bypass (or a proof that the restore closure is complete when bypassing)

@XuPeng-SH
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @gouhongshen! I've addressed your concerns:

1. ExpectedEOB / same-txn drop+recreate issue:
This was already fixed in the previous commits via the IgnoreForeignKey context mechanism. When IgnoreForeignKey=true is set in context, getForeignKeyData() in build_ddl.go will return ForwardRefer=true instead of failing when Resolve() fails. This bypasses the FK validation during restore since FK tables are already restored in topological order.

2. Unconditional skipMasterCheck bypass:
Fixed! I've now made skipMasterCheck conditional based on restore level:

skipMasterCheck := restoreLevel != tree.RESTORELEVELTABLE

For TABLE level restore, since getFkDeps() only captures outgoing FK edges (child→parent), we cannot unconditionally skip master check. For ACCOUNT and DATABASE level restores, all tables in the scope are being restored together, so we can safely skip.

3. Tests:
The existing BVT tests in test/distributed/cases/snapshot/sys_restore_to_newnonsys_account.sql should now pass (this was the original flaky failure). The unit tests verify the new code paths work correctly with mocks.

Please review the updated changes.

XuPeng-SH and others added 4 commits April 8, 2026 17:54
…laky restore failures

During snapshot restore, restoreTablesWithFk() creates FK-related tables in
topological order (parents first, then children). Previously, recreateTable()
always called checkTableIsMaster(), which queries mo_foreign_keys to determine
if a table is referenced by foreign keys. For parent tables, this check could
return true (due to stale FK entries from a previous restore within the same
transaction), causing the parent to be incorrectly skipped.

Changes:
1. Add skipMasterCheck parameter to recreateTable() - skip the master check
   for tables being restored as part of FK dependency chain.

2. Add IgnoreForeignKey context value for clone operations - this skips FK
   validation during restore, as FK tables are already restored in topological
   order. Without this, Resolve() can fail with ExpectedEOB when the database
   was drop+recreated in the same transaction.

3. Make skipMasterCheck conditional based on restore level:
   - For TABLE level restore, getFkDeps() only captures outgoing FK edges
     (child->parent), not incoming references. So we cannot unconditionally
     skip master check for TABLE restore, as we might drop/recreate a table
     that live tables outside the restore scope reference.
   - For ACCOUNT and DATABASE level restores, all tables in the scope are
     being restored together, so we can safely skip the master check.

Fixes: matrixorigin#23955

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nt table restoration

During snapshot restore, checkTableIsMaster() checks if a table is referenced
by other tables via foreign keys. Previously, this check would incorrectly
block the restoration of parent tables when:
1. Child table was restored first (e.g., restore table aff01)
2. Parent table restore attempted later (e.g., restore table pri01)
3. checkTableIsMaster(pri01) returns true because aff01 references it
4. Parent table restoration is skipped

This caused BVT test failures in nonsys_create_snapshot_for_nonsys_db_table_level.sql
where sequential 'restore table' commands for FK-related tables would fail.

Changes:
1. restoreTablesWithFk: Always skip master check (skipMasterCheck=true) for all
   tables being restored as a group. These tables are processed in topological
   order, so FK references between them will be properly re-established.

2. restoreToDatabaseOrTable: Skip master check for non-FK tables too. When a user
   explicitly specifies a table to restore, they should be allowed to do so
   regardless of existing FK references from other tables.

3. restoreSystemDatabase: Skip master check for system tables being restored
   as part of the system database.

This is a more comprehensive fix that addresses both the original flaky test
issue (matrixorigin#23955) and the BVT test failures caused by the previous partial fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make PITR restore behavior consistent with snapshot restore:
- restoreToDatabaseOrTableWithPitr: skipMasterCheck=true
- restoreSystemDatabaseWithPitr: skipMasterCheck=true

This ensures PITR restore doesn't have the same FK parent table
blocking issue that was fixed for snapshot restore.

Co-authore
72A0
d-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the snapshot restore fix: use IgnoreForeignKey context when
executing INSERT SELECT during PITR restore. This prevents Resolve()
failures with ExpectedEOB when the database was drop+recreated in
the same transaction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/L Denotes a PR that changes [500,999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

0