
Conversation

@tomdicarlo tomdicarlo commented Dec 9, 2025

Description

This PR demonstrates an alternative search approach for sandcastles that uses a vector embedding search instead of the existing pagefind search. It updates the gallery build process to create a vector embedding of each sandcastle, and then compares those embeddings against the user query.
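The ranking step of such a search can be sketched roughly as follows. This is an illustrative sketch only (function and field names are assumptions, not the PR's actual code), assuming each sandcastle's embedding was precomputed during the gallery build:

```javascript
// Illustrative sketch: rank sandcastles by cosine similarity between a
// query embedding and precomputed sandcastle embeddings. Names here are
// assumptions, not the PR's actual implementation.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every sandcastle against the query embedding, highest first.
function rankByEmbedding(queryEmbedding, sandcastles) {
  return sandcastles
    .map((s) => ({ ...s, score: cosineSimilarity(queryEmbedding, s.embedding) }))
    .sort((a, b) => b.score - a.score);
}
```

The embeddings only need to be generated once at build time; at query time the only model invocation is embedding the user's query string.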

To allow everything to run locally, we selected a small MIT-licensed model from Hugging Face. One consideration for this PR is model download size and hardware performance limitations. In testing, memory usage from the model did not rise above what several of the existing sandcastles already consume; however, the download size is approximately 30 MB, which is several times larger than the largest file download I have been able to produce from an existing sandcastle.

I will open this PR as a draft for now, as modifications will likely need to be made to support a hybrid approach between Pagefind and Vector embedding search, which we can discuss further within the PR.

Issue number and link

internal issue

Testing plan

My changes were tested manually against a small set of search queries that I found to expose gaps in the current Cesium Sandcastle search.

I did not include any automated testing of the functionality, since testing does not appear to be configured in the application, but I would be glad to add automated tests if that is requested before moving the PR out of draft.

Author checklist

  • I have submitted a Contributor License Agreement
  • I have added my name to CONTRIBUTORS.md
  • I have updated CHANGES.md with a short summary of my change
  • I have added or updated unit tests to ensure consistent code coverage
    • No, as noted in testing plan
  • I have updated the inline documentation, and included code examples where relevant
  • I have performed a self-review of my code

@tomdicarlo tomdicarlo requested review from ggetz and jjspace December 9, 2025 19:02
github-actions bot commented Dec 9, 2025

Thank you for the pull request, @tomdicarlo!

✅ We can confirm we have a CLA on file for you.

@jjspace jjspace left a comment

Thanks for the PR @tomdicarlo. I have a few initial comments about the code and some larger questions.

  1. We previously discussed a "hybrid" approach that mixes the pagefind results with the embedding results, or displays both. Is there a reason we're not doing that here? Was it just a proof of concept, or do you feel the new search is already strong and fast enough on its own?
  2. The new search seems to completely ignore the labels filter. Is this intended to be incorporated at some point and just isn't included yet?
  3. I think it would be really helpful to have a loading spinner or some sort of indication while the search is happening. This is something we probably should already have, but I can see it being needed even more with potential model latency (even though it is pretty fast on my machine currently).
    • I think this would be extra helpful for the initial loading of the model, which may be slow on slower networks
  4. Can you explain what knobs we have to tune the results?
    • One sub-point here. After some quick testing it feels like it sometimes returns more results than it "should". For example searching for "Moon" does give the moon sandcastle but it also gives a bunch of other unrelated sandcastles. We don't have many in the first place but I wonder if it's actually helpful to include these results that are so separate.
  5. We previously discussed potentially using the embeddings to generate extra metadata about sandcastles at build time or as a one-time build step. I forget if that's still being considered, and if it is, is that part of the plan for this PR or a separate effort? (Fine if it's separate, just clarifying.)
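On the "knobs" question in item 4, the usual tuning points for this kind of search are a minimum-similarity cutoff (which would address the unrelated "Moon" results) and a cap on result count. A minimal sketch, with illustrative names and thresholds that would need tuning against real queries:

```javascript
// Illustrative sketch of two common tuning knobs for embedding search:
// drop results below a similarity cutoff, then cap how many are returned.
// Assumes `scored` is already sorted by score, highest first.
function filterResults(scored, { minScore = 0.3, maxResults = 20 } = {}) {
  return scored
    .filter((r) => r.score >= minScore)
    .slice(0, maxResults);
}
```

The right cutoff depends heavily on the model and how the sandcastle text is embedded, so it is best chosen empirically.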

`Loading embedding model: ${this.modelId} (this may take a moment on first load)...`,
);
this.tokenizer = await AutoTokenizer.from_pretrained(this.modelId);
this.model = await AutoModel.from_pretrained(this.modelId);

I assume this downloads the model directly from huggingface? Is there a way to include it locally? Or is that bad practice or limited by the license?
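For reference, transformers.js can be pointed at model files served from the app's own origin instead of the Hugging Face Hub, roughly as below. The paths and model id are illustrative, and whether bundling the weights is appropriate would still depend on the model's license:

```javascript
import { env, AutoTokenizer, AutoModel } from "@huggingface/transformers";

// Serve model files from our own static assets rather than huggingface.co.
// The local directory layout must mirror the Hub repo (config.json,
// tokenizer files, onnx/ weights). Paths and model id are illustrative.
env.allowRemoteModels = false;
env.localModelPath = "/sandcastle/models/";

const tokenizer = await AutoTokenizer.from_pretrained("my-embedding-model");
const model = await AutoModel.from_pretrained("my-embedding-model");
```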


// Perform vector search when search term changes (with debounce)
useEffect(() => {
const pagefind = getPagefind();

I don't think we want to completely remove pagefind, so we still support offline or isolated environments.
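One common way to keep pagefind in the loop is a simple merge that prefers keyword matches and backfills with embedding results. A minimal sketch, with illustrative names; actual ranking strategy is open for discussion:

```javascript
// Illustrative sketch of a hybrid merge: take exact/keyword matches from
// pagefind first, then append embedding results not already present.
function mergeResults(pagefindResults, vectorResults, maxResults = 20) {
  const seen = new Set(pagefindResults.map((r) => r.id));
  const merged = [...pagefindResults];
  for (const r of vectorResults) {
    if (!seen.has(r.id)) {
      seen.add(r.id);
      merged.push(r);
    }
  }
  return merged.slice(0, maxResults);
}
```

This also degrades gracefully: if the model fails to load (offline, blocked network), `vectorResults` is empty and the search falls back to pure pagefind behavior.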

}) => {
// Find the item by matching the legacy_id to the slug
// The legacyIds map should provide legacy_id -> slug mapping
const slug = legacyIds[result.legacy_id];

This is not the right way to use this. The slug should already be the id. The legacy id is only there for loading historical URLs from the previous version of Sandcastle.
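In other words, if the search index stores the gallery item's own id, the lookup reduces to a direct match with no legacy-id indirection. An illustrative sketch:

```javascript
// Illustrative: look up a gallery item directly by the id the search
// index returned, rather than going through the legacyIds map.
function findItem(items, result) {
  return items.find((i) => i.id === result.id);
}
```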

// The legacyIds map should provide legacy_id -> slug mapping
const slug = legacyIds[result.legacy_id];

const item = items.find((item: { id: string; url: string }) => {

I don't really understand this search function. Is this just trying to make up for bad data from the embedding search?
