
Conversation

@tomdicarlo tomdicarlo commented Dec 9, 2025

Description

This PR demonstrates an alternative search approach for sandcastles that uses a vector embedding search instead of the existing pagefind search. It updates the gallery build process to create a vector embedding of each sandcastle, and then compares those embeddings against the user query.
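The ranking step of such a search can be sketched roughly as follows. This is an illustrative sketch only (function and field names are assumptions, not the PR's actual code), assuming each sandcastle's embedding was precomputed during the gallery build:

```javascript
// Illustrative sketch: rank sandcastles by cosine similarity between a
// query embedding and precomputed sandcastle embeddings. Names here are
// assumptions, not the PR's actual implementation.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every sandcastle against the query embedding, highest first.
function rankByEmbedding(queryEmbedding, sandcastles) {
  return sandcastles
    .map((s) => ({ ...s, score: cosineSimilarity(queryEmbedding, s.embedding) }))
    .sort((a, b) => b.score - a.score);
}
```

The embeddings only need to be generated once at build time; at query time the only model invocation is embedding the user's query string.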

To allow everything to run locally, we selected a small MIT-licensed model from Hugging Face. One consideration for this PR is model download size and hardware performance limitations. In testing, memory usage from the model did not rise above what several of the existing sandcastles already consume; however, the download size is approximately 30 MB, which is several times larger than the largest file download I have been able to produce from an existing sandcastle.

I will open this PR as a draft for now, as modifications will likely need to be made to support a hybrid approach between Pagefind and Vector embedding search, which we can discuss further within the PR.

Issue number and link

internal issue

Testing plan

My changes were tested manually against a small set of search queries that I found to expose gaps in the current Cesium Sandcastle search.

I did not include any automated testing of the functionality, since testing does not appear to be configured in the application, but I would be glad to add automated tests if that is requested before moving the PR out of draft.

Author checklist

  • I have submitted a Contributor License Agreement
  • I have added my name to CONTRIBUTORS.md
  • I have updated CHANGES.md with a short summary of my change
  • I have added or updated unit tests to ensure consistent code coverage
    • No, as noted in testing plan
  • I have updated the inline documentation, and included code examples where relevant
  • I have performed a self-review of my code

@tomdicarlo tomdicarlo requested review from ggetz and jjspace December 9, 2025 19:02
github-actions bot commented Dec 9, 2025

Thank you for the pull request, @tomdicarlo!

✅ We can confirm we have a CLA on file for you.

@jjspace jjspace left a comment

Thanks for the PR @tomdicarlo. I have a few initial comments about the code and some larger questions.

  1. We previously discussed a "hybrid" approach that mixes the pagefind results with the embedding results, or displays both. Is there a reason we're not doing that here? Was it just a proof of concept, or do you feel the new search is already strong and fast enough on its own?
  2. The new search seems to completely ignore the labels filter. Is this intended to be incorporated at some point and just isn't included yet?
  3. I think it would be really helpful to have a loading spinner or some sort of indication while the search is happening. This is something we probably should already have, but I can see it being needed even more with potential model latency (even though it is pretty fast on my machine currently).
    • I think this would be extra helpful for the initial loading of the model, which may be slow on slower networks
  4. Can you explain what knobs we have to tune the results?
    • One sub-point here. After some quick testing it feels like it sometimes returns more results than it "should". For example searching for "Moon" does give the moon sandcastle but it also gives a bunch of other unrelated sandcastles. We don't have many in the first place but I wonder if it's actually helpful to include these results that are so separate.
  5. We previously discussed potentially using the embeddings to generate extra metadata about sandcastles at build time or as a one-time build step. I forget if that's still being considered, and if it is, is that part of the plan for this PR or a separate effort? (Fine if it's separate, just clarifying.)
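On the "knobs" question in item 4, the usual tuning points for this kind of search are a minimum-similarity cutoff (which would address the unrelated "Moon" results) and a cap on result count. A minimal sketch, with illustrative names and thresholds that would need tuning against real queries:

```javascript
// Illustrative sketch of two common tuning knobs for embedding search:
// drop results below a similarity cutoff, then cap how many are returned.
// Assumes `scored` is already sorted by score, highest first.
function filterResults(scored, { minScore = 0.3, maxResults = 20 } = {}) {
  return scored
    .filter((r) => r.score >= minScore)
    .slice(0, maxResults);
}
```

The right cutoff depends heavily on the model and how the sandcastle text is embedded, so it is best chosen empirically.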

`Loading embedding model: ${this.modelId} (this may take a moment on first load)...`,
);
this.tokenizer = await AutoTokenizer.from_pretrained(this.modelId);
this.model = await AutoModel.from_pretrained(this.modelId);

I assume this downloads the model directly from huggingface? Is there a way to include it locally? Or is that bad practice or limited by the license?
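For reference, transformers.js can be pointed at model files served from the app's own origin instead of the Hugging Face Hub, roughly as below. The paths and model id are illustrative, and whether bundling the weights is appropriate would still depend on the model's license:

```javascript
import { env, AutoTokenizer, AutoModel } from "@huggingface/transformers";

// Serve model files from our own static assets rather than huggingface.co.
// The local directory layout must mirror the Hub repo (config.json,
// tokenizer files, onnx/ weights). Paths and model id are illustrative.
env.allowRemoteModels = false;
env.localModelPath = "/sandcastle/models/";

const tokenizer = await AutoTokenizer.from_pretrained("my-embedding-model");
const model = await AutoModel.from_pretrained("my-embedding-model");
```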


// Perform vector search when search term changes (with debounce)
useEffect(() => {
const pagefind = getPagefind();

I don't think we want to completely remove pagefind, so we still support offline or isolated environments.
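One common way to keep pagefind in the loop is a simple merge that prefers keyword matches and backfills with embedding results. A minimal sketch, with illustrative names; actual ranking strategy is open for discussion:

```javascript
// Illustrative sketch of a hybrid merge: take exact/keyword matches from
// pagefind first, then append embedding results not already present.
function mergeResults(pagefindResults, vectorResults, maxResults = 20) {
  const seen = new Set(pagefindResults.map((r) => r.id));
  const merged = [...pagefindResults];
  for (const r of vectorResults) {
    if (!seen.has(r.id)) {
      seen.add(r.id);
      merged.push(r);
    }
  }
  return merged.slice(0, maxResults);
}
```

This also degrades gracefully: if the model fails to load (offline, blocked network), `vectorResults` is empty and the search falls back to pure pagefind behavior.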

}) => {
// Find the item by matching the legacy_id to the slug
// The legacyIds map should provide legacy_id -> slug mapping
const slug = legacyIds[result.legacy_id];

This is not the right way to use this. The slug should already be the id. The legacy id is only there for loading historical URLs from the previous version of Sandcastle.
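In other words, if the search index stores the gallery item's own id, the lookup reduces to a direct match with no legacy-id indirection. An illustrative sketch:

```javascript
// Illustrative: look up a gallery item directly by the id the search
// index returned, rather than going through the legacyIds map.
function findItem(items, result) {
  return items.find((i) => i.id === result.id);
}
```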

// The legacyIds map should provide legacy_id -> slug mapping
const slug = legacyIds[result.legacy_id];

const item = items.find((item: { id: string; url: string }) => {

I don't really understand this search function. Is this just trying to make up for bad data from the embedding search?
