That looks ambitious and awesome! I haven’t looked deeply, but a few quick qs
what do the costs look like to get embeddings for all those docs? How are you making choices about which embedding models to use and things like that?
do you have a qualitative (or quantitative?) sense of how well the semantic joins work for queries like the examples on the homepage?
what’s your sense of how this compares to tools like elicit?
Embeddings are about $20 per billion tokens with voyage-4-lite ($0.02/M), and I’ve spent around $500. The model seemed strong on all the properties of a good embedding model at a viable price point, and Voyage 4 embeddings have an interesting property: voyage-4-lite is compatible with voyage-4-nano (calculable locally), voyage-4, and voyage-4-large, for whenever I feel like upgrading. My chunking strategy is semantically aware (e.g. working to split on sentences, paragraphs, and common delimiters), with a target of 164 tokens per chunk and about 20% overlap.
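To make the chunking strategy concrete, here’s a minimal sketch of that kind of semantically-aware chunker: split on paragraphs and sentence boundaries, pack units up to a 164-token target, and carry roughly the last 20% forward as overlap. The whitespace token count is a stand-in for the real tokenizer, and the splitting regexes are illustrative, not the actual implementation.

```python
import re

TARGET_TOKENS = 164  # target chunk size described above
OVERLAP = 0.20       # ~20% overlap between consecutive chunks

def rough_token_count(text: str) -> int:
    # Crude whitespace proxy for a real tokenizer (assumption for this sketch).
    return len(text.split())

def split_units(text: str) -> list[str]:
    # Prefer semantic boundaries: paragraph breaks first, then sentence ends.
    units = []
    for para in re.split(r"\n\s*\n", text):
        units.extend(s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s)
    return units

def chunk(text: str) -> list[str]:
    units = split_units(text)
    chunks, current, count = [], [], 0
    for unit in units:
        n = rough_token_count(unit)
        if current and count + n > TARGET_TOKENS:
            chunks.append(" ".join(current))
            # Carry roughly the trailing 20% of tokens into the next chunk.
            keep, kept = [], 0
            for prev in reversed(current):
                kept += rough_token_count(prev)
                keep.insert(0, prev)
                if kept >= TARGET_TOKENS * OVERLAP:
                    break
            current, count = keep, kept
        current.append(unit)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries before packing means a chunk never cuts a sentence in half, and the overlap keeps context that straddles a chunk boundary retrievable from either side.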
Searching across corpora absolutely works, since embeddings are just a function of text/tokens. The compositionality of embeddings works amazingly well too (e.g. debias_vector(@guilt_axis, guilt_topic) searches for guilty vibes without overindexing on text that merely mentions “guilt”), although there are absolutely footguns, and intuition needs to be built for it (which I try to distill in prompts for agents; I also have a prompt designed to help teach the exploration of embedding space).
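The standard way to get this kind of debiasing is to subtract the query vector’s projection onto the bias axis, leaving a vector orthogonal to it. I don’t know Scry’s exact implementation, so `debias_vector` below is a sketch of that projection-removal idea, not the real function:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def debias_vector(axis, query):
    # Remove the component of `query` lying along `axis`, so search keeps
    # the topical direction without overweighting the literal bias term.
    axis = normalize(axis)
    proj = dot(query, axis)
    return [q - proj * a for q, a in zip(query, axis)]
```

The output is exactly orthogonal to the axis, which is also the source of the footguns: if the topic vector mostly *is* the axis, debiasing can leave very little signal behind.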
This is basically a canonical research substrate: a well-indexed, large corpus of high-leverage data with ML embeddings, queryable with SQL (Datalog would be better, but agents have less experience with it and implementations don’t have great support for embeddings). It really would be nice to get funding for this, adopt a more abundance-oriented mindset to improve shipping velocity, and get this substrate in front of more researchers (e.g. Coefficient Giving) to help with triage grantmaking in the singularity.
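To show the shape of “embeddings queryable with SQL,” here’s a toy sketch using SQLite with a user-defined cosine-similarity function. A production system would presumably use a vector-native extension (pgvector, sqlite-vec, etc.), and the 3-d vectors stand in for real model embeddings; the point is just that similarity search composes with ordinary SQL:

```python
import json, math, sqlite3

def cosine(a_json, b_json):
    # Embeddings are stored as JSON text; decode and compute cosine similarity.
    a, b = json.loads(a_json), json.loads(b_json)
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

db = sqlite3.connect(":memory:")
db.create_function("cosine", 2, cosine)
db.execute("CREATE TABLE chunks (doc TEXT, body TEXT, embedding TEXT)")
# Toy 3-d embeddings standing in for real model output (assumption).
rows = [
    ("a.txt", "guilt and remorse", [0.9, 0.1, 0.0]),
    ("b.txt", "shipping logistics", [0.0, 0.2, 0.9]),
]
db.executemany("INSERT INTO chunks VALUES (?, ?, ?)",
               [(d, b, json.dumps(e)) for d, b, e in rows])
query = json.dumps([1.0, 0.0, 0.0])
top = db.execute(
    "SELECT doc, cosine(embedding, ?) AS score FROM chunks ORDER BY score DESC",
    (query,),
).fetchall()
```

Because similarity is just a scalar column, it can be joined, filtered, and aggregated like anything else in SQL, which is what makes the substrate composable.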
As for comparing to Elicit: Scry offers users powers they couldn’t get Elicit to deliver without basically implementing the same thing themselves, but Elicit of course has beautiful UIs that are friendlier to the human eye, with workflows researchers are more familiar with. Elicit should basically provide this functionality to its users, and Scry could afford to offer novel UIs, but I tend to be much more comfortable iterating on backend and API functionality than on UIs (which I do have taste for, but they take a lot of time).