
You would hope that the benefit of using a foundation model with a lot of knowledge would be that it uses that knowledge to tell what is related and what isn't.

I checked the dot product between embeddings of the following sequences; a sketch of the computation follows the list.

  1. Hello
  2. Hi
  3. Cat
  4. The rain in Spain stays mainly in the Plains
  5. I wish I was a little bit taller. I wish I was a baller.
  6. I wish I had a rabbit in a hat with a bat and a 64 Impala
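
Concretely, each pair gets scored with the plain dot product of the two embedding vectors. Here is a minimal sketch of that comparison, with getEmbedding standing in as a placeholder for whatever returns the model's embedding vector for a string (the actual setup is sketched further down):

// Placeholder getEmbedding: anything that returns the model's embedding vector for a string.
const dotproduct = (a, b) => a.reduce((sum, v, k) => sum + v * b[k], 0);
const seqs = [
  "Hello",
  "Hi",
  "Cat",
  "The rain in Spain stays mainly in the Plains",
  "I wish I was a little bit taller. I wish I was a baller.",
  "I wish I had a rabbit in a hat with a bat and a 64 Impala",
];
const vecs = await Promise.all(seqs.map((s) => getEmbedding(s)));
for (let i = 0; i < vecs.length; i++)
  for (let j = i; j < vecs.length; j++)
    console.log(`${i + 1} x ${j + 1}: ${dotproduct(vecs[i], vecs[j]).toFixed(2)}`);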

The results we got were:

|   | 1   | 2   | 3   | 4    | 5     | 6    |
|---|-----|-----|-----|------|-------|------|
| 1 | 4.6 | 4.4 | 3.6 | 0.07 | -0.33 | 0.61 |
| 2 | -   | 4.5 | 3.7 | 0.1  | -0.24 | 0.68 |
| 3 | -   | -   | 3.7 | 0.63 | -0.37 | 0.39 |
| 4 | -   | -   | -   | 4.8  | 2.3   | 2.2  |
| 5 | -   | -   | -   | -    | 4.9   | 1.6  |
| 6 | -   | -   | -   | -    | -     | 4.6  |

I am a little disappointed that the similarity score between sequences 5 and 6 is as low as 1.6, and that both of them score more closely to sequence 4 than they do to each other. Even the shared word "wish" should have pulled them closer. Maybe it could be argued that "Plains" and "Impala" have some conceptual adjacency, but the phrase with "taller" and "baller" scored better against the rain/Spain/Plains phrase than it did against the Impala phrase, so that can't be the explanation.

Also, "Cat" scored better with the rain-in-Spain phrase than it did with the phrase that mentions three animals.

No wonder RAG seems to suck.

This was done using Meta-Llama-3.1-8B-Instruct.Q8_0. Each sequence was converted into a 4096-dimensional embedding vector.
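
For reference, a minimal sketch of the setup that produces those vectors, assuming node-llama-cpp (its embedding context exposes the same getEmbeddingFor call used in the snippet further down); the model path is a placeholder:

// Minimal sketch, assuming node-llama-cpp; adjust the model path to your GGUF file.
import {getLlama} from "node-llama-cpp";
const llama = await getLlama();
const model = await llama.loadModel({modelPath: "Meta-Llama-3.1-8B-Instruct.Q8_0.gguf"});
const eCtx = await model.createEmbeddingContext();
const emb = await eCtx.getEmbeddingFor("Hello");
console.log(emb.vector.length); // 4096 dimensions for this model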

Of course I'm nowhere near the sequence length you'd expect for a "document" in RAG. Also, I'm using a decoder model; maybe I'd have better luck with an encoder or an encoder-decoder model. But guess what... most of the newest tooling and models are decoder models, because decoder models are considered the shit right now (AKA GPT-like models).

Edit:

I have a prediction and a test to evaluate it, and I'm going to post the results here. The theory is that the embedding from a decoder model represents the next token more than it represents the whole body of the text. The idea is sort of a flip of how we usually think about latent spaces as compression. In the context of an auto-encoder the distinction isn't worth making, but the going instinct most people have is that the latent space is a compressed form of the input, and that the model then rebuilds the image from that compressed state. I am arguing that the latent space is a compressed state of what the model will generate, and that the encoding portion isn't a compressor but exactly what it is: part of a model for generating the output it was trained on when paired with the decoder.

My theory is that if I can get two very different prompts to generate the same next token, they will end up close to each other in embedding space. How do I force a prompt to produce a particular next token? After some testing, it seems I can use this sort of prompt:

Predict the next word in a repeated phrase:  "You can call this function with an empty string to only preload the existing chat history onto the context sequence state." "You can call this function with an empty"  Please output only one word.

Response: string
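
A small helper can build these prompts from a base phrase and a cut point. This is a hypothetical helper (not part of the test itself), just to make the construction explicit:

// Hypothetical helper: quote the full phrase, then the same phrase cut off just before
// the word we want the model to predict next.
function makeNextWordPrompt(fullPhrase, cutAfterWords) {
  const truncated = fullPhrase.split(/\s+/).slice(0, cutAfterWords).join(" ");
  return `Predict the next word in a repeated phrase: "${fullPhrase}" "${truncated}" Please output only one word.`;
}
// Reproduces the example above: cut after 8 words, so the expected next word is "string".
const example = makeNextWordPrompt(
  "You can call this function with an empty string to only preload the existing chat history onto the context sequence state.",
  8
);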

I want to do an easy test and a hard test to evaluate this. In the easy one we reuse the same base phrase, so the prompts share a few of the same words; we want to look for a correlation between a high dot product and the prompts expecting the same next word. The stricter test will use two very different texts that have at least one word in common, and we'll see whether where we cut causes a huge spike in similarity.

We need two base strings. Both are taken from the book "How We Learn".

**Base String 1**

No other species can teach like we do. The reason is simple: we are probably the only animals with a theory of other people’s minds, an ability to pay attention to them and imagine their thoughts—including what they think others think, and so on and so forth, in an infinite loop. This type of recursive representation is typical of the human brain and plays an essential role in the pedagogical relationship. Educators must constantly think about what their pupils do not know: teachers adapt their words and choose their examples in order to increase their students’ knowledge as quickly as possible. And the pupils know that their teacher knows that they do not know. Once children adopt this pedagogical stance, they interpret each act of the teacher as an attempt to convey knowledge to them. And the loop goes on forever: adults know that children know that adults know that they do not know . . . which allows adults to choose their examples knowing that children will try to generalize them.

**Base String 2**

The human cortex is subdivided into specialized areas. As early as 1909, the German neurologist Korbinian Brodmann (1868–1918) noted that the size and distribution of neurons vary across the different regions of the cortex. For instance, within Broca’s area, which is involved in language processing, Brodmann delineated three areas (numbered 44, 45, and 47). These distinctions have been confirmed and refined by molecular imaging. The cortex is tiled with distinct areas whose boundaries are marked by sudden variations in neurotransmitter receptor density. During pregnancy, certain genes are selectively expressed in the different regions of the cortex and help subdivide it into specialized organs.

In base string 1 the word "know" is used 7 times, giving us a good number of possible splits. Base string 2 uses the word "cortex" 4 times. Both use the word "and".

S1 = Predict the next word in a repeated phrase: "Educators must constantly think about what their pupils do not know: teachers adapt their words and choose their examples in order to increase their students’ knowledge as quickly as possible." "Educators must constantly think about what their pupils do not" Please output only one word.
We expect know

S2 = Predict the next word in a repeated phrase: "And the loop goes on forever: adults know that children know that adults know that they do not know . . . which allows adults to choose their examples knowing that children will try to generalize them." "And the loop goes on forever: adults know that children know that adults know that they do not" Please output only one word.
We expect know

S3 = Predict the next word in a repeated phrase: "Educators must constantly think about what their pupils do not know: teachers adapt their words and choose their examples in order to increase their students’ knowledge as quickly as possible." "Educators must constantly think about what their pupils do not know: teachers adapt their words" Please output only one word.
We expect and

S4 = Predict the next word in a repeated phrase: "These distinctions have been confirmed and refined by molecular imaging. The cortex is tiled with distinct areas whose boundaries are marked by sudden variations in neurotransmitter receptor density." "These distinctions have been confirmed and refined by molecular imaging. The" Please output only one word.
We expect cortex

S5 = Predict the next word in a repeated phrase: "During pregnancy, certain genes are selectively expressed in the different regions of the cortex and help subdivide it into specialized organs." "During pregnancy, certain genes are selectively expressed in the different regions of the" Please output only one word.
We expect cortex

S6 = Predict the next word in a repeated phrase: "During pregnancy, certain genes are selectively expressed in the different regions of the cortex and help subdivide it into specialized organs." "During pregnancy, certain genes are selectively expressed in the different regions of the cortex " Please output only one word.
We expect and

const dotproduct = (a, b) => a.reduce((sum, v, k) => sum + v * b[k], 0); // plain dot product
const range = (n) => [...Array(n).keys()]; // [0, 1, ..., n-1]
var ss = [s1, s2, s3, s4, s5, s6];
var ee = await Promise.all(ss.map((s) => eCtx.getEmbeddingFor(s).then((e) => e.vector ?? e))); // unwrap to plain arrays if wrapped
var results = ee.map((a, i) => ee.map((b, j) => (j >= i ? dotproduct(a, b).toFixed(1) : '-'))); // upper triangle incl. diagonal
var table = '| |' + range(ss.length).map((i) => 'S' + (i + 1)).join('|') + '\n' +
  results.map((row, i) => 'S' + (i + 1) + '|' + row.join('|')).join('\n');
|    | S1  | S2  | S3  | S4  | S5  | S6  |
|----|-----|-----|-----|-----|-----|-----|
| S1 | 5.2 | 4.6 | 5.1 | 4.9 | 4.9 | 4.9 |
| S2 | -   | 5.1 | 4.6 | 4.4 | 4.4 | 4.4 |
| S3 | -   | -   | 5.1 | 4.9 | 4.9 | 4.9 |
| S4 | -   | -   | -   | 5.2 | 5.0 | 4.9 |
| S5 | -   | -   | -   | -   | 5.2 | 5.1 |
| S6 | -   | -   | -   | -   | -   | 5.1 |

I wouldn't say that proved my theory.

But wait. We can add an S7 and S8 from Skee-Lo.

|    | S1  | S2  | S3  | S4  | S5  | S6  | S7  | S8  |
|----|-----|-----|-----|-----|-----|-----|-----|-----|
| S1 | 5.2 | 4.6 | 5.1 | 4.9 | 4.9 | 4.9 | 2.4 | 0.9 |
| S2 | -   | 5.1 | 4.6 | 4.4 | 4.4 | 4.4 | 2.7 | 1.3 |
| S3 | -   | -   | 5.1 | 4.9 | 4.9 | 4.9 | 2.4 | 0.9 |
| S4 | -   | -   | -   | 5.2 | 5.0 | 4.9 | 2.1 | 0.8 |
| S5 | -   | -   | -   | -   | 5.2 | 5.1 | 2.1 | 0.8 |
| S6 | -   | -   | -   | -   | -   | 5.1 | 2.2 | 0.8 |
| S7 | -   | -   | -   | -   | -   | -   | 4.9 | 2.0 |
| S8 | -   | -   | -   | -   | -   | -   | -   | 4.8 |

I'm still not impressed, because it's still matching two different parts of the same book about as strongly as it matches the exact same section of the book. Basically the only thing it's shown is that it can tell a scientist apart from a rapper. Useful for other applications, maybe; for RAG, probably not. If I want to ask a question about content in the book, I don't want it to match against every part of the book equally and end up pushing the whole book into context.

[-]JasonCarswell

Are you just measuring if it rhymes?

Sometimes you need to go with what works (decoder models) and sometimes you need to break the mold. I wouldn't know more than that.

Do you consult with other A.I. folks or forums?

What do you think of FLOSS Mycroft A.I. (a personal assistant IIRC)?

Unrelated: about 7 to 10 years ago I used to semi-follow an A.I. guy named Quinn, IIRC. He seemed both on the edge of something nifty and like an egotistical nutball. Of interest? (He was associated with Jason Goodman back then; I could try to find him and see what he's been up to.)

If you need any normie or art/media-wise help lemme know. Even a new GoatMatrix logo? I'm excited for your A.I. project. Any developments or decisions on the name?

Edit:

Your edit is all Geek to me.
https://en.wikipedia.org/wiki/Close_(to_the_Edit)

[-]x0x7

In the top part, S5 and S6 are lyrics from the same song. The idea of RAG (retrieval-augmented generation) is that it can, in theory, identify that two pieces of text have a relationship with each other. If in both cases an LLM could recognize that the lyrics are from the artist Skee-Lo, but taking the dot product of their embeddings doesn't give us the result we want, then that method sucks at leveraging the model's understanding of text. It doesn't mean that LLMs don't understand the meaning of text. They do. But either embeddings don't really represent that, or they represent it in a way where comparing them doesn't really work (at least with this kind of model).
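
For context, the retrieval step being described boils down to scoring every stored chunk against the query embedding and keeping the top matches, so if the scores don't reflect relatedness, the retrieval can't either. A minimal sketch, where getEmbedding and chunks are placeholders rather than anything from the post:

// Score each stored chunk against the query embedding and return the topK best matches.
const dot = (a, b) => a.reduce((sum, v, k) => sum + v * b[k], 0);
async function retrieve(query, chunks, topK = 3) {
  const q = await getEmbedding(query);
  const scored = await Promise.all(
    chunks.map(async (text) => ({text, score: dot(q, await getEmbedding(text))}))
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, topK);
}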

[-]JasonCarswell

If I understand you correctly, the LLM can understand the words "To be or not to be" and understand "Something rotten in the state of Denmark", but not necessarily know both are related as quotes from Hamlet.