I've done some searching on the internet and didn't find what I was looking for. I'm asking this in the open so anyone can answer or join the conversation.
You have surpassed me in knowledge of the image generation side of AI.
Is there any sort of way to do RAG with images as the RAG material? Everything I've found online about this seems to target the prompting phase, so it ends up being all about text.
Now, if modern image generation were as simple as a classic VAE or any other direct auto-encoder, you could do a RAG img2img: encode the input image into latent space, find a similar image in a set, take a weighted sum of the two images' latents, and decode forward from there.
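Roughly what I'm picturing, as a toy sketch. Everything here is hypothetical: `vae` is any model exposing `encode()`/`decode()`, and `latent_library` is a stack of pre-encoded latents from the retrieval set.

```python
# Toy sketch of RAG img2img over a plain auto-encoder. All names are
# hypothetical: `vae` is any model with encode()/decode() methods, and
# `latent_library` is an (n, d) tensor of pre-encoded latents.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rag_img2img(vae, image, latent_library, alpha=0.5):
    z = vae.encode(image)                          # (1, d) query latent
    sims = F.cosine_similarity(z, latent_library)  # similarity to each stored latent
    z_match = latent_library[sims.argmax()].unsqueeze(0)
    z_mix = alpha * z + (1 - alpha) * z_match      # the "partial sum"
    return vae.decode(z_mix)
```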
This is where my lack of working knowledge and intuition fails me. I'm kind of assuming that because these models aren't doing direct auto-encoding and are instead doing diffusion (predicting noise/error), adding the two latents might not be productive: the decoder side would effectively be estimating the error of both images at once (kind of), so you might really be subtracting the target image, or the predicted error of the target image.
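One partial answer I've pieced together: latent diffusion models (Stable Diffusion and friends) still have a plain auto-encoder bolted on, and the diffusion happens inside its latent space, so the blend itself should at least be well-defined. Untested sketch against diffusers' `AutoencoderKL`; the image tensors are assumed to be `(1, 3, H, W)` scaled to `[-1, 1]`:

```python
# Untested sketch: the VAE half of a latent-diffusion model is still a
# direct auto-encoder, so blending its latents decodes to *something*.
# `img_a`/`img_b` are assumed to be (1, 3, H, W) tensors in [-1, 1].
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

@torch.no_grad()
def blend_latents(img_a, img_b, alpha=0.5):
    za = vae.encode(img_a).latent_dist.mean
    zb = vae.encode(img_b).latent_dist.mean
    return vae.decode(alpha * za + (1 - alpha) * zb).sample
```

Whether running the denoiser over that blend helps or wrecks it is exactly the part I can't reason about.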
Basically, I think it would be cool if, instead of training a LoRA, you could use something like RAG to make a composite image (img2img) from an input image and the most similar image in some set.
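For the "most similar image in some set" part, I'd guess raw VAE latents aren't a great similarity signal, so something like CLIP could handle the retrieval half. Sketch below; the checkpoint name is a real one, but `paths` and `input.png` are placeholders:

```python
# Retrieval half: find the nearest image in a set via CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images):
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize

paths = ["set/a.png", "set/b.png"]                    # hypothetical retrieval set
library = embed([Image.open(p) for p in paths])       # precompute once
query = embed([Image.open("input.png")])
best_match = paths[(query @ library.T).argmax().item()]  # cosine similarity
```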
But I know you've played with ComfyUI, so I figure you might already know something about piping multiple inputs in a graph toward an eventual output.
We both overestimate my intelligence, and you should @ me so I know I ought to weigh in. This is a good question and I am thinking hard on it. I will be back a few times to give my 2 cents.
A direction I think would be valuable is using RAG to auto-select the best model based on the input image, the text input, or both. That could cut down on a lot of the pick-and-choose-and-hope, especially when doing large render batches or maybe even video2video/gif2gif. Rough sketch of what I mean below.
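Something like: tag each checkpoint with a few reference renders, embed those once, and route each new input to whichever checkpoint's references it lands closest to. All names here are made up:

```python
# Hypothetical router: pick the checkpoint whose reference renders are most
# similar to the query. `query` is a (1, d) normalized embedding (e.g. from
# CLIP, as in the retrieval sketch above); each value in `model_refs` is an
# (n, d) tensor of reference embeddings for one checkpoint.
import torch

def pick_model(query: torch.Tensor, model_refs: dict[str, torch.Tensor]) -> str:
    scores = {name: (query @ refs.T).max().item()
              for name, refs in model_refs.items()}
    return max(scores, key=scores.get)

# e.g. pick_model(query, {"photoreal.safetensors": refs_a,
#                         "anime.safetensors": refs_b})
```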
Yep, I meant to @. Forgot at the end.