
> Is there a way to have current AI tools maintain consistency when generating multiple images of a specific creature or object?

...but, none of these can maintain consistency.

All they can do is generate the same 'concept'. For example, 'pictures of Batman' will always produce images that are recognizably Batman.

However, good luck generating comic panels; there is nothing (that I'm aware of) that will let you maintain consistency across images: every panel will have a subtly different Batman, with a different background, different props, different lighting, etc.

The image-to-image (and depth-to-image) pipelines will let you generate structurally consistent outputs (e.g. here is a bed, here is a building), but the results will still be completely distinct in detail, and lack consistency.

This is why all animations made with this tech have that 'hand-drawn jitter' to them: it's basically not possible (currently) to say, "an image of Batman in a new pose, but matching this previous frame".

So... to the OP's question:

Recognizable outputs? Sure; you've long been able to generate 'a picture of a dog'.

New outputs? Yes! You can now train for something like 'a picture of Renata Glasc the Chem-Baroness'.

Consistency across outputs? No, not really. Not at all.



From my experience playing around with DreamBooth over the last few weeks, generating images of a specific person or pet (not just a generic concept) works surprisingly well. But you have to feed it enough pictures, label the images properly, use a smaller learning rate, use prior-preservation loss, make sure not to overfit, etc.
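Those knobs (instance labels, learning rate, prior-preservation loss) map onto flags of the Hugging Face diffusers `train_dreambooth.py` example script. A sketch of such an invocation, assembled as an argument list; the flag names match that script as I know it, but versions change, and the paths and prompts here are placeholders:

```python
# Hypothetical DreamBooth invocation for the diffusers example script.
# Paths, prompts, and hyperparameter values are illustrative, not prescriptive.
args = [
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
    "--instance_data_dir", "./photos_of_betty",   # enough well-labelled photos
    "--instance_prompt", "a photo of sks dog",    # rare token for the subject
    "--class_data_dir", "./generic_dogs",
    "--class_prompt", "a photo of a dog",
    "--with_prior_preservation",                  # prior-preservation loss
    "--prior_loss_weight", "1.0",
    "--learning_rate", "2e-6",                    # small LR to avoid overfitting
    "--max_train_steps", "800",
]
print(" ".join(args))
```

The rare `sks`-style token plus a generic class prompt is what lets the model learn the specific subject without forgetting the broader concept.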

For the animation stuff where you need frame-to-frame consistency, the new diffusion-based video models show that it's possible [1][2]. These are not open source yet as far as I know, but it's highly likely that we'll get them within a few months.

1: https://arxiv.org/pdf/2212.11565.pdf

2: https://imagen.research.google/video/paper.pdf


> generating images of a specific person or pet (not just a generic concept)

There's no difference between those things: each is just a specific label that conditions the diffusion model. It doesn't matter if your label is 'dog' or 'betty' (i.e. my own dog). Anyway...

> it's highly likely that we'll get them within a few months.

Yep! It's certainly not a fundamental limitation of the technology; but the OP asked:

> Is there a way to have current AI tools ...

...and right now you can't do it with the current AI tools that are publicly available.


I think you can get the effect you're looking for by using the previous panel as an init image and only repainting (inpainting) the character.
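A minimal sketch of the compositing idea behind "only repainting the character": keep every pixel of the previous panel outside the character mask, and take the fresh generation inside it. Real inpainting pipelines do this blending in latent space during denoising, but the masking logic is the same; the names and the tiny integer "images" below are made up for illustration.

```python
def repaint_masked(prev_panel, new_render, mask):
    """Keep prev_panel where mask == 0; take new_render where mask == 1."""
    return [
        [new if m else old for old, new, m in zip(row_old, row_new, row_m)]
        for row_old, row_new, row_m in zip(prev_panel, new_render, mask)
    ]

# 2x3 grayscale "panels": background stays, only the masked column changes.
prev_panel = [[10, 10, 10],
              [10, 10, 10]]
new_render = [[99, 99, 99],
              [99, 99, 99]]
mask       = [[0, 1, 0],     # 1 = character region to repaint
              [0, 1, 0]]

result = repaint_masked(prev_panel, new_render, mask)
print(result)  # → [[10, 99, 10], [10, 99, 10]]
```

This keeps backgrounds and props pixel-identical between panels; the remaining problem is making the repainted character itself consistent, which is where DreamBooth-style training comes in.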

As for consistency of character details, I think that will depend on how many images you use to train DreamBooth, etc., and how varied those images are. [1]

[1]: https://www.youtube.com/watch?v=W4Mcuh38wyM


Consistency isn't really that difficult with more or less static images. I haven't tried to do "same outfit, many poses" yet, because I don't really know what poses are called, and there's no guarantee that the humans who trained/tagged the input images knew either. I've been messing around with "batch img2img" and I sort of like the jank; I'm wondering if more aggressive CLIP guidance would help at all, but I think it boils down to this: there really isn't enough detailed tagging to make it worth messing with too much.

What I mean is: assuming this technology moves forward, GPUs continue gaining VRAM as they have, and enough people are interested in doing extremely detailed tagging of small shapes, the sorts of issues you're talking about will go away over time. Alternatively, someone or a group could develop a way to scan hundreds of outputs and collate them according to similarity, allowing a human to pick batches that are similar enough for something like short comics. As it stands, when I do txt2img or img2img I will run off 20-40 images.

I'm also wondering how much seed fiddling could be done. When I first got "Anything v3.0", every image was some person sitting at a dining table near a window with food in front of them, dozens in a row. I have no idea how it happened, but there was enough global cohesion between images that I thought it was trained on just that for the first hour or so.
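The "collate outputs by similarity" idea can be sketched with a tiny average-hash: each image gets a bit-per-pixel fingerprint, and images whose fingerprints differ in only a few bits are batched together. The 4x4 integer "images", function names, and distance threshold below are all invented for illustration; a real version would use perceptual hashes or embedding distances on actual renders.

```python
def average_hash(img):
    """1 bit per pixel: is the pixel brighter than the image's mean?"""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def group_by_similarity(images, max_dist=2):
    """Greedily assign each image to the first group whose hash is close."""
    groups = []  # (representative_hash, member_indices) pairs
    for i, img in enumerate(images):
        h = average_hash(img)
        for rep, members in groups:
            if hamming(rep, h) <= max_dist:
                members.append(i)
                break
        else:
            groups.append((h, [i]))
    return [members for _, members in groups]

# Two renders sharing a left-bright layout, one with a top-bright layout.
left_bright = [[255, 255, 0, 0]] * 4
top_bright  = [[255] * 4, [255] * 4, [0] * 4, [0] * 4]
left_noisy  = [[255, 255, 255, 0]] + [[255, 255, 0, 0]] * 3  # one pixel off

groups = group_by_similarity([left_bright, top_bright, left_noisy])
print(groups)  # → [[0, 2], [1]]  (the two left-bright renders batch together)
```

Run over a few hundred outputs, the surviving groups are exactly the "similar enough" batches a human could then assemble into strips.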

Each of the images below is a set of 4 images (I think generally called a grid in SD), so each is a set of four "2-panel comic strips". They aren't really intended to flow between the grid squares, but you'll notice that the clothing, hairstyles, etc. match between strips, even if they don't match between individual images. My personal favorite, and the one I used for something online, is the top-left set in the first .png: https://i.imgur.com/BWek3YI.png https://i.imgur.com/LHchsj5.png

P.S. If anyone knows what the source art could possibly be, let me know?




