Story Image Generation
Image generation via Block Entropy endpoints
One persistent issue is the mismatch between the conversation and its visualization via image generation. This is a challenging problem; however, there are two recent technical advancements, T5 text encoders and IP adapters, that, if utilized in the right way, achieve very good story-to-image adherence.
Two of the latest diffusion models make use of a T5 text encoder: Stable Diffusion 3 and Black Forest Labs' FLUX.1. With these, you are not limited to CLIP's 77 tokens, but can usually go up to 256 or 512 tokens. This gives you room to describe the character in much more detail in each prompt, and the T5 text encoder also produces far more contextual understanding in its encodings. Using a character-specific prompt prefix and one of these image models, you can achieve very good, consistent story images; see Figure 1.
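To make the character-specific prompt prefix concrete, here is a minimal Python sketch. The endpoint URL, model name, character description, and response format are assumptions for illustration only and should be replaced with the values from your Block Entropy account.

```python
import os
import requests

# Hypothetical endpoint and model name -- adjust to your account and the docs.
API_URL = "https://api.blockentropy.ai/v1/images/generations"
API_KEY = os.environ["BE_API_KEY"]

# Character-specific prompt prefix: a detailed description reused verbatim in
# every scene prompt, so the T5 encoder sees the same character each time.
CHARACTER_PREFIX = (
    "Aria, a young woman with short silver hair, green eyes, a scar over her "
    "left eyebrow, wearing a worn leather jacket and a red scarf"
)

def generate_scene(scene_description: str) -> dict:
    """Send one scene prompt (prefix + scene) to the image endpoint."""
    prompt = f"{CHARACTER_PREFIX}, {scene_description}"
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "be-flux-1-dev", "prompt": prompt, "n": 1},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()  # assumed to contain the generated image data

if __name__ == "__main__":
    generate_scene("standing on a rain-soaked rooftop at night, city lights below")
```

Because the prefix is repeated verbatim across scenes, only the scene description changes from image to image, which is what keeps the character consistent.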
For customized user images and styles, these can be uploaded to our website under the "Avatars" tab; Figure 4 outlines this process. Note that you must adhere to the grammar "be-face:name" or "be-style:name" when creating these avatars. The name should be unique and a single word. When using the avatar, you must use the exact form <be-face:name:weight>, where name is what you named the avatar and weight is how much influence you want the IP adapter to have. A weight of 0.1-0.4 is normally sufficient; higher weights will closely follow the reference image but will not follow the text well.
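As a small illustration of this grammar, the sketch below builds such a tag and prepends it to a scene prompt. The avatar name "aria" and the prompt text are made up for the example.

```python
def avatar_tag(kind: str, name: str, weight: float) -> str:
    """Build a <be-face:name:weight> or <be-style:name:weight> tag.

    The weight is clamped to the 0.1-0.4 range that usually works well.
    """
    assert kind in ("be-face", "be-style")
    weight = min(max(weight, 0.1), 0.4)
    return f"<{kind}:{name}:{weight}>"

# "aria" is a hypothetical avatar name created under the Avatars tab.
prompt = avatar_tag("be-face", "aria", 0.3) + " standing on a rooftop at night, rain"
```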
Here, I am using the new FLUX.1 model from Black Forest Labs. The same kind of adherence is also possible with the Stable Diffusion 3 model. Both of these are open-source image models and can be run on your local machine. You can also access these models as shown in Figure 2.
IP adapters are a clever method of using the transformer architecture and cross-attention to inject a specific style into the generation process. They have been integrated into the older generation of diffusion models, including SDXL and SD 1.5 (and their LoRA variants), and do remarkably well. In order to utilize IP adapters, you need to condition the generation with an existing image, which is difficult to do with most UIs, including ST. To simplify the process, I created special tags that can be included in the prompt; these are parsed by the API interpreter, which injects user-created images/templates/textual inversions into the image endpoint. If you would like to run it yourself, here is the . Here is how to utilize IP adapters with the special grammar; see Figure 3.
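For readers who want to see what this image conditioning looks like outside of the tag grammar, below is a minimal sketch using the Hugging Face diffusers library with SDXL rather than the Block Entropy endpoint. The model IDs and weight file names follow the commonly published IP-Adapter repository layout and should be checked against the current diffusers documentation.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

# Load SDXL and attach an IP adapter; these repo/weight names are the commonly
# published ones and may need adjusting.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)

# A low scale (roughly the 0.1-0.4 weight discussed above) keeps the text prompt in control.
pipe.set_ip_adapter_scale(0.3)

# The reference image conditions the generation, analogous to a be-face/be-style avatar.
face = load_image("reference_face.png")
image = pipe(
    prompt="standing on a rain-soaked rooftop at night, city lights below",
    ip_adapter_image=face,
    num_inference_steps=30,
).images[0]
image.save("scene.png")
```

The special tags described above do essentially the same conditioning on the server side, so you do not have to manage the reference image or the adapter scale yourself.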