Story Image Generation
Image generation via Block Entropy endpoints
One persistent issue is the mismatch between the conversation and its visualization via image generation. This is a challenging problem; however, there are two recent technical advancements, T5 text encoders and IP adapters, that, if utilized in the right way, achieve very good story-to-image adherence.
Two of the latest diffusion models make use of a T5 text encoder: Stable Diffusion 3 and Black Forest Labs' FLUX.1. With these, you are not limited to CLIP's 77 tokens, but can usually go up to 256 or 512 tokens. This gives you room to describe the character in much more detail in each prompt, and the T5 text encoder also produces far more contextual understanding in its encodings. Using a character-specific prompt prefix and one of these image models, you can achieve very good, consistent story images; see Figure 1.
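To make the character-specific prompt prefix concrete, here is a minimal Python sketch. The endpoint URL, model name, character description, and response format are assumptions for illustration only and should be replaced with the values from your Block Entropy account.

```python
import os
import requests

# Hypothetical endpoint and model name -- adjust to your account and the docs.
API_URL = "https://api.blockentropy.ai/v1/images/generations"
API_KEY = os.environ["BE_API_KEY"]

# Character-specific prompt prefix: a detailed description reused verbatim in
# every scene prompt, so the T5 encoder sees the same character each time.
CHARACTER_PREFIX = (
    "Aria, a young woman with short silver hair, green eyes, a scar over her "
    "left eyebrow, wearing a worn leather jacket and a red scarf"
)

def generate_scene(scene_description: str) -> dict:
    """Send one scene prompt (prefix + scene) to the image endpoint."""
    prompt = f"{CHARACTER_PREFIX}, {scene_description}"
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "be-flux-1-dev", "prompt": prompt, "n": 1},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()  # assumed to contain the generated image data

if __name__ == "__main__":
    generate_scene("standing on a rain-soaked rooftop at night, city lights below")
```

Because the prefix is repeated verbatim across scenes, only the scene description changes from image to image, which is what keeps the character consistent.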
For customized user images and styles, these can be uploaded to our website under the "Avatars" tab; Figure 4 outlines this process. Note that you must adhere to the grammar "be-face:name" or "be-style:name" when creating these avatars. The name should be unique and a single word. When using the avatar, you must use the exact form <be-face:name:weight>, where name is what you named the avatar and weight is how much influence you want the IP adapter to have. A weight of 0.1-0.4 is normally sufficient; higher weights will closely follow the reference image but will not follow the text well.
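As a small illustration of this grammar, the sketch below builds such a tag and prepends it to a scene prompt. The avatar name "aria" and the prompt text are made up for the example.

```python
def avatar_tag(kind: str, name: str, weight: float) -> str:
    """Build a <be-face:name:weight> or <be-style:name:weight> tag.

    The weight is clamped to the 0.1-0.4 range that usually works well.
    """
    assert kind in ("be-face", "be-style")
    weight = min(max(weight, 0.1), 0.4)
    return f"<{kind}:{name}:{weight}>"

# "aria" is a hypothetical avatar name created under the Avatars tab.
prompt = avatar_tag("be-face", "aria", 0.3) + " standing on a rooftop at night, rain"
```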
Here, I am using the new FLUX.1 model from Black Forest Labs. The same kind of adherence is also possible with the Stable Diffusion 3 model. Both of these are open-source image models and can be run on your local machine. You can also access these models as shown in Figure 2.
IP adapters are a clever method of using the transformer architecture and cross-attention to inject a specific style into the generation process. They have been integrated into the older generation of diffusion models, including SDXL and SD 1.5 (and their LoRA variants), and do remarkably well. In order to utilize IP adapters, you need to condition the generation with an existing image, which is difficult to do with most UIs, including ST. To simplify the process, I created special tags that can be included in the prompt; these are parsed by the API interpreter, which injects user-created images/templates/textual inversions into the image endpoint. If you would like to run it yourself, here is the . Here is how to utilize IP adapters with the special grammar; see Figure 3.
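For readers who want to see what this image conditioning looks like outside of the tag grammar, below is a minimal sketch using the Hugging Face diffusers library with SDXL rather than the Block Entropy endpoint. The model IDs and weight file names follow the commonly published IP-Adapter repository layout and should be checked against the current diffusers documentation.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

# Load SDXL and attach an IP adapter; these repo/weight names are the commonly
# published ones and may need adjusting.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)

# A low scale (roughly the 0.1-0.4 weight discussed above) keeps the text prompt in control.
pipe.set_ip_adapter_scale(0.3)

# The reference image conditions the generation, analogous to a be-face/be-style avatar.
face = load_image("reference_face.png")
image = pipe(
    prompt="standing on a rain-soaked rooftop at night, city lights below",
    ip_adapter_image=face,
    num_inference_steps=30,
).images[0]
image.save("scene.png")
```

The special tags described above do essentially the same conditioning on the server side, so you do not have to manage the reference image or the adapter scale yourself.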