Story Image Generation

Image generation via Block Entropy endpoints

One of the issues with visualizing a story through image generation is keeping the images consistent with the conversation. This is a challenging problem; however, two recent technical advancements, T5 encoders and IP adapters, achieve very good story-to-image adherence when used in the right way.

Method 1, T5 encoders

Two of the latest diffusion models utilize T5 encoders: Stable Diffusion 3 and Black Forest Labs' Flux.1. Here you are not limited to the 77 tokens of CLIP, but can usually go to 256 or 512 tokens. This gives you the ability to describe the character in more detail in each prompt, and in addition the T5 text encoder produces much more contextual understanding in the encodings. Using the "character-specific prompt prefix" and one of these image models, you can achieve very good, consistent story images; see Figure 1.

In the examples here, I am using the new Flux.1 Schnell model from Black Forest Labs; the same kind of adherence is also possible with Stable Diffusion 3 medium. Both are open-source image models and can be run on your local machine. You can also access these models through our API, as shown in Figure 2.
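To make the API access concrete, here is a minimal sketch in Python. The endpoint URL, model id, and response fields are assumptions for illustration (styled after OpenAI-compatible image APIs); substitute the values from your Block Entropy account.

```python
import requests

# Hypothetical endpoint and model id -- substitute the values from your account.
API_URL = "https://api.blockentropy.ai/v1/images/generations"
API_KEY = "YOUR_BLOCK_ENTROPY_KEY"

# T5-based models (Flux.1, SD3) accept far longer prompts than CLIP's 77 tokens,
# so the character-specific prefix can be quite detailed.
character_prefix = (
    "Albert Einstein, elderly physicist, wild white hair, thick mustache, "
    "wearing a grey wool suit, chalkboard with equations behind him"
)
scene = "sitting in a sunlit study, explaining an idea with his hands raised"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "be-flux-schnell",  # hypothetical model id
        "prompt": f"{character_prefix}, {scene}",
        "n": 1,
        "size": "1024x1024",
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["data"][0]["url"])
```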

Method 2, IP Adapters

IP adapters are a clever method of using the transformer architecture and cross-attention to inject a specific face or style into the generation process. They have been integrated into the older generation of diffusion models, including SDXL and SD 1.5 (and their LoRA variants), and do remarkably well. To utilize IP adapters, you need to condition the generation with an existing image, which is difficult to do with most UIs, including Silly Tavern. To simplify the process, I created special tags that can be included in the prompt; these are parsed by the API interpreter, which injects user-created images/templates/textual inversions into the image endpoint. If you would like to run it yourself, the code is available. For how to utilize IP adapters with this special grammar, see Figure 3.
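The tag grammar is simple enough that a rough sketch of the parsing idea fits in a few lines. This is purely illustrative; the regex, function name, and weight handling are assumptions, not the actual interpreter code.

```python
import re

# Illustrative parser for <be-face:name:weight> / <be-style:name:weight> tags.
# A sketch of the idea only, not the actual API interpreter.
TAG_PATTERN = re.compile(r"<(be-face|be-style):([A-Za-z0-9_-]+):([0-9]*\.?[0-9]+)>")

def extract_adapter_tags(prompt: str):
    """Split a prompt into its plain text and any IP-adapter conditioning tags."""
    tags = [
        {"kind": kind, "name": name, "weight": float(weight)}
        for kind, name, weight in TAG_PATTERN.findall(prompt)
    ]
    clean_prompt = TAG_PATTERN.sub("", prompt).strip()
    return clean_prompt, tags

text, tags = extract_adapter_tags("portrait in a cafe <be-face:einstein:0.3>")
print(text)  # -> "portrait in a cafe"
print(tags)  # -> [{'kind': 'be-face', 'name': 'einstein', 'weight': 0.3}]
```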

For customized user faces and styles, images can be uploaded to our website under the "Avatars" tab; Figure 4 outlines this process. Note that you must adhere to the grammar "be-face:name" or "be-style:name" when creating these avatars, and the name should be unique and a single word. When using an avatar, you must use the exact form <be-face:name:weight>, where name is what you have named it and weight is how much influence you want the IP adapter to have. A weight of 0.1-0.4 is normally good enough; higher weights will follow the reference image closely but will not follow the text well.
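Putting the grammar together, a story prompt conditioned on both a saved face and a saved style might look like the sketch below. The avatar names einstein and noir are hypothetical placeholders for whatever you named your uploads.

```python
# Hypothetical avatar names; use the names you registered under the Avatars tab.
face_tag = "<be-face:einstein:0.3>"   # low weight keeps text adherence
style_tag = "<be-style:noir:0.2>"

prompt = (
    "elderly physicist explaining relativity at a chalkboard, warm lighting "
    f"{face_tag} {style_tag}"
)
# Weights of 0.1-0.4 usually balance face/style influence against the text;
# higher weights follow the reference image but ignore the prompt.
```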

Figure 1. Conversation with Albert Einstein. The T5 text encoder, with text prompts only, can give consistent results.
Figure 2. Using our API, you can easily access image models: (1) choose the Block Entropy Image configuration; (2) select a model in the model field (models prefixed with "be-anim" are animation endpoints); (3) enter the character-specific prompt prefix; (4) place /imagine last; this simplified last-message prompt does a good job of summarizing the story as an image.
Figure 3. Adding IP adapters via special grammars to the image generation process. From left to right: (left) a prompt-only prefix without T5 encoders has poor consistency; (2nd left) using the grammar <be-face:name:weight>, you can inject a face to condition the generation; (2nd right) using <be-style:name:weight>, you can inject a style; (right) using both together achieves both face and style consistency.
Figure 4. (1) Adding Block Entropy as an API, (2) generating keys, (3, 4) adding customized avatars to your account. Please note the strict naming grammar.