Announcing Animagine XL 3.0
Generation Parameters
{
  "prompt": "1girl, arima kana, oshi no ko, solo, idol, idol clothes, one eye closed, red shirt, black skirt, black headwear, gloves, stage light, singing, open mouth, crowd, smile, pointing at viewer, masterpiece, best quality",
  "negative_prompt": "nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name, ",
  "resolution": "1216 x 832",
  "guidance_scale": 7,
  "num_inference_steps": 28,
  "seed": 1495338659,
  "sampler": "Euler a",
  "sdxl_style": "(None)",
  "quality_tags": "Standard",
  "use_lora": null,
  "use_upscaler": {
    "upscale_method": "nearest-exact",
    "upscaler_strength": 0.55,
    "upscale_by": 1.5,
    "new_resolution": "1824 x 1248"
  }
}

Two months ago, we announced Animagine XL 2.0. Today, we are happy to introduce Animagine XL 3.0, the next iteration of our opinionated open-source anime text-to-image model based on Stable Diffusion XL. Building on the last iteration, V3 has been developed and improved with the goal of becoming the best open anime image generation model.

Compared to the last iteration, it has broader knowledge and better concept and prompt understanding. It can also generate much better hand anatomy.

Fine-tuned on top of Animagine XL 2.0

After some trial and error, we realized that Animagine XL 2.0 could become the base model for the pretraining process of Animagine XL 3.0. Not only was the model built on top of SDXL—the world’s best open image generation model so far—but Animagine XL 2.0 had already learned the anime concept better than the stock version, which made it easy and efficient to do continual training.

To train Animagine XL 3.0, we used only 2x A100 80GB GPUs on Runpod, with half of the credits granted by a generous friend and the rest coming out of our own pocket. The model was trained for 21 days in December, roughly 500 GPU hours in total. We used a slightly modified kohya-ss/sd-scripts as our training script, adding a keep_tokens_separator option so that the leading labels in a caption are kept in place while the rest are shuffled.
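For illustration, here is a minimal Python sketch (our own code, not the actual sd-scripts implementation; the shuffle_caption name and the "|||" separator string are assumptions) of what such a separator does during caption shuffling:

import random

def shuffle_caption(caption: str, separator: str = "|||") -> str:
    # Tags before the separator keep their order; the rest are shuffled.
    head, _, tail = caption.partition(separator)
    kept = [t.strip() for t in head.split(",") if t.strip()]
    rest = [t.strip() for t in tail.split(",") if t.strip()]
    random.shuffle(rest)
    return ", ".join(kept + rest)

print(shuffle_caption("1girl, arima kana, oshi no ko ||| idol, red shirt, smile"))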

However, the training configuration of Animagine XL 3.0 may differ slightly from that of Animagine XL 2.0, and we hope users can adapt to that.

Tag Ordering

Last year, NovelAI announced the third iteration of their anime text-to-image model, NovelAI Diffusion V3. They stated that NAID V3 was trained with a specific tag ordering, meaning that prompt order is crucial for getting what you want. Thankfully, they shared their findings in their docs.

Based on that small piece of information, we tried to reproduce tag ordering by building our datasets and training them similarly to NovelAI Diffusion V3. So far, we are pleased with the results. Thus, we recommend using this prompt template when inferring with our V3 model.

1boy/1girl, what character, from which series, everything else in random order*

*) everything else in random order can be anything, from general tags to quality tags.
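As a toy illustration (our own sketch, not part of the release; the build_prompt helper is hypothetical), a prompt following this template could be assembled like so:

import random

def build_prompt(subject, character, series, other_tags):
    # Fixed-order head: subject, then character, then series...
    tags = list(other_tags)
    random.shuffle(tags)  # ...then everything else in random order
    return ", ".join([subject, character, series] + tags)

print(build_prompt("1girl", "arima kana", "oshi no ko",
                   ["idol", "red shirt", "smile", "masterpiece", "best quality"]))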

(Image: prompting guide)

Generation Parameters
{
  "prompt": "1girl, c.c., code geass, white shirt, long sleeves, turtleneck, sitting, looking at viewer, eating, pizza, plate, fork, knife, table, chair, table, restaurant, cinematic angle, cinematic lighting, masterpiece, best quality",
  "negative_prompt": "nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name, ",
  "resolution": "896 x 1152",
  "guidance_scale": 7,
  "num_inference_steps": 28,
  "seed": 0,
  "sampler": "Euler a",
  "sdxl_style": "(None)",
  "quality_tags": "Standard",
  "use_lora": null,
  "use_upscaler": null
}

Better Hands

We believe that V3 can generate better hands compared to the last iteration. We have been testing some hand gesture tags such as waving, double v, v, pointing at viewer, hands up, rabbit pose, and shushing, and so far, we are satisfied with the results.

(Images: hand gesture samples (waving, v, index finger raised) and character hands (kurumi, azusa, arima kana))

Simpler Prompt, Better Knowledge

Another reason that led us to train this model was the uncontrolled and inefficient proliferation of LoRAs. Say we need 2,800 LoRAs for 2,800 characters; assuming an average size of 50 MB each (8 dim, 8 alpha, 8 conv dim, 4 conv alpha), we would need around 140 GB of storage just to keep them. On top of that, we would have to load all the LoRAs, validate that none of them are broken, open the network tab, select the right LoRAs, and adjust each adapter's weight before generating.

Make LoRA Great Again!

We need to make LoRA more effective and efficient by training a better base model, so that we only need to train LoRAs for concepts and art styles the model lacks. With this model, you can generate many well-known characters from a prompt alone. For most characters, you don't even need to describe their appearance to get what you want. If you type in Hoshimachi Suisei, you will get Hoshimachi Suisei; if you type in Arima Kana, you will get Arima Kana. It's that simple!

(Image: character list)

Lower CFG Scale, Fewer Sampling Steps

Based on our findings, we recommend a lower classifier-free guidance (CFG) scale of around 5-7, fewer than 30 sampling steps, and Euler Ancestral (Euler a) as the sampler. This setup optimizes performance without compromising the quality of the results.

(Image: CFG scale comparison)

Generation Parameters
{
  "prompt": "A boy and a girl, Emiya Shirou and Artoria Pendragon from fate series, having their breakfast in the dining room. Emiya Shirou wears white t-shirt and jacket. Artoria Pendragon wears white dress with blue neck ribbon. Rice, soup, and minced meats are served on the table. They look at each other while smiling happily, masterpiece, best quality",
  "negative_prompt": "nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name, ",
  "resolution": "896 x 1152",
  "guidance_scale": 1~12,
  "num_inference_steps": 28,
  "seed": 0,
  "sampler": "Euler a",
  "sdxl_style": "(None)",
  "quality_tags": "Standard",
  "use_lora": null,
  "use_upscaler": null
}

Uncontrollable

After the good things, let's cover the bad things. Users may encounter more NSFW results when using masterpiece, best quality, because many high-scored images in the dataset are NSFW. It's better to add nsfw, rating: sensitive to the negative prompt and rating: general to the positive prompt.
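For instance (our own illustration, not an official snippet), the rating tags would be placed like this:

# Illustrative placement of the rating tags described above.
prompt = "1girl, arima kana, oshi no ko, rating: general, masterpiece, best quality"
negative_prompt = "nsfw, rating: sensitive, lowres, worst quality, low quality"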

We also realized, only after training had finished, that something was wrong with the training script: it has a distributed data parallel problem. Because we trained on multiple GPUs, gradients were not synchronized across devices, which means the model might be undertrained, since it may have received only partial updates.

Special Tags

Like the previous iteration, this model was trained with special tags (quality tags and year tags) to steer the results. The model can still do the job without these special tags, but using them is recommended to make the model easier to handle.

Quality Tags

To make it easier for users to migrate from SD 1.5 to SDXL, we chose to keep the quality tags the same. We measure quality tags based on dataset scores. Here’s the list, from best to worst:

  • masterpiece
  • best quality
  • high quality
  • normal quality
  • low quality
  • worst quality
(Images: without quality tags (bad) vs. with quality tags (good))
Generation Parameters
{
  "prompt": "1girl, sakurauchi riko, \(love live\), queen hat, noble coat, red coat, noble shirt, sitting, crossed legs, gentle smile, parted lips, throne, cinematic angle, masterpiece, best quality",
  "negative_prompt": "nsfw, lowres, worst quality, low quality",
  "resolution": "896 x 1152",
  "guidance_scale": 7,
  "num_inference_steps": 28,
  "seed": 0,
  "sampler": "Euler a",
  "sdxl_style": "(None)",
  "quality_tags": "(None)",
  "use_lora": null,
  "use_upscaler": null
}

Year Tags

We’re also introducing year tags, but unlike NovelAI Diffusion V3, we train on ranges of post years instead of single post years. They are meant to act as another kind of quality tag, steering the results toward more or less modern anime art styles. Year tags are not very strong, but they should work if you specifically want, say, a 2014-era art style. Here’s the list, from newest to oldest:

  • newest
  • late
  • mid
  • early
  • oldest
(Images: "{prompt}, oldest, early, mid" vs. "{prompt}, newest, late")
Generation Parameters
{
  "prompt": "1girl, karyl \\(princess connect!\\), princess connect!, solo, animal ears, tsundere, panic, waving, indoors, depth of field, looking at viewer, masterpiece, best quality",
  "negative_prompt": "nsfw, lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name, ",
  "resolution": "832 x 1216",
  "guidance_scale": 7,
  "num_inference_steps": 28,
  "seed": 1154518567,
  "sampler": "Euler a",
  "sdxl_style": "(None)",
  "quality_tags": "Standard",
  "use_lora": null,
  "use_upscaler": null
}

Get Started With Animagine XL 3.0

There are several ways to get started with this model.
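One of them is to load the model with the diffusers library. The sketch below is a minimal, non-authoritative example; it assumes the weights are published on Hugging Face under the repository id cagliostrolab/animagine-xl-3.0 (check the official model card for the canonical id):

import torch
from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

# Load the SDXL pipeline in half precision (assumed repository id).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-3.0",
    torch_dtype=torch.float16,
).to("cuda")
# "Euler a" corresponds to the Euler Ancestral scheduler in diffusers.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

# Prompt template: subject, character, series, then everything else.
image = pipe(
    prompt="1girl, arima kana, oshi no ko, solo, idol, smile, masterpiece, best quality",
    negative_prompt="nsfw, lowres, bad anatomy, bad hands, worst quality, low quality",
    width=832,
    height=1216,
    guidance_scale=7,        # CFG in the recommended 5-7 range
    num_inference_steps=28,  # below 30 sampling steps
).images[0]
image.save("arima_kana.png")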

About the New License

Animagine XL 3.0 now uses the Fair AI Public License 1.0-SD, compatible with Stable Diffusion models. Key points:

  1. Modification Sharing: If you modify Animagine XL 3.0, you must share both your changes and the original license.
  2. Source Code Accessibility: If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
  3. Distribution Terms: Any distribution must be under this license or another with similar rules.
  4. Compliance: Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

The choice of this license aims to keep Animagine XL 3.0 open and modifiable, aligning with the spirit of the open-source community. It protects contributors and users, encouraging a collaborative, ethical ecosystem. This ensures the model not only benefits from communal input but also respects open-source development freedoms.

Author: Cagliostro Research Lab
Published: 2024-01-10