The A1111 webUI and extension ecosystem is a beautiful example of what FOSS can be.
- Users can install with a few clicks, with no need to touch the command line
- The webUI has hover tooltips on everything, so users can mostly figure out what's going on without ever needing to touch documentation
- A1111 has a tab which can load a list of extensions from a github page
- Click to install, then refresh the UI and the extension just works
- Users are getting incredibly powerful new extensions every week or two - Deforum lets you sequentially generate as many frames as you want and stitch them into a video, ControlNet lets you copy-paste features from a source image to your target image(s). ControlNet was added to A1111 ~2 weeks ago and is already integrated into the Deforum tab so you can use both together.
Truly beautiful. I'd love to see more FOSS projects that feel this user-friendly, generous with features, and fast-moving. Really fun to play with new cutting-edge tech every couple of weeks.
This is really interesting. Another thing I noticed in my fun with SD is that it is extremely stubborn about colors during the denoising process.
That is, whatever color a region of the image has during denoising step 3, it will almost surely have that color at step 50, even if it makes no logical sense for the thing in that location to have that color.
This may not seem bad, but it's annoying when doing anything image-to-image, because regardless of the prompt you give it, the colors are "sticky".
If you have an image of an apple, and you use image-to-image with the prompt "an image of an orange", you will get a very reddish orange (in my experience at least).
You can use this to your advantage too though. I've manually generated "noise" for img2img by taking a base image, hitting it with a bunch of filters, then putting it back into img2img with a low-ish denoising strength. This method works quite well for ensuring composition while still letting the image be styled, but it's probably obsolete for that purpose now that we have controlnet.
I've heard that putting a fuzzy multicolored noise overlay (Perlin in RGB space or something) over your img2img input can help. Not sure what scale or what opacity you need, but that's something you can try.
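For anyone who wants to try that overlay trick, here's a rough sketch in Python with Pillow. It uses blurred RGB noise as a cheap stand-in for Perlin noise, and the opacity, blur radius, and file names are just placeholder guesses you'd have to tune:

    # Overlay fuzzy color noise on an image before feeding it to img2img.
    # Blurred uniform RGB noise stands in for Perlin noise here.
    import numpy as np
    from PIL import Image, ImageFilter

    def add_color_noise(path_in, path_out, opacity=0.25, blur_radius=8):
        base = Image.open(path_in).convert("RGB")
        # Random RGB noise at the same size as the input image.
        noise = np.random.randint(0, 256, (base.height, base.width, 3), dtype=np.uint8)
        noise_img = Image.fromarray(noise).filter(ImageFilter.GaussianBlur(blur_radius))
        # Blend the fuzzy noise over the base image, then save for img2img.
        Image.blend(base, noise_img, opacity).save(path_out)

    add_color_noise("apple.png", "apple_noised.png")  # placeholder file names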
This could possibly be fixed by generating training image sequences which, along with random noise pixels, also have a random hue shift applied to subregions of the image.
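If someone wanted to experiment with that, a minimal sketch of the augmentation (a random hue shift applied to a random rectangular subregion; the region sizes and shift range are arbitrary) might look like this:

    # Hue-shift a random rectangular subregion of an RGB training image.
    import random
    import numpy as np
    from PIL import Image

    def random_region_hue_shift(img):
        w, h = img.size
        # Pick a random box covering roughly 10-50% of each dimension.
        bw, bh = random.randint(w // 10, w // 2), random.randint(h // 10, h // 2)
        x0, y0 = random.randint(0, w - bw), random.randint(0, h - bh)
        box = (x0, y0, x0 + bw, y0 + bh)

        hsv = np.array(img.crop(box).convert("HSV"))
        # PIL stores hue as 0-255; rotate it by a random amount.
        hsv[..., 0] = (hsv[..., 0].astype(int) + random.randint(0, 255)) % 256
        shifted = Image.fromarray(hsv, mode="HSV").convert("RGB")

        out = img.copy()
        out.paste(shifted, box)
        return out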
As a ckpt that's cool, but I'd like to see it as a LoRA so you can use it with any checkpoint you already have. That would (let's face it... will next week) be amazing.
It's a way to cut just the components/styles/themes/patterns out of a model and apply them to other models.
So if I have a Disney characters checkpoint, but I really like this MakeGiantEyes checkpoint, if I can get it down to a MakeGiantEyes LoRA, I can apply that on top of my Disney Characters model which is already a custom trained set. It definitely does not always work, but when it does it's like magic. At a practical level, it's a model-modifier.
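For example, with the diffusers library (recent versions expose a load_lora_weights helper; the checkpoint and LoRA paths below are hypothetical placeholders), layering a LoRA over a custom checkpoint looks roughly like this:

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the custom base checkpoint first.
    pipe = StableDiffusionPipeline.from_pretrained(
        "path/to/disney-characters-checkpoint",  # hypothetical custom model
        torch_dtype=torch.float16,
    ).to("cuda")

    # Then layer the LoRA weights on top of the already-loaded model.
    pipe.load_lora_weights("path/to/make-giant-eyes-lora")  # hypothetical LoRA

    image = pipe("a knight with giant eyes, disney style").images[0]
    image.save("knight.png")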
... It took me a minute to get those because I had to sort through a LOT of things that would probably get me banned here. If anyone wanted to know which nationalities are using SD most... It's Asians, hands down, all day long, and I think that's interesting.
EDIT: And if you were wondering what a Textual Inversion is vs a LoRA... Don't ask me! They're both model modifiers, but as I understand it, textual inversions are good for faces (which is why most of those are people, and they are kilobytes in size), and LoRAs aren't as good for faces specifically but better for themes.
TI examples (I couldn't use any of the million women... there are almost none that would be appropriate to post. Even though civitai does a good job of removing the NSFW posts of real people, even with clothes on some are just still too much... Thirst is driving AI now)
https://civitai.com/models/11039/ian-mckellen
or
https://civitai.com/models/8060/seu-madruga
I'm curious if this generalizes to mid frequencies (i.e. adding some blurred noise in addition to the offset) and what effect that might have on the generations.
Hi Jack, I've been fascinated with your work on Colormind and Fontjoy and tried to reach out to you on your @colormind email to ask about your API.
Is there a better way to reach you?
Exactly my thought too. The offset is just the zero-frequency component. But in general, the need to do this for the zero frequency would suggest that there's a scaling problem for all the other long-wavelength Fourier components too? And that, perhaps, the effective spatial Fourier spectrum of the noise used in SD is not optimal?
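If I remember the post right, the offset trick is just adding a per-image, per-channel constant to the training noise. A sketch of that, plus the "mid frequency" variant being speculated about here (blurred noise mixed in; the strengths and kernel size are arbitrary guesses), might look like:

    import torch
    import torch.nn.functional as F

    def offset_noise(latents, offset_strength=0.1, mid_strength=0.0, blur_kernel=9):
        noise = torch.randn_like(latents)
        # Zero-frequency component: a per-image, per-channel constant offset.
        noise = noise + offset_strength * torch.randn(
            latents.shape[0], latents.shape[1], 1, 1, device=latents.device
        )
        if mid_strength > 0:
            # Low-pass some fresh noise with an average-pool blur to boost
            # the longer-wavelength components, then mix it in.
            blurred = F.avg_pool2d(
                torch.randn_like(latents), blur_kernel, stride=1, padding=blur_kernel // 2
            )
            noise = noise + mid_strength * blurred
        return noise

    noise = offset_noise(torch.randn(1, 4, 64, 64), mid_strength=0.1)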
Does anyone know how they are taking wavelengths from an image, or what exactly "long-wavelength features" means?
I googled "wavelength of images", but it doesn't seem like I'm going in the right direction, because the results are about finding the wavelength of light from images rather than the "wavelength of features" that this blog is talking about.
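For what it's worth, "long wavelength" here seems to mean low spatial frequency: structure that varies slowly across the image (overall brightness, big soft gradients) rather than fine detail. One way to see it is to take a 2D FFT and keep only the low frequencies (the file name and cutoff below are arbitrary):

    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("photo.png").convert("L"), dtype=float)

    # After fftshift, the center of the spectrum holds the lowest spatial
    # frequencies, i.e. the longest-wavelength features of the image.
    spectrum = np.fft.fftshift(np.fft.fft2(img))

    # Keep only a small central block (low frequencies) and invert.
    h, w = img.shape
    cy, cx = h // 2, w // 2
    radius = 8  # arbitrary cutoff: smaller keeps only the longest wavelengths
    mask = np.zeros_like(spectrum)
    mask[cy - radius:cy + radius, cx - radius:cx + radius] = 1

    low_freq = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    Image.fromarray(np.clip(low_freq, 0, 255).astype(np.uint8)).save("long_wavelength_only.png")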
Diffusion is so interesting. Unlike LLMs, which have some parallels to how the human mind works, it's not as obvious that working backwards from noise toward a prompt has any similar parallel.
Will this cause us to hit walls at some point or actually exceed what a human can create?
We know that the biological brain does a lot of iterative refinement via recurrent processing (attractor dynamics), which is very similar to how diffusion works.
However, while prediction is also a core function of the brain, it's not really the case that you always auto-regressively generate word by word.
Is the noise function used just for the starting data provided to the first iteration of denoising, or does it get called repeatedly throughout the iterations?
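A simplified DDPM-style loop (not the actual sampler code A1111 uses) shows where noise can enter: once at the start as the initial latent, and, for stochastic "ancestral" samplers, again at every step; deterministic samplers like DDIM with eta=0 skip the per-step draw:

    import torch

    def sample(model, steps=50, shape=(1, 4, 64, 64), stochastic=True):
        # Toy linear beta schedule, not the one SD actually uses.
        betas = torch.linspace(1e-4, 0.02, steps)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        x = torch.randn(shape)  # noise drawn once, as the starting latent
        for t in reversed(range(steps)):
            eps = model(x, t)  # the network's prediction of the noise in x
            # Deterministic part of the update: remove the predicted noise.
            x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
            if stochastic and t > 0:
                # Stochastic samplers also draw *fresh* noise at each step.
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x

    # Toy usage with a dummy "model" that predicts zero noise.
    out = sample(lambda x, t: torch.zeros_like(x), steps=10)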
Just last week I saw ControlNet[1], which adds a lot more control.
Today I saw what Corridor Crew[2] did to stabilize the randomness when you want to make videos. Very exciting.
[1] https://github.com/lllyasviel/ControlNet
[2] https://www.youtube.com/watch?v=_9LX9HSQkWo