Adobe should make a boring app for prompt engineers
19.51, Thursday 2 Jun 2022
AI image synthesis has just crossed a reliability and quality threshold that means that it can replace a broad swathe of creative work, like, tomorrow, but what’s interesting is that a new kind of expertise is required, prompt engineering, and that gives a clue to the future shape of the industry.
DALL-E 2 (by OpenAI) transforms a natural language prompt into a matching image – and it’s remarkable. Scroll through the #dalle tag on Twitter to see photorealistic images, graphic design, fake stills from movies that never existed, etc.
The variety is amazing. Watch Alan Resnick’s DALL-E variant test (YouTube, 2 mins) where he starts with a single prompt and simply wanders through adjacent images, making a stop motion animation (background here).
Then Google launched Imagen, which similarly combines “an unprecedented degree of photorealism and a deep level of language understanding.” Check out the examples on their project page.
It’s one thing seeing photorealistic images. It’s another seeing pixelated graphics, or DALL-E doing brand design, producing plausible logos.
What I love about this new technology is that we’re still figuring out how to use it, and we’re figuring it out together.
I’ve been playing with the previous gen of text-to-image AIs as part of the Midjourney beta.
It’s fun! The pics are pretty!
I’ve popped a couple of Midjourney-generated images on my Instagram:
- Lake, trees, and city. Prompt: “London is a forest in the style of David Hockney’s The Splash” – the colour palette is gorgeous, and the brushwork on the trees is something I’ve taken inspiration from for my own sketches.
- Monument Valley. Again the prompt included something about Hockney, then I followed a couple of variations.
(The text you give the AI is called the “prompt.” As you can tell, the secret incantation I like to use is “in the style of David Hockney.”)
The interface is curious! During this beta, there is no app and no website. Instead the sole interface to Midjourney is via Discord.
All users are in the same Discord server, and there are a whole bunch of chat channels. To get an image, you publicly type the command ‘/imagine’ with your prompt. The Midjourney bot replies, then updates its reply continuously as it generates four variations – this process takes a couple of minutes. You can then choose one of the variations to either (a) branch out and create additional variations, or (b) upscale to greater resolution.
Everyone can see what everyone else is doing. All your work is in the open!
It doesn’t feel like social media. With something like Instagram, you’re all in the same world, but the foundational feeling is one of: seeing and being seen.
Instead Midjourney feels more like contributing to Wikipedia, or like a giant scientific experiment. We’re not generating images, we’re discerning the internal nature of the AI by sending in probes (words) and seeing what bounces back.
It’s like X-ray crystallography, our word beams diffracting into output images, there to be decoded.
It’s definitely collaborative. Early on with text-to-image AIs it was discovered that simply appending “high quality” or “4K” to your prompt, whatever it is, will produce better results. And the Midjourney Discord is like this all the time.
Note also Midjourney’s current economics.
High quality AI models are too big to run anywhere but the cloud, and running them takes a lot of compute. Compute is expensive.
So Midjourney gets all of its beta users onto a paid plan as soon as possible. You get a certain quantity of CPU hours for free. Then you get nudged onto a monthly subscription and your usage is metered – the bot will let you know how many CPU hours you’re consuming.
Additionally the images are licensed. If you use the generated images commercially, for a corporation with annual revenue north of $1m, or anything related to the blockchain, you need to pay for a commercial licence.
(Which is fascinating, right? This is what was in my head when I was circling the concept of ownership recently – a camera manufacturer doesn’t get a say in how I use my photos. Microsoft Excel costs the same whether I’m using it to manage my household budget or sell a bank. We believe that Henrietta Lacks’ family today have moral ownership of the cell line derived from a tumour produced (against its will) by Lacks’ body in 1951, but we don’t grant the photographer who set up the camera and the trigger mechanism copyright in the case of the monkey selfie. It would be fascinating for a philosopher to sit down and, from first principles, tease out the detail of all these situations… and also this current situation: a proprietary AI model and the ownership of the prompted output.)
TAKEAWAY: compute costs $$$.
Like I said, half the fun is the community figuring out magic incantations to inspire the AI to do what you want. For instance you’re going to get a certain kind of result if you start your prompt with “A photo of…”.
There was a 24 hour flurry of excitement when it appeared that DALL-E had an internal angelic language. Here’s the pre-print on arXiv: Discovering the Hidden Vocabulary of DALLE-2.
it seems that ‘Apoploe vesrreaitais’ means birds and ‘Contarra ccetnxniams luryca tanniounons’ (sometimes) means bugs or pests.
Swiftly debunked. Which is a shame because there’s a long folkloric history of a secret and perfect language of the birds and it would be hilarious to discover it like this in the Jungian collective unconscious, crystallised out from the distillation of the entire corpus of human text hoovered up from the internet.
ALSO there are other similar AI phenomena, just as mysterious. For instance, adding the words “zoning tapping fiennes” to movie reviews prevents an AI from classifying the review as good or bad.
It’s a process.
What’s happening is a new practice of prompt engineering – or rather let’s call it prompt whispering. Prompt whisperers have a sophisticated mental model of the AI’s dynamic behaviour… or its psychology even?
Look at Astral Codex Ten think through this, while he tries to generate a stained glass window of bearded 17th century astronomer Tycho Brahe accompanied by his pet moose.
It doesn’t work:
I think what’s going on here is - nobody depicts a moose in stained glass. A man scrying the heavens through a telescope is exactly the sort of dignified thing people make stained glass windows about. A moose isn’t. So DALL-E loses confidence and decides you don’t really mean it should be stained glass style.
Also, is it just me, or does Brahe look kind of like Santa here? Is it possible that wise old man + member of the family Cervidae gets into a Santa-shaped attractor region in concept space? I decided to turn the moose into a reindeer to see what happened: …
So the AI is no longer a black box.
It’s a landscape, maybe, with hills and gradients and watersheds. But it has internal structure, and the knowledge that concepts can be near or far or have gravity isn’t sufficient to generate great images, but it informs the prompt whisperer to dowse in the right place.
Not everyone will be good at this. Some people will be naturals! It’s like being good at using Google or researching with a library, a combination of aptitude and eye and practice. A skill.
Also I didn’t know this Brahe fact before:
Sometimes he would let the moose drink beer, and one time it got so drunk that it fell down the stairs and died.
DALL-E/Imagen/etc output is high quality. Photorealistic, sharp, with the appearance of being professionally produced, etc (if that’s what you ask for).
But also, and this is what seems different, the AIs seem to have a reliable level of Do What I Mean.
If you can ask for a particular type of image with intent, get a result in the ballpark and then iterate it – well, one possibility of text-to-image synthesis is that it replaces a decent amount of creative work. Maybe not the super imaginative or high-end or novel or award-winning work, but definitely the category that takes a bunch of time and bedrocks agency P&Ls.
- Why get a model and a set for a product photoshoot when you can dial in the mood you want, then comp your product in later?
- Why grind out a hundred icon variations if you can batch produce them from words?
- Why set a creative off on making a dozen logo options for a brand when you can just feed the brief to an AI?
- Why spend hours finding art for your low circulation industry magazine to give it some colour when you can hand the caption to a bot and create ideal images in whatever size you want?
That’s the threshold that has been crossed.
BUT what this early work shows is that prompt engineering is a skill, just like using Adobe Photoshop or Figma. You can use it in a utilitarian way or creatively but it’s a skill nonetheless. And required.
So the future of the creative sector, at least in the near term,
- IS NOT: creatives get replaced by software, the client right-clicking on an app for themselves to auto-fill an image, like some kind of visual autocomplete, but instead
- IS (or might well be soon): agencies use software to 100x their productivity.
As long as there is skill in using these AI models to produce great results, there is a place for creatives. My take is that the existing industry relationships will remain in place. But the performed work will change. Less time drawing, more time prompt whispering.
What’s the ideal tool for prompt engineering? There isn’t one yet, it’s all too new.
The ideal tool is primarily about workflow and exposing all the various parameters.
It shouldn’t blow any minds, except for what people do with it.
It should be boring as heck because that’s what a workbench is.
I’m imagining an app that allows for managing many concurrent projects, and includes features like:
- Colour palette management
- Fine-tuning of custom styles to apply to any output
- A visual method to explore variations with an infinite stored history, like a tree map crossed with a lightboard
- A prompt builder using stored tokens, with prompts stored for all images so that old projects can be picked up at any time
- Management of queues (image generation is slow) and metered compute
- Automation to crank out large numbers of variations based on spreadsheets of input data, upsized, saved and named appropriately
Bonus points: a user interface with epistemic agents to provide inspiration in exploring concept space.
This isn’t a limited plug-in just for smart in-painting. This would be a full-blown application for expert, rapid use of DALL-E and the like.
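To make the prompt builder and the spreadsheet automation a bit more concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the stored token names, the `build_prompt` and `batch_jobs` helpers, and the CSV columns (`slug`, `subject`) are my assumptions, and it stops short of calling any real image API. It only shows how stored tokens plus rows of input data could turn into a queue of named jobs.

```python
import csv
import io
from string import Template

# Hypothetical stored tokens: reusable prompt fragments saved with a project.
STORED_TOKENS = {
    "hockney": "in the style of David Hockney",
    "quality": "high quality, 4K",
}


def build_prompt(template: str, row: dict, token_names: list) -> str:
    """Fill the template from one spreadsheet row, then append stored tokens."""
    subject = Template(template).substitute(row)
    suffix = [STORED_TOKENS[name] for name in token_names]
    return ", ".join([subject, *suffix])


def batch_jobs(csv_text: str, template: str, token_names: list) -> list:
    """Turn each spreadsheet row into a job with a prompt and an output filename."""
    jobs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        jobs.append(
            {
                # The prompt is stored with the job so the project can be resumed later.
                "prompt": build_prompt(template, row, token_names),
                "filename": f"{row['slug']}.png",
            }
        )
    return jobs


# A tiny inline "spreadsheet" standing in for a real CSV export.
SHEET = (
    "slug,subject\n"
    "london-forest,London is a forest\n"
    "monument-valley,Monument Valley at dusk\n"
)

jobs = batch_jobs(SHEET, "$subject", ["hockney", "quality"])
for job in jobs:
    print(job["filename"], "<-", job["prompt"])
```

Keeping the prompt alongside the filename is the boring part that matters: it is what would let an old project be picked up again with the same incantations.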
The business model might be different?
The economic boundary condition is that compute is remote and expensive, so you’re going to have to keep on feeding the meter. In designing my imaginary app, I would build this in pretty deep: projects should have budgets associated with them.
Given this, the app should be free. Make it really easy to top up your compute balance in-app then charge 5% on top.
The Export menu item would include a drop down for the file format, and a checkbox for a commercial license (also paid).
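Here is a minimal sketch of how project budgets and the 5% top-up fee might hang together. The `Project` class, `top_up_charge`, and the dollar amounts are all hypothetical; the only number taken from the idea above is the 5%. It uses `Decimal` rather than floats because this is money.

```python
from dataclasses import dataclass
from decimal import Decimal

SERVICE_FEE = Decimal("0.05")  # the 5% charged on top of compute top-ups


def top_up_charge(compute_amount: Decimal) -> Decimal:
    """Total billed to the user for a top-up: the compute itself plus the fee."""
    return (compute_amount * (1 + SERVICE_FEE)).quantize(Decimal("0.01"))


@dataclass
class Project:
    """A project with a compute budget attached (hypothetical data model)."""

    name: str
    budget: Decimal
    spent: Decimal = Decimal("0")

    def can_afford(self, cost: Decimal) -> bool:
        return self.spent + cost <= self.budget

    def record(self, cost: Decimal) -> None:
        """Meter one generation against the budget, refusing to overspend."""
        if not self.can_afford(cost):
            raise ValueError(f"{self.name}: budget of {self.budget} would be exceeded")
        self.spent += cost


charge = top_up_charge(Decimal("20.00"))

magazine = Project(name="industry magazine art", budget=Decimal("10.00"))
magazine.record(Decimal("4.00"))
magazine.record(Decimal("4.00"))
remaining = magazine.budget - magazine.spent
print(charge, remaining, magazine.can_afford(Decimal("4.00")))
```

The point of refusing the job up front, rather than billing and apologising afterwards, is that the meter becomes something a project lead can plan around.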
It would fit right into Adobe Creative Suite. It’s the right blend of creativity, production, and workflow.
Maybe Midjourney will build it first? Someone should anyway.
It’s kinda insane how fast this is going. DALL-E’s natural language understanding builds on GPT-3, which was radically better at human language – and dropped only 18 months ago.
At the time I ran across the idea of an AI overhang: what if there are no fundamental technical constraints and AI can get 10x or 100x better in a matter of months?
We’re thoroughly in overhang territory now. If text and images then why not 3D models and characters, then why not maps of virtual environments and VR worlds, then why not auto-generated gameplay. Why not narrative and music and entire movies.
Practically (from that post above):
Intel’s expected 2020 revenue is $73bn. What if they could train a $1bn A.I. to design computer chips that are 100x faster per watt-dollar?
(Still waiting for that.)
We’ll need tools for this kind of work that are as practical and reliable as any office or CAD software.
One thing is: image generation is still slow.
How much storage and compute would you need to run something like Midjourney, DALL-E, Imagen, or whatever locally? How about with real-time generation (sub 150ms feels interactive)? How about 4K 30fps?
I don’t know the answers. I’m trying to get a handle on how many Moore’s law doublings we are away from that kind of experience.
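As a back-of-envelope only, the doublings question is a logarithm: if generation takes T seconds today and you want it in t seconds, you need log2(T / t) doublings. The sketch below plugs in the "couple of minutes" I see on Midjourney as an assumed starting point (120 seconds) and the 150ms target. The two-years-per-doubling figure is also an assumption, and the whole thing ignores better algorithms, the gap between cloud and local hardware, and the extra pixels in a 4K frame.

```python
import math


def doublings_needed(current_seconds: float, target_seconds: float) -> float:
    """Number of 2x speedups needed to get from the current time to the target."""
    return math.log2(current_seconds / target_seconds)


CURRENT_SECONDS = 120.0    # assumption: "a couple of minutes" per generation
INTERACTIVE_SECONDS = 0.150  # sub 150ms feels interactive
FRAME_SECONDS = 1 / 30     # one frame at 30fps, ignoring resolution
YEARS_PER_DOUBLING = 2.0   # assumption: a loose reading of Moore's law

interactive = doublings_needed(CURRENT_SECONDS, INTERACTIVE_SECONDS)
video = doublings_needed(CURRENT_SECONDS, FRAME_SECONDS)

print(f"interactive: {interactive:.1f} doublings, ~{interactive * YEARS_PER_DOUBLING:.0f} years")
print(f"30fps:       {video:.1f} doublings, ~{video * YEARS_PER_DOUBLING:.0f} years")
```

Under those assumptions the interactive case comes out at a little under ten doublings, which is a long wait on hardware alone, so any real answer probably depends more on model and algorithm improvements than on the chips.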
Because the wild kind of prompt whispering would be where it’s not prompt-and-wait but as-you-speak and conversational. Like the difference between ancient mainframes with batch processing and real-time desktop GUIs with the powerful illusion of direct manipulation – imagine the creative power unleashed with the first Macintosh, the personal computer “for the rest of us,” but for interactively driving these inhumanly powerful AIs.
Bicycles for the mind indeed.