Artificial Intelligence (AI) is a way to train computers to carry out a range of complex tasks and to learn from these tasks. It uses a process called machine learning to identify patterns in images, texts, and other materials. This process can involve human input or be fully automated. Generative Artificial Intelligence (genAI) is a specific kind of AI that is designed to generate new text, images, video, or audio based on the content that it has studied.
In order for the computer to generate new material, it first needs to study a large collection of existing material, identify the patterns in it, and then use those patterns to create new combinations. Let’s take a closer look at this process for text, images, and sound.
Tools like ChatGPT and Microsoft Copilot are GPTs, or generative pre-trained transformers.
They’re generative in that they’re designed to generate new text, which is awesome if you’re trying to write a new country music song about burritos or a new plot hook for a novel, but not so great if you’re trying to quote and cite published scholarly articles.
These tools are pre-trained. By the time you ask an AI tool to generate something for you, it’s already done all of its homework. It has studied large amounts of text authored by humans in order to understand the relationships between individual words. For example, it can see that the words “plots,” “corn,” and “experiment” tend to appear in paragraphs about the Morrow Plots, our experimental corn field, but these algorithms also know that “corn” can appear in a chowder recipe.
The computer’s understanding of words and their relationships is called a language model. These models are trained on vast collections of text containing billions of words, so we refer to them as large language models, or LLMs.
Finally, these tools are transformers, which means that they take what they know about combinations of words to create new combinations that sound plausible. They use an algorithm called a transformer to generate new text one word at a time, adding an element of randomness to mimic human creativity.
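To make that idea concrete, here is a minimal sketch in Python of the generate-one-word-at-a-time loop, using a toy word-pair model built from a couple of made-up sentences. Real GPTs use neural networks trained on vastly more text, so this only illustrates the basic idea, not how ChatGPT actually works under the hood.

```python
import random
from collections import defaultdict

# A toy "language model": count which words tend to follow which other
# words in a tiny training text. Real LLMs learn far richer relationships,
# but the generate-one-word-at-a-time idea is the same.
training_text = (
    "the corn in the experimental plots grew tall and "
    "the corn in the chowder recipe tasted sweet"
)

model = defaultdict(list)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    model[current_word].append(next_word)

def generate(start_word, length=8):
    """Generate text one word at a time, picking randomly among
    the words that followed the current word during training."""
    output = [start_word]
    for _ in range(length):
        candidates = model.get(output[-1])
        if not candidates:                        # no known continuation: stop
            break
        output.append(random.choice(candidates))  # randomness mimics "creativity"
    return " ".join(output)

print(generate("the"))
# e.g. "the corn in the chowder recipe tasted sweet"
```

Because the next word is chosen at random from the words that sounded plausible during training, running the sketch twice can produce two different sentences, which is the same reason a chatbot rarely answers the same question the same way twice.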
Instead of looking for relationships between words, computers “see” an image by looking at pixels, the basic building blocks of digital images. The computer will compare one pixel to the ones that surround it, paying close attention to colors, outlines, and texture. During the training process, the computer learns how to identify parts of an image that are important (feature recognition) and then categorize them (classification). Tools like Adobe Acrobat and ABBYY FineReader use optical character recognition to identify the shapes of letters and create transcriptions of text.
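As a rough illustration of what “comparing one pixel to its neighbors” looks like in practice, here is a small sketch (assuming NumPy) that finds the outline of a bright square in a tiny made-up image. Real feature recognition learns its filters from training data rather than using a hand-written comparison like this one.

```python
import numpy as np

# A tiny 6x6 grayscale "image": a bright square on a dark background.
image = np.zeros((6, 6))
image[2:5, 2:5] = 1.0

# Compare each pixel to its right-hand neighbor. Large differences mark
# an outline (an "edge"), one of the basic features a vision model learns
# to pick out before it classifies the objects in a picture.
edges = np.abs(np.diff(image, axis=1))

print(edges)
# Columns where the values jump between 0 and 1 show up as 1s:
# those are the left and right edges of the square.
```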
If you have ever had to prove you’re not a robot by clicking all the squares with a fire hydrant or by typing the letters that appear in a picture, you’ve helped a computer identify features and classify them. We refer to human intervention in the training process as “supervised learning,” but computers can also engage in “unsupervised learning” by checking their work without having to ask a human.
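Here is a small sketch of the difference between the two approaches, using made-up data and the scikit-learn library: the supervised model learns from labels a human supplied, while the unsupervised one groups the same points with no labels at all.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Tiny made-up data: each point is [height, weight] of an animal.
points = [[8, 4], [9, 5], [30, 60], [32, 65]]

# Supervised learning: a human supplies the answers ("cat" or "dog"),
# much like clicking the fire-hydrant squares supplies labels.
labels = ["cat", "cat", "dog", "dog"]
classifier = KNeighborsClassifier(n_neighbors=1).fit(points, labels)
print(classifier.predict([[10, 6]]))   # -> ['cat']

# Unsupervised learning: no labels at all; the computer groups
# similar points on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(clusters)                        # e.g. [0 0 1 1]
```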
As part of the training process, the computer will add random pixels to an image, then test whether it can still recognize the objects in the picture. This digital noise helps the computer create a mathematical representation of what it sees as the essence of an object. For example, it might associate the phrase “tabby cat” with a formula for an object with a round shape, two pointy ears, and patterns in black, brown, and white. When generating images, the computer uses that mathematical representation, plus what it learned when it added the noise, to create an outline and then add details to the picture, spreading them throughout the image in a process known as diffusion (the technique behind tools such as Stable Diffusion).
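The sketch below (assuming NumPy) shows only the “add random pixels” half of that process on a tiny made-up image; a real diffusion model is then trained to run these steps in reverse, starting from pure noise and gradually recovering a clean picture.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny grayscale "image" standing in for our tabby cat: values 0 to 1.
image = rng.random((4, 4))

# Forward diffusion: repeatedly mix random noise into the pixels.
# A generative model learns to undo each of these steps in reverse order.
noisy = image.copy()
for step in range(5):
    noise = rng.normal(0, 0.2, size=image.shape)
    noisy = np.clip(noisy + noise, 0, 1)
    print(f"step {step}: average change from original = "
          f"{np.abs(noisy - image).mean():.2f}")
```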
Computers use this process to generate 2D images as well as 3D objects. Tools like Polycam can take photos of one object from many different angles and stitch them together using a technique called photogrammetry.
Surprisingly, AI uses image recognition techniques to identify and recreate sounds. The two aspects of sound that are most important for generative AI are frequency (Is this a high-pitched screech or a low growl?) and amplitude (Is this a whisper or a shout?). The computer visualizes these two aspects of sound as a spectrogram: time runs along the x-axis, frequencies run along the y-axis, and a sliding scale of colors represents the intensity of the amplitude. From there, the computer uses visual pattern recognition to identify important features and then classify them. This technique allows the computer to isolate the rumble of a plane flying overhead from an umpire’s whistle because these sounds form different patterns on the spectrogram. Once it recognizes these patterns, it can use the information about frequency and amplitude to create sounds in new combinations.
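To show how sound becomes a picture, here is a rough sketch (assuming NumPy) that builds a simple spectrogram from a made-up recording of a quiet low hum plus a louder high-pitched whistle, then reads the loudest frequency back off of it. Real audio tools use more refined versions of the same idea.

```python
import numpy as np

sample_rate = 8000                       # samples per second
t = np.arange(0, 1.0, 1 / sample_rate)   # one second of audio

# A made-up recording: a quiet 200 Hz hum plus a louder 2000 Hz whistle.
signal = 0.3 * np.sin(2 * np.pi * 200 * t) + 1.0 * np.sin(2 * np.pi * 2000 * t)

# Build a simple spectrogram: slice the signal into short windows and
# measure how much energy (amplitude) each frequency has in each slice.
window_size = 256
spectrogram = []
for start in range(0, len(signal) - window_size, window_size):
    window = signal[start:start + window_size]
    amplitudes = np.abs(np.fft.rfft(window))   # amplitude per frequency bin
    spectrogram.append(amplitudes)

spectrogram = np.array(spectrogram)            # rows = time, columns = frequency
freqs = np.fft.rfftfreq(window_size, 1 / sample_rate)
loudest = freqs[spectrogram.mean(axis=0).argmax()]
print(f"loudest frequency: about {loudest:.0f} Hz")   # near the 2000 Hz whistle
```

Because the hum and the whistle land in different rows of frequencies with different amplitudes, they form visibly different patterns in the grid, which is exactly what lets the computer tell the two sounds apart.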