Artificial Intelligence (AI) is a way to train computers to carry out a range of complex tasks and to learn from these tasks. It uses a process called machine learning to identify patterns in images, texts, and other materials. This process can involve human input or be fully automated. Generative Artificial Intelligence (genAI) is a specific kind of AI that is designed to generate new text, images, video, or audio based on the content that it has studied.
In order for the computer to generate new material, it needs to study a large collection of existing examples, identify the patterns in those examples, and then use those patterns to create new combinations.
Let’s take a closer look at this process for text, images, and sound.
Generative AI relies on pretrained models to reduce the computing power needed to create text, images, and other outputs. While pre-training makes tools faster and easier to use, it means that some tools do not have access to information in real time. For example, the free version of ChatGPT has studied text authored before September 2021 and has no context for events after that date. Even if a GPT is connected to a search engine, it can only link to sources available on the internet. As any historian will tell you, there are lots of sources that are not available digitally, and the internet is an expansive, but incomplete, representation of all human knowledge and creative works. Searching the library’s catalog offers more robust access to scholarly works.
Likewise, image and audio-based AI reflect the limits of their training sets. Image recognition tools trained on pictures from the early 2000s onward might misidentify objects in photos from the 1890s.
OpenAI's page "What is ChatGPT?" addresses the limits of the knowledge it draws from its training data.
Tools like ChatGPT and Microsoft Copilot are GPTs, or generative pretrained transformers. They’re generative in that they’re designed to generate new text, which is awesome if you’re trying to outline a cover letter, but not so great if you’re trying to quote and cite published scholarly articles.
These tools are pre-trained: they have studied large amounts of text authored by humans in order to understand the relationships between individual words. For example, they can see that the words “plots,” “corn,” and “experiment” tend to appear in paragraphs about the Morrow Plots, our experimental corn field, but these algorithms also know that “corn” can appear in a chowder recipe.
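To make that idea concrete, here is a minimal sketch in Python of counting which words appear together. The three sentences are made up, and counting co-occurrences within a sentence is a deliberately simplified stand-in for how real language models learn these relationships.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: a few made-up sentences about the Morrow Plots and a chowder recipe.
sentences = [
    "the morrow plots are an experiment in growing corn",
    "researchers track corn yields on the experimental plots",
    "this chowder recipe calls for corn potatoes and cream",
]

# Count how often pairs of words appear in the same sentence.
pair_counts = Counter()
for sentence in sentences:
    words = set(sentence.split())
    for pair in combinations(sorted(words), 2):
        pair_counts[pair] += 1

# "corn" shows up alongside "plots" and "experiment" as well as "chowder" and "recipe".
for pair, count in pair_counts.most_common():
    if "corn" in pair:
        print(pair, count)
```

Even in this tiny example, “corn” keeps company with both “plots” and “chowder,” which is exactly the kind of ambiguity a larger model has to learn to handle.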
The computer’s understanding of words and their relationships is called a language model. Because these models are trained on hundreds of thousands of texts, we refer to them as large language models, or LLMs.
Finally, these tools are transformers. They use what they know about combinations of words to create new combinations that sound plausible. The transformer algorithm generates new text one word at a time, adding an element of randomness to mimic human creativity.
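The sketch below shows the “one word at a time, with a dash of randomness” idea in miniature. The probability table is invented for illustration; a real transformer computes these probabilities from everything it studied during pre-training rather than looking them up in a hand-written dictionary.

```python
import random

# Toy "model": made-up probabilities for which word follows the current one.
# A real transformer learns these weights from enormous amounts of text; the
# numbers here are invented purely for illustration.
next_word_probs = {
    "the":   {"corn": 0.5, "experiment": 0.3, "plots": 0.2},
    "corn":  {"grows": 0.6, "chowder": 0.4},
    "grows": {"quickly": 0.7, "slowly": 0.3},
}

def generate(start, length=4):
    words = [start]
    for _ in range(length):
        options = next_word_probs.get(words[-1])
        if not options:
            break
        # random.choices adds the element of randomness: likelier words are
        # picked more often, but not always, so each run can differ.
        words.append(random.choices(list(options), weights=options.values())[0])
    return " ".join(words)

print(generate("the"))
```

Run it a few times and you will get slightly different sentences, which is part of why the same prompt rarely returns the exact same answer twice.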
Digital Humanities Librarian Mary Ton deconstructs GPTs in her "What is a GPT?" video.
Instead of looking for relationships between words, computers “see” an image by looking at pixels, the basic building blocks of digital images. The computer will compare one pixel to the ones that surround it, paying close attention to colors, outlines, and texture. During the training process, the computer learns how to identify parts of an image that are important (feature recognition) and then categorize them (classification). Tools like Adobe Acrobat and ABBYY FineReader use optical character recognition to identify the shapes of letters and create transcriptions of text.
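Here is a minimal sketch of that pixel-by-pixel comparison, using a tiny made-up grid of brightness values in place of a real photo. Real feature recognition relies on filters the computer learns during training; this hand-written check only illustrates the idea of finding an outline where neighboring pixels differ sharply.

```python
# A tiny grayscale "image" as a grid of brightness values (0 = dark, 9 = bright).
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
]

edges = []
for row in range(len(image)):
    for col in range(len(image[0]) - 1):
        # A big jump between a pixel and its right-hand neighbor suggests an edge.
        if abs(image[row][col] - image[row][col + 1]) > 4:
            edges.append((row, col))

print(edges)  # positions where the bright square meets the dark background
```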
If you have ever had to prove you’re not a robot by clicking all the squares with a fire hydrant or by typing the letters that appear in a picture, you’ve helped a computer identify features and classify them. We refer to human intervention in the training process as “supervised learning,” but computers can also engage in “unsupervised learning” by checking their work without having to ask a human.
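The toy sketch below contrasts the two approaches. The single “hydrant-likeness” score, the labels, and the grouping rule are all invented for illustration; real systems work with thousands of features and far more examples.

```python
# A toy contrast between supervised and unsupervised learning. Each number is a
# made-up score for how "hydrant-like" an image patch looks (0-10).

# Supervised learning: a human has already labeled each example.
labeled = [(9, "hydrant"), (8, "hydrant"), (2, "not"), (1, "not")]
hydrant_scores = [score for score, label in labeled if label == "hydrant"]
not_scores = [score for score, label in labeled if label == "not"]
# "Learn" a cutoff halfway between the two labeled groups.
threshold = (sum(hydrant_scores) / len(hydrant_scores)
             + sum(not_scores) / len(not_scores)) / 2

# Unsupervised learning: no labels, so the computer groups similar scores itself.
unlabeled = [8, 1, 9, 2]
groups = {"high": [s for s in unlabeled if s >= 5],
          "low": [s for s in unlabeled if s < 5]}

print(threshold)  # 5.0, learned from the human-provided labels
print(groups)     # groupings found without asking a human
```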
As part of the training process, the computer will add random pixels to an image, then test to see if it can still recognize the objects in the picture. This digital noise helps the computer to create a mathematical representation of what it sees as the essence of an object. For example, it might associate the phrase “tabby cat” with a formula for an object with a round shape, two pointy ears, and patterns in black, brown, and white. When generating images, the computer uses that mathematical representation, plus what it learned while adding noise, to create an outline and then add details to the picture, spreading them throughout the image in a process known as diffusion, the technique behind tools like Stable Diffusion.
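Below is a minimal sketch of the noise-adding half of that process, applied to a tiny grid of brightness values rather than a real photograph. Training teaches the model to undo these steps; that reverse, image-generating half is not shown here.

```python
import random

# A tiny grayscale "image" of a bright square on a dark background (values 0-9).
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

def add_noise(img, amount):
    """Nudge every pixel by a random amount, keeping values in the 0-9 range."""
    return [[max(0, min(9, px + random.randint(-amount, amount))) for px in row]
            for row in img]

# Each pass adds more digital noise; a diffusion model is trained to reverse
# these steps, which is what later lets it start from pure noise and work its
# way back to a brand-new picture.
noisy = image
for step in range(3):
    noisy = add_noise(noisy, amount=step + 1)
    print(f"after step {step + 1}:", noisy)
```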
Computers use this process to generate 2D images as well as 3D objects. Tools like Polycam can take photos of one object from many different angles and stitch them together using a technique called photogrammetry.
Surprisingly, AI uses image recognition techniques to identify and recreate sounds. The two aspects of sound that are most important for generative AI are frequency (Is this a high-pitched screech or a low growl?) and amplitude (Is this a whisper or a shout?). The computer visualizes sound as a spectrogram: time runs along the x-axis, frequencies are stacked along the y-axis, and a sliding scale of colors represents the intensity of the amplitude. From there, the computer uses visual pattern recognition to identify important features and then classify them. This technique allows the computer to isolate the rumble of a plane flying overhead from an umpire’s whistle because these sounds form different patterns on the spectrogram. Once it recognizes these patterns, it can then use the information about frequency and amplitude to create sounds in new combinations.
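Here is a minimal sketch of how a spectrogram can be built with NumPy. The test signal, which mixes a low 100 Hz hum with a high 2000 Hz whine, is made up for illustration, and real audio tools apply more careful windowing before turning the result into the colorful image described above.

```python
import numpy as np

# A made-up one-second signal: a low 100 Hz hum plus a quieter 2000 Hz whine.
sample_rate = 8000                       # samples per second (an assumption)
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

# Slice the signal into short windows and measure how strong each frequency is
# in every window; rows become time steps, columns become frequency bands.
window = 256
spectrogram = np.array([
    np.abs(np.fft.rfft(signal[start:start + window]))
    for start in range(0, len(signal) - window, window)
])

# The two sounds show up as two separate bright bands of frequencies.
freqs = np.fft.rfftfreq(window, d=1 / sample_rate)
strongest = freqs[spectrogram.mean(axis=0).argsort()[-2:]]
print(sorted(strongest))  # the two strongest bands, near 100 Hz and 2000 Hz
```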