Making Images Accessible with Generative AI
Elizaveta Cheremisina
Intern
Writing good alt text takes time and effort. It needs to be clear, meaningful, and capture the purpose of an image. With advancements in technology and the growing need for alternative text descriptions, driven in part by the EU Accessibility Act, we were inspired to explore how AI could support this effort. This led us to a key question: could large language models (LLMs) help automate the process of generating useful, context-aware captions for book illustrations? And if so, how well would they perform?
To find out, we ran an experiment using illustrations from two very different books: Alice’s Adventures in Wonderland, a classic full of surreal and whimsical imagery, and De aard van het beestje, a Dutch book where the illustrations are primarily used for comical effect. The goal was to see if LLMs could accurately describe images in accordance with the EU Accessibility Act guidelines.
In this article, we’ll walk through how we approached the project, what techniques we used, the results we got, and some of the challenges that came up along the way.
Let's dive in!
We quickly realised that this task was not a trivial one. Alt text must convey an image’s purpose, not just its content. But book illustrations pose unique hurdles:
Challenge 1: According to EU guidelines, alternative text should describe the purpose of the image rather than repeat what is stated in the caption (if present) or adjacent text. However, images often appear much earlier or later in the book than the scene they depict—creating a gap between an illustration and the text that actually supplies its context.
Challenge 2: How do we define the purpose of book illustrations? Are they purely decorative, requiring no alternative text? Or do they serve a role in creating a unique sensory experience, making at least a short description necessary?
Challenge 3: How can we provide enough context to ensure these images are described accurately?
For this project, we decided to treat book illustrations as requiring alt text. As for context, we tested both scenarios—with and without it.
To explore the potential of LLMs in automated image captioning, we selected illustrations from two distinct books:
- Alice’s Adventures in Wonderland – A world-famous classic, available for free on Project Gutenberg. Given its popularity, a big question here was whether LLMs already "know" enough about Alice to generate accurate captions without extra context—or if they still need guidance to get things right.
- De aard van het beestje by Frits Vaandrager – A Dutch book written by a biologist in which he compares common beliefs about animals with scientific insights into the way they live. Unlike Alice, it’s not a traditional work of fiction—there’s no storyline or characters in the usual sense, just observations and humor-driven illustrations. The images are often quite abstract, contributing to the book’s distinctive style. That raised an interesting question: Would adding context even make a difference, or would the AI generate the same kinds of captions regardless?
The project leveraged state-of-the-art LLMs to generate alternative text descriptions. The models tested were GPT-4o (OpenAI): chatgpt-4o-latest and gpt-4o-mini; Gemini (Google): gemini-2.0-flash-001 (stable); and Claude (Anthropic): Claude 3.5 Sonnet v2.
Each model generated captions under two conditions:
- No context: The model generated image captions based solely on the image and a prompt, following EU Accessibility Act guidelines.
- Summaries combined: The model received the image along with a prompt that included book chapter summaries. These summaries were structured with key characters, important scenes, notable quotes, and a character tracker for new appearances, ensuring context was effectively incorporated.
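The two conditions can be sketched as a single prompt-building step. This is a minimal illustration, not our exact prompts; `build_prompt` and the example strings are our own names for this sketch.

```python
# Sketch of the two prompting conditions: the same guidelines either
# stand alone (no context) or are combined with chapter summaries.
def build_prompt(guidelines, summaries=None):
    """Compose the captioning prompt; `summaries` is None in the
    no-context condition."""
    parts = [guidelines]
    if summaries:
        parts.append("Book context, chapter by chapter:")
        parts.extend(summaries)
    parts.append("Now describe the attached illustration as alt text.")
    return "\n\n".join(parts)

guidelines = "Write concise alt text following the EU Accessibility Act."
no_context = build_prompt(guidelines)
with_context = build_prompt(
    guidelines, ["Ch. 1: Alice follows the White Rabbit down the hole."]
)
```

Keeping the guidelines identical across both conditions means any difference in caption quality can be attributed to the added context rather than to prompt wording.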
Parameter Optimization
The default temperature (1.0) was compared to lower values (0.0 and 0.1) to evaluate its impact on performance.
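A temperature sweep like this can be expressed as one request payload per setting. The sketch below assumes an OpenAI-style Chat Completions payload; `build_request`, the model choice, and the example URL are illustrative, not our exact setup.

```python
# Hypothetical sketch: one captioning request per temperature setting.
TEMPERATURES = [1.0, 0.1, 0.0]  # default vs. the two lower values we compared

def build_request(image_url, prompt, temperature):
    """Assemble a chat-completion payload for a single image."""
    return {
        "model": "gpt-4o-mini",
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

requests = [
    build_request("https://example.com/alice.png",
                  "Write alt text following EU Accessibility Act guidelines.",
                  t)
    for t in TEMPERATURES
]
```

Everything except the temperature stays fixed across the three runs, so the captions can be compared like-for-like.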
Evaluation Criteria
The generated alt-texts were assessed based on:
- Accessibility Compliance – Did it meet the EU Accessibility Act’s standards for alt text?
- Accuracy – Did the description correctly represent the visual elements?
- Contextual Relevance – Was the model able to capture the context of the book, if provided?
Each generated caption was assigned one of four labels (from best to worst), and final scores were calculated for each model:
- Correct – Accurate, well-formed, and ready to use without edits.
- Mediocre – Recognizes key details like character names and scene context but needs minor edits for clarity, conciseness, or phrasing improvements.
- Generic – Factually correct but lacking key details like character names or story elements, making the description too generic (e.g., “a young girl” instead of “Alice”).
- Incorrect – Contains one or more factual errors or does not comply with EU guidelines.
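Turning the per-caption labels into a per-model score can be done with a simple tally. This is one plausible aggregation (label shares in percent); the exact scoring we used may differ.

```python
# Sketch: share of each quality label across one model's captions.
from collections import Counter

LABELS = ["Correct", "Mediocre", "Generic", "Incorrect"]

def score(labels):
    """Return the percentage of captions per label for one model."""
    counts = Counter(labels)
    total = len(labels)
    return {lab: round(100 * counts[lab] / total, 1) for lab in LABELS}

# Hypothetical annotation run for a single model:
example = score(["Correct", "Correct", "Generic", "Incorrect"])
```

Reporting all four shares, rather than a single accuracy number, keeps the distinction between "merely generic" and "actually wrong" visible when comparing models.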
LLM Alt Captioning in Action
First Experiments with GPT Models
We began by running our first exploratory experiments with GPT models on Alice in Wonderland to get a sense of what to expect using a very simple setup: default settings, just the image, and some prompt variations.
We discovered that LLMs tended to be overly verbose and inconsistent. Although they recognized the book (as expected) and sometimes identified some of the characters, many descriptions remained too generic, and character names were sometimes incorrect. These insights helped us refine our approach as we later expanded our experiments to include other models.
We also discovered differences between the two GPT models: gpt-4o-mini tends to give more generic descriptions, while chatgpt-4o-latest tends to give more concrete ones, though these often contain incorrect information (see the graph below). Since our goal is for descriptions to accurately reflect character names and scenes, the need to incorporate context became undeniable.
Let’s take a look at the most common mistakes:
Incorrect information:
GPT-4o-mini: "Humpty Dumpty" and a young girl in a colourful garden setting.
GPT-4o-latest: "A girl in a white dress stands beside a figure resembling the Queen of Hearts in a garden surrounded by flowers and greenery."
Caption in the book: "The Duchess tucked her arm affectionately into Alice's."
Correct description: Alice walks with the Duchess in a lush garden filled with flowers and trimmed hedges.
Generic description:
GPT-4o-mini: A young girl watches as a woman stirs a pot, while an older woman holds a baby and a cat sits nearby.
GPT-4o-latest: A young girl faces a seated woman with a child, an older woman holding a ladle, and a striped cat in a rustic kitchen setting.
Caption: Alice in the Room of the Duchess.
Correct description: Alice stands in a kitchen facing a stern cook, a large red-clad Duchess holding a crying baby, and a grinning Cheshire Cat on the floor.
GPT-4o-mini: Two frogs in aristocratic attire interact with a large envelope adorned with a crown.
GPT-4o-latest: Two anthropomorphic fish, dressed in 18th-century attire, exchange a large envelope with a royal seal, illustration by Sir John Tenniel for "Alice's Adventures in Wonderland".
Correct description: The Fish-Footman hands over a great letter to the Frog-Footman.
Further Experiments
To address the issues identified during our initial testing phase, we decided to lower the temperature setting, create summaries of each chapter highlighting key elements, and include all summaries in the final prompt. This approach allowed us to incorporate the book's context and ensure that enough context was provided to achieve accurate image description (addressing Challenge 3). Additionally, we adapted the best-performing prompt for other models. See the results below:
We identified two clear leaders: ChatGPT-4o-Latest and Gemini, both of which produced no generic descriptions and had a very low percentage of incorrect ones. In contrast, Claude’s performance was surprisingly poor.
However, we knew we had to stay skeptical—since this experiment was based on a well-known English classic, the models may have had an advantage. Would their performance decline with the Dutch book? And if so, by how much? Would context be as crucial here as it was for Alice?
See the results below:
Indeed, this book posed a greater challenge not just for the models but also for our annotators, as some images were quite difficult to decipher without having read the book or examining them more closely.
Overall, we observed a decline in performance across all models and found that context played a less significant role for this particular book—though it still provided a slight improvement.
Let's take a look at some common errors:
GPT: Een groep dieren, waaronder een hond en een mens, rijdt op een motorfiets in een dynamische en speelse setting.
(Eng: A group of animals, including a dog and a human, is riding a motorcycle in a dynamic and playful setting.)
Gemini: In de afbeelding rijden antropomorfe dieren op motorfietsen over het water.
(Eng: In the image, anthropomorphic animals are riding motorcycles over the water.)
Claude: Abstracte zwart-witte compositie met vloeiende, golvende vormen die doen denken aan dansende figuren in beweging.
(Eng: Abstract black-and-white composition with flowing, wavy shapes reminiscent of dancing figures in motion.)
GPT: Twee antropomorfe vliegen doen de was, één wast een gestreepte sok in een teil en de ander strijkt een kledingstuk.
(Eng: Two anthropomorphic flies are doing the laundry; one is washing a striped sock in a tub, while the other is ironing a piece of clothing.)
Gemini: Twee vliegen, waarvan er één een gestreepte handdoek in een kom doopt en de andere op een strandstoel zit.
(Eng: Two flies, one dipping a striped towel into a bowl and the other sitting on a beach chair.)
Claude: Twee schetsmatige figuren met grote ronde hoofden: één schenkt een vloeistof uit een fles in een kom, de andere zit op een stoel.
(Eng: Two sketch-like figures with large round heads: one pours a liquid from a bottle into a bowl, while the other sits on a chair.)
Our learnings
Throughout our experiments, we encountered several challenges that impacted the quality of AI-generated alt text. Below are the key obstacles we faced and the strategies we implemented to address them.
Learning 1: Not all the illustrations play a key role in the story
Not all images have a function. Some are purely decorative and may not require alt text, while others contribute to the storytelling or enhance the reader’s understanding. Determining when a description was necessary—and how detailed it should be—was a crucial consideration.
For this project, we decided to treat all illustrations as requiring at least a short alt text, erring on the side of inclusion. However, this decision ultimately remains at the editor's discretion.
Learning 2: Illustration Placement vs. Story Timing
In many books, illustrations are not directly connected to the adjacent text, making it difficult to extract the relevant context needed for accurate descriptions. This presents an engineering challenge—without a clear link between the image and its corresponding textual reference, determining which part of the book provides the most useful context becomes a complex task.
That’s why we introduced structured chapter summaries in the prompts, highlighting key characters, notable scenes, and essential details. This approach significantly improved the relevance and accuracy of the generated descriptions, ensuring the AI had enough information to place each illustration within its proper narrative context.
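The structured summaries described above lend themselves to a small schema. The field names below mirror the elements we listed (key characters, scenes, quotes, character tracker) but are our own naming for this sketch, not the exact format we used.

```python
# Sketch of a structured chapter summary that can be rendered into the prompt.
from dataclasses import dataclass, field

@dataclass
class ChapterSummary:
    title: str
    key_characters: list
    important_scenes: list
    notable_quotes: list = field(default_factory=list)
    new_characters: list = field(default_factory=list)  # character tracker

    def to_prompt(self):
        """Render the summary as plain text for inclusion in the prompt."""
        lines = [
            f"Chapter: {self.title}",
            "Characters: " + ", ".join(self.key_characters),
            "Scenes: " + "; ".join(self.important_scenes),
        ]
        if self.new_characters:
            lines.append("First appearances: " + ", ".join(self.new_characters))
        return "\n".join(lines)

ch1 = ChapterSummary(
    title="Down the Rabbit-Hole",
    key_characters=["Alice", "White Rabbit"],
    important_scenes=["Alice falls down the rabbit hole"],
    new_characters=["White Rabbit"],
)
```

Tracking first appearances explicitly is what lets the model connect a figure in an illustration to a named character even when that character was introduced chapters earlier.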
Learning 3: Optimise model parameters to address their specific limitations
Different LLMs exhibited distinct tendencies—GPT-4o Mini produced a lot of generic descriptions, while GPT-4o Latest and Gemini seemed to have a good balance between correct and incorrect captions. Claude's overall performance was notably poor, and the reasons for this are yet to be investigated. All models struggled with recognizing less mainstream content, as seen in the Dutch book De aard van het beestje.
We refined prompts, adjusted temperature settings, and compared models to find the most reliable option. Contextual summaries reduced errors, but human oversight remained essential. The final model choice also depends on factors like cost, context window, and usability.
Learning 4: Balancing Detail and Readability
LLMs often generated overly verbose captions, including unnecessary details that could create auditory clutter for users relying on screen readers.
We experimented with prompt phrasing to encourage concise yet descriptive outputs. Lowering the temperature setting also helped reduce verbosity, leading to more balanced descriptions.
Final Thoughts
Prompted by the EU Accessibility Act, the goal of this project was to see just how well modern LLMs can handle automated image captioning for book illustrations. While the models generate captions that are often relevant, they can be inconsistent, overly generic, or flat-out incorrect—especially when it comes to lesser-known works or abstract images.
The solutions we explored, such as structured prompts and context incorporation, helped improve performance. However, to ensure the descriptions truly capture the right details—like character names and scene context—human oversight remains crucial. The results give us a solid baseline, but there’s plenty of room for improvement—better prompt engineering and smarter context integration could take things further. Future research could explore using the entire book as context instead of summaries. Additionally, LLMs could act as judges to assess caption quality, automating evaluation and further reducing manual effort. Ultimately, LLMs are a powerful tool, but they’re not quite ready to go solo just yet.