This page details 'The Art Coach,' my capstone project for the Google AI Intensive. Developed within a Kaggle notebook, this project features an innovative AI art coach powered by a Large Language Model (LLM). It's designed to provide artists with timed creative prompts from classic texts, analyse their uploaded artwork and notes, and offer personalised suggestions for follow-on activities.
Between 31/3/25 and 4/4/25, I undertook an AI Intensive with Google to augment my learning on the HCI for AI course. It was fascinating and genuinely intensive. Each day, 250,000 other would-be AI developers and I immersed ourselves in white papers and Kaggle notebooks, discussing the history of AI to date, embeddings and vector stores, LLMs for domain-specific problems, generative AI agents and MLOps.
The capstone project is a Kaggle Notebook that demonstrates at least three capabilities from a list covering what we learnt.
Everyone occasionally experiences fear of the blank page, and even with thousands of ideas in mind, it can be challenging to know where to begin. My app aims to help artists create better art through creative prompts and an always-available AI art coach.
My Kaggle notebook uses the LLM as an AI art coach who will provide timed creative prompts and exercises inspired by several curated classic textbooks. The artist can complete the exercise 'offline' for the allotted time, then photograph and upload their work. The coach will analyse the user's uploaded work and any accompanying notes and suggest interesting and enjoyable follow-on activities.
This is designed for all creatives - sketchbookers, comic artists, visual journalers, makers of all sorts, both noobs and experts. Everyone and anyone who creates for fun or profit (or wants to start), will enjoy unleashing their creativity for a few minutes (or more) each day.
A short video demo on YouTube showing the functionality in action.
The system's end user (we'll call them 'the artist') is looking for inspiration, and a time-constrained exercise is the perfect way to connect to the muses! They ask the system (LLM) for a prompt to get them going. The artist makes (draws / builds / arranges / whatever) something, takes a photo and uploads it. Perhaps in the process of making, they see something in what they've done that stimulates further ideas, or perhaps not. If they want, they can note, for example, something that interests them or a title. The LLM analysis phase should stimulate further ideas by description, not by value judgements. The functionality as a whole seeks to stimulate creativity, not to make subjective assessments of good or bad. Finally, the artist might like creative suggestions for directions they could take further work.
1. Request Prompt: Artist asks the system (LLM) for a prompt. (System uses RAG).
2. Create Artwork: Artist completes the exercise (drawing, etc.).
3. Upload Artwork: Artist takes a photo and uploads it. (Optional: Artist might add notes about intent/focus during upload).
4. Analysis (Descriptive): System provides a description of the uploaded artwork, focusing on visual elements rather than critique or suggestions. (May use user notes if provided).
5. Follow-on Suggestions: Based on the analysis, the system provides creative suggestions for further work or exploration.
Full disclosure: I am not a Python expert. I leaned heavily on Gemini 2.5 Pro to make this notebook happen. However, this process has cemented my learning from the intensive, educated me in Kaggle, and expanded my Python.
Other technologies include Google AI models for generation, embeddings and analysis, ChromaDB for vector storage, pypdf / Pillow for input processing, and LangChain components. We (Gemini 2.5 Pro and I) hit some dependency conflicts in Kaggle but were able to work around them. More about that in the notebook.
A 'vanilla' baseline prompt is generated by few-shot prompting and includes a didactic paragraph and an exercise prompt drawn only from the model's training data. This serves as a baseline to compare with the RAG-grounded prompt generated at the end of the process. It was tested, and results were stored in a spreadsheet to avoid duplicate prompts.
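A minimal sketch of what such a few-shot baseline call might look like, assuming the google-generativeai SDK with an API key already configured; the example exchanges and model name below are placeholders rather than the curated examples used in the notebook:

```python
import google.generativeai as genai  # assumes genai.configure(api_key=...) has already run

base_persona = """You are an art coach - serious and concise. You understand that artists require structure but hate to be explicitly told what to do."""

# Placeholder few-shot examples showing the expected shape of the output:
# a short didactic paragraph followed by a timed exercise.
few_shot_examples = """
EXAMPLE 1:
Contour drawing slows the eye down and trains hand-eye coordination.
Exercise: Draw your non-dominant hand using a single continuous line. Time limit: 5 minutes.

EXAMPLE 2:
Negative space helps you see shapes rather than named objects.
Exercise: Draw only the spaces between the legs of a chair. Time limit: 10 minutes.
"""

model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative
baseline_response = model.generate_content(
    f"{base_persona}\n{few_shot_examples}\n"
    "Now write a new didactic paragraph and a related timed exercise in the same format."
)
print(baseline_response.text)
```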
The final LLM prompt was generated using RAG for inspiration, as described below.
PDFs were scanned using OCR in Adobe Acrobat and uploaded to a data store. The text was then extracted and split into chunks. Text chunks were then embedded and stored in a ChromaDB vector database.
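A sketch of this ingestion step, assuming pypdf, the google-generativeai embedding endpoint, and the native ChromaDB client described later; the file name, chunk size and collection name are illustrative:

```python
import chromadb
import google.generativeai as genai
from pypdf import PdfReader  # assumes the PDFs already contain an OCR text layer

# Extract the OCR'd text from one of the curated books (file name is illustrative).
reader = PdfReader("keys_to_drawing.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Naive fixed-size chunking; the notebook's exact chunking strategy may differ.
chunk_size = 1500
chunks = [full_text[i:i + chunk_size] for i in range(0, len(full_text), chunk_size)]

# Embed each chunk and store it in a persistent ChromaDB collection.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="art_books")

for i, chunk in enumerate(chunks):
    embedding = genai.embed_content(
        model="models/text-embedding-004",
        content=chunk,
        task_type="retrieval_document",
    )["embedding"]
    collection.add(ids=[f"keys_to_drawing_{i}"], documents=[chunk], embeddings=[embedding])
```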
A random topic is selected from a predefined list of art topics to generate the creative prompt. These were generated directly from the texts by NotebookLM and edited slightly for sense. This random art_topic was embedded to create a query vector, and ChromaDB was queried to retrieve the top three most relevant text chunks from the books. These were saved to a context_string, which was included in the prompt sent to the final LLM. The LLM was then instructed to use this retrieved content to generate a didactic paragraph about a topic or technique and a related art exercise.
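A sketch of the retrieval and generation step under the same assumptions; the topic list stands in for the NotebookLM-derived list, and the prompt wording is illustrative:

```python
import random

import chromadb
import google.generativeai as genai

# Stand-in for the NotebookLM-derived list of art topics.
art_topics = ["negative space", "one-point perspective", "gesture drawing"]
art_topic = random.choice(art_topics)

# Embed the topic as a query vector and retrieve the three most relevant chunks.
query_embedding = genai.embed_content(
    model="models/text-embedding-004",
    content=art_topic,
    task_type="retrieval_query",
)["embedding"]

collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("art_books")
results = collection.query(query_embeddings=[query_embedding], n_results=3)
context_string = "\n\n".join(results["documents"][0])

# Ask the model to ground its didactic paragraph and exercise in the retrieved context.
model = genai.GenerativeModel("gemini-1.5-flash")
final_response = model.generate_content(
    f"Using only the following context, write a short didactic paragraph about "
    f"{art_topic} and a related timed exercise.\n\nCONTEXT:\n{context_string}"
)
print(final_response.text)
```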
✔️ Few-shot prompting
✔️ Document understanding
✔️ Image understanding
✔️ Grounding (via RAG)
✔️ Embeddings
✔️ Retrieval augmented generation (RAG)
✔️ Vector search / vector store / vector database
Prompt examples were adapted from the highly recommended Ways of Drawing: Artists' Perspectives and Practices (2023), which is available here.
base_persona = """You are an art coach - serious and concise. You understand that artists require structure but hate to be explicitly told what to do."""

AI Studio
Google AI Studio being used to analyse prompts and compare model versions in A/B fashion
Prompt Change Log
Turning the temperature and 'top-p' up generated some unexpected but enthusiastic results
Initial Idea - FAISS: We first tried to set up FAISS using LangChain's wrappers (langchain, langchain-google-genai, faiss-cpu). This resulted in dependency conflicts, particularly between the required version of google-ai-generativelanguage for google-generativeai and the version needed by the LangChain Google integration.
Alternative - ChromaDB with LangChain: We then tried switching to ChromaDB, still using the LangChain wrappers (langchain-chroma, langchain-google-genai). This solved the first conflict but introduced a new one: the ImportError: cannot import name 'set_config_context' which indicated incompatible versions within the different LangChain packages themselves (langchain-core vs langchain-google-genai). Even trying to install/upgrade all LangChain components together didn't fix this in the Kaggle environment.
Solution - Direct ChromaDB: To get past these persistent LangChain dependency/import errors, we decided to bypass the LangChain integration layer entirely for the vector store step. We installed only the chromadb library itself (which installed okay, despite unrelated warnings) and used its native Python API (chromadb.PersistentClient(), collection.add(), collection.query()) to load and search the embeddings. This avoided needing langchain-chroma, langchain-google-genai, langchain-core etc., thus sidestepping the errors they were causing.
So, we switched to direct ChromaDB to avoid the unresolvable dependency and import errors caused by trying to use LangChain's vector store integrations within the specific constraints of the Kaggle environment.
I would love to deploy this as a web app here on my website for all to enjoy! But this will require further work, and billing and potential copyright issues must be considered.
Currently, the art topic used to generate the query is randomly selected from a set of hard-coded topics derived from the texts. In future, we could use a more dynamic approach. Ask an LLM (could be the same Gemini model or another instance) to first generate a relevant art topic or technique based on the art coach persona's goals (e.g., "Suggest a drawing topic suitable for a 5-minute exercise"). Use that generated topic as the query for the vector store retrieval. This adds another LLM call but makes the topic selection more generative.
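A minimal sketch of that dynamic alternative, assuming the same SDK; the wording of the topic request is illustrative:

```python
import google.generativeai as genai

# Ask the model for a topic first, rather than picking one from the hard-coded list.
topic_response = genai.GenerativeModel("gemini-1.5-flash").generate_content(
    "Suggest one drawing topic or technique suitable for a 5-minute exercise. "
    "Reply with the topic only."
)
art_topic = topic_response.text.strip()
# art_topic then replaces the random choice as the query for the vector store retrieval.
```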
It might also be nice to allow users to choose whether they want a technical/observational prompt or an imaginative one, or even choose from the topic list, with this choice influencing the RAG query and the final generation instructions. However, the simplicity and serendipity of just requesting a prompt might be compromised by a direct selection.
The original idea was that a timer would be set based on the time given for the exercise by the prompt, which could have demonstrated Function Calling from the LLM. It was initially deployed in the simplest way possible: LLM Response -> Parse Time -> Call Timer Function, executed sequentially within a single notebook cell right after the prompt is shown. The user experience was not good, though, and it did not use Function Calling, so I parked it.
Deploying using Function Calling would involve defining the countdown Python function, creating a schema/tool definition describing that function to the LLM, and then instructing the LLM (when generating the art prompt) also to call the countdown function with the appropriate duration. The function call returned by the LLM would then be handled and the Python countdown function executed. This would add a different kind of complexity to the final LLM call step.
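A hedged sketch of that flow with the google-generativeai SDK, which can build a tool schema from an annotated Python function; the blocking countdown and the handling logic here are illustrative only:

```python
import time

import google.generativeai as genai

def countdown(minutes: float):
    """Block for the given number of minutes - a stand-in for a real timer UI."""
    time.sleep(minutes * 60)

# The SDK derives a tool/schema definition from the function signature and docstring.
model = genai.GenerativeModel("gemini-1.5-flash", tools=[countdown])
response = model.generate_content(
    "Give the artist a timed drawing exercise, then call the countdown tool "
    "with the exercise's duration in minutes."
)

# Handle any function call the model returned and run the local countdown.
for part in response.parts:
    if fn := part.function_call:
        if fn.name == "countdown":
            countdown(**dict(fn.args))
```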
Should the idea be implemented as a web app, a JavaScript timer on the client side would be significantly more straightforward than a timer in the Kaggle Notebook environment.
Some of the PDFs were initially not scanned well, and as a result, the OCR generated a lot of text artefacts. I have ignored this for the purposes of this exercise. However, the usefulness of the texts would be vastly improved by cleaning the OCR output and, even more so, by deciphering the images. This would require image analysis but is possible and would make the RAG-retrieved documents even more powerful and useful to the artist.
Modifying the final LLM prompt to explicitly ask the model to mention which source document informed its didactic paragraph would be more transparent and better academic practice. It would also encourage users to review the texts themselves to find related and tangential information.
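One way to support this, building on the ingestion and retrieval sketches above with the native ChromaDB API, would be to store the source title as metadata at ingestion time and surface it when assembling the context; the field names and source label are illustrative:

```python
# At ingestion time, record which book each chunk came from.
collection.add(
    ids=[f"keys_to_drawing_{i}"],
    documents=[chunk],
    embeddings=[embedding],
    metadatas=[{"source": "Keys to Drawing (Dodson, 1985)"}],
)

# At query time, label each retrieved chunk with its source so the final prompt
# can ask the model to cite the book that informed its didactic paragraph.
results = collection.query(query_embeddings=[query_embedding], n_results=3)
context_string = "\n\n".join(
    f"[{meta['source']}]\n{doc}"
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
)
```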
Following on from the above point, continuity between the initial input from the prompt through to the analysis of the uploaded image would give the user a more coherent experience working from start to finish on the exercise. For example, if the prompt was "Draw your non-dominant hand using a continuous line," the analysis should ideally assess how well the user tackled that specific task based on the image, not just analyse the image in isolation.
The proposed solution:
Store the prompt: Immediately after generating the initial art prompt (final_response.text in the RAG generation cell), we need to store that text in a Python variable (e.g., generated_art_prompt_text = final_response.text) and pass it forward so that this variable is available when the user eventually uploads their image and triggers the analysis.
Modify Analysis Prompt: Modify the analysis_prompt to include this stored generated_art_prompt_text. The instructions to the analysis model would then change to something like: "The user was given the following task: '{generated_art_prompt_text}'. Now analyse the uploaded artwork image based on both its visual qualities (composition, technique, etc.) and how well it seems to address the specific task given in the initial prompt."
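A sketch of those two steps together, assuming the google-generativeai SDK and Pillow; the file name and model name are illustrative:

```python
import google.generativeai as genai
from PIL import Image

# Step 1: store the generated prompt as soon as it is produced.
generated_art_prompt_text = final_response.text

# Step 2: when the artist uploads their photo, fold the stored prompt into the
# analysis instructions alongside the image itself.
artwork = Image.open("uploaded_artwork.jpg")
analysis_prompt = (
    f"The user was given the following task: '{generated_art_prompt_text}'. "
    "Now analyse the uploaded artwork image based on both its visual qualities "
    "(composition, technique, etc.) and how well it seems to address the specific "
    "task given in the initial prompt."
)

model = genai.GenerativeModel("gemini-1.5-flash")
analysis_response = model.generate_content([analysis_prompt, artwork])
print(analysis_response.text)
```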
After the image analysis, it could be useful for the user to ask follow-up questions. At the moment, the analysis is limited to factual descriptions of what was uploaded, but once this is extended to more nuanced interpretations of the work, further discussion would allow the user to ask follow-up questions like 'tell me more about the composition' or 'how is the perspective in this piece?'
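A minimal sketch of how follow-up questions could work, assuming a chat session is started from the analysis step above so the image and analysis stay in context:

```python
import google.generativeai as genai
from PIL import Image

# Re-using the analysis_prompt and uploaded image from the sketch above.
chat = genai.GenerativeModel("gemini-1.5-flash").start_chat()
chat.send_message([analysis_prompt, Image.open("uploaded_artwork.jpg")])

# The artist can then keep the conversation going about the same piece.
reply = chat.send_message("Tell me more about the composition in this piece.")
print(reply.text)
```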
The user could select a framework from a selection like Formalism, Iconography, Psychoanalytic, Feminist, Marxist or Contextual analysis, and the model would use the chosen perspective to analyse the upload. Some research has been done on this already, and incorporating learnings from it might be a powerful addition to this little piece of functionality.
References:
Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen. (2024). GalleryGPT: Analyzing Paintings with Large Multimodal Models. arXiv preprint arXiv:2408.00491v1. https://arxiv.org/abs/2408.00491
Afshin Khadangi, Amir Sartipi, Igor Tchappi, Gilbert Fridgen. (2025). CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements. arXiv preprint arXiv:2502.04353v1. https://arxiv.org/abs/2502.04353
At the beginning of the process, the user could select artists or artistic movements they are interested in, and in this way, the model would learn the user's preferences and skew its analysis towards this. This would require research to find a suitable set of inputs, then an input on the UI end, and also some prompt modification work.
Art teachers often refer students to other artists in order to illustrate visual or conceptual themes. Implementing this would require modifying the results prompt to include such references, and care and testing would be needed to evaluate whether the results were relevant and useful.
If implemented beyond the Kaggle Notebook, it would be essential to consider evaluating and testing the application. Gemini suggested the following automatic evals as being relevant to my project; the user tests are drawn from my professional experience in UXR and HCI.
These check whether the output adheres to the structural requirements defined for each step. They are typically implemented with Python code (e.g., using string methods, regular expressions, or simple classifiers); a sketch of such checks follows the list below.
a.) For the initial prompt generation:
Structure check: Does the output contain distinct parts for didactic info and the exercise prompt? (e.g., check for keywords, paragraph breaks).
Time limit check: Does the prompt include a time limit? Does the number fall within the expected range (0.5-30 min)? (Use regex).
Length check: Does the prompt text fall within the desired sentence/character count?
b.) For image analysis:
Completeness check: Does the analysis text mention keywords related to all the requested aspects (Subject, Composition, Technique, Light, Suggestions, User Note)? (Use keyword spotting or simple NLP).
c.) For follow-on suggestions:
Suggestion count check: Does the output contain 1 or 2 distinct suggestions? (Check for numbering or paragraph breaks).
Actionability Check (Harder): Try to identify action verbs.
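A sketch of such rule-based checks for the prompt-generation step; the regular expression, separators and thresholds are illustrative:

```python
import re

def check_generated_prompt(prompt_text: str) -> dict:
    """Simple structural evals for a generated didactic paragraph + exercise prompt."""
    checks = {}

    # Time limit check: find a duration and confirm it falls in the 0.5-30 minute range.
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:minutes?|mins?)\b", prompt_text, re.IGNORECASE)
    checks["has_time_limit"] = match is not None
    checks["time_in_range"] = bool(match) and 0.5 <= float(match.group(1)) <= 30

    # Structure check: expect at least two distinct parts (didactic info, exercise).
    parts = [p for p in prompt_text.split("\n\n") if p.strip()]
    checks["has_two_parts"] = len(parts) >= 2

    # Length check: keep the whole output within a rough character budget.
    checks["length_ok"] = len(prompt_text) <= 1200

    return checks
```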
These assess if the generated text (the didactic paragraph) is supported by the retrieved context (context_string).
Methods: This is an active research area. Techniques often involve using another LLM or specialised models to check for factual consistency, contradiction, or hallucination relative to the provided source documents. Frameworks like RAGAs or tools within LangChain / LlamaIndex sometimes offer components for this. This is more complex to set up than simple rule-based checks.
These try to measure if different parts of the output are semantically related as intended.
Embedding similarity: Calculate the cosine similarity between the embeddings of the following pairs (see the sketch after this list):
The retrieved context_string and the generated didactic paragraph.
The didactic paragraph and the generated exercise prompt.
The image_analysis_text and the generated follow-on suggestions.
High similarity doesn't always guarantee good or creative output, just topical relevance.
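A sketch of such an embedding-similarity check, assuming the same embedding model used for RAG; the function names are illustrative:

```python
import numpy as np

import google.generativeai as genai

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def topical_relevance(text_a: str, text_b: str) -> float:
    """Embed two texts and return their cosine similarity as a rough relevance proxy."""
    emb_a, emb_b = (
        genai.embed_content(model="models/text-embedding-004", content=t)["embedding"]
        for t in (text_a, text_b)
    )
    return cosine_similarity(emb_a, emb_b)

# e.g. topical_relevance(context_string, didactic_paragraph)
```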
LLM-as-a-Judge (Relevance):
Ask a powerful LLM (like Gemini Pro/Advanced or GPT-4) to rate the relevance between the pairs mentioned above (e.g., "On a scale of 1-5, how relevant is this exercise prompt to the preceding didactic paragraph?").
Metrics: Flesch-Kincaid, Gunning Fog, etc.
Use: Check if the language complexity aligns with the desired "concise" persona.
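A minimal readability check, assuming the textstat package; the grade-level threshold is illustrative:

```python
import textstat

def readability_ok(text: str, max_grade: float = 10.0) -> bool:
    """Flag outputs whose Flesch-Kincaid grade level drifts above the 'concise coach' target."""
    return textstat.flesch_kincaid_grade(text) <= max_grade
```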
Classifiers: Use pre-built models to check if the output contains harmful, biased, or inappropriate content (important safeguard, though less likely an issue for this topic).
This is increasingly common and powerful. You use a capable LLM (like Gemini Pro/Advanced, GPT-4) as an impartial evaluator. You provide it with the input (e.g., retrieved context + request), the generated output (e.g., the didactic info + prompt), and a detailed rubric defining criteria like:
Adherence to persona (Is it concise? Coach-like?)
Creativity (Is the exercise interesting?)
Helpfulness (Is the analysis/suggestion useful?)
Relevance (Does it relate to context/analysis?)
Clarity / coherence
Factuality / grounding (Did the didactic part reflect the context?)
Output: The evaluator LLM provides scores and often textual justifications based on the rubric.
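A sketch of an LLM-as-a-Judge eval along those lines; the rubric wording, output format and judge model are illustrative:

```python
import google.generativeai as genai

judge = genai.GenerativeModel("gemini-1.5-pro")

def judge_output(retrieved_context: str, generated_output: str) -> str:
    """Ask a stronger model to score a generated output against a simple rubric."""
    rubric = (
        "Rate the OUTPUT from 1-5 on each criterion, with a one-sentence justification: "
        "persona adherence (concise, coach-like), creativity, helpfulness, relevance "
        "to the CONTEXT, clarity, and grounding in the CONTEXT. Reply as JSON."
    )
    response = judge.generate_content(
        f"{rubric}\n\nCONTEXT:\n{retrieved_context}\n\nOUTPUT:\n{generated_output}"
    )
    return response.text
```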
Gratifyingly for me as a user researcher, Gemini describes Human Evaluation as "the gold standard for assessing the true usefulness, creativity, and user experience of the Art Coach" (Gemini 2.5 Pro on 16/4/25). Initially, I can test the prototype myself, but ideally, artists should be consulted in developing the tool. In fact, if the product is developed "by artists for artists" (participatory design) and a clear and transparent feedback loop exists, this will improve uptake and usage (adoption) of the tool.
Questions could be asked of our user demographic in various ways, quantitative and qualitative, online and off. The following are possible starting points for user research.
How do users deal with creative block or knowing what to work on next?
What do artists generally think of AI?
Do artists see a use for the tool?
Would they consider paying for it? (And if not, how could it be monetised?)
a) Prompt / exercise suggestion
Are the prompts interesting?
Do they hit the right tone?
Is the didactic paragraph useful and related to the prompt?
b) Analysis
Is the analysis useful? Do you want it to be more or less judgmental?
What else would you want to ask the LLM?
c) Follow-on exercises
Are these interesting, and would you want to do them?
Did you progress your work?
Did you learn something?
Would you use it again?
Having used it, would you consider paying or donating to use it further?
The end goal of our mixed-initiative interface is to improve the artist's practice and be a useful tool for making better art. All evaluations, user-based or automatic, should have this endpoint in mind. In some ways, a mysterious and surreal initial prompt could be as useful as a factual and grounded one, and this should be borne in mind when assessing outputs.
The artist prompts used in the few-shot examples are adapted from "Ways of Drawing: Artists' Perspectives and Practices" (Bell, Julian, Julia Balchin, and Claudia Tobin, eds. Thames and Hudson, 2023).
The curated texts are available to borrow from The Internet Archive and are used in my notebook non-commercially. They are as follows:
•Gill, Robert W. (1991). Basic Rendering - Effective Drawing for Designers, Artists and Illustrators. Thames and Hudson Ltd.
•Ringold, Francine, & Rugh, Madeline. (1989). Making Your Own Mark: A Drawing & Writing Guide for Senior Citizens. Council Oak Books.
•Dodson, Bert. (1985). Keys to Drawing (First paperback printing 1990). North Light Books.
•Norling, Ernest. Perspective Made Easy. Dover Publications. (Note: The excerpts from this source do not provide a specific publication year, but it is published by Dover Publications).
•Edwards, Betty. (1999). The New Drawing on the Right Side of the Brain. J.P. Tarcher, Inc.
•Loomis, Andrew. (1947). Creative Illustration. The Viking Press.
•Loomis, Andrew. (1956). Drawing the Head & Hands. The Viking Press. (The copyright indicates 1956).
•Loomis, Andrew. Successful Drawing. The Viking Press. (Note: The excerpts from this source do not provide a specific publication year, but The Viking Press is the publisher for other books by this author in the provided sources).
•Kandinsky, Wassily. Point and Line to Plane. (Note: The excerpts from this source do not provide a specific publisher or publication year. It references the author's book "Über das Geistige in der Kunst", published in 1912).