Integrating Generative AI into End-to-End Testing: A Practical Guide

Integrating Generative AI into End-to-End Testing: A Practical Guide Dan Belcher May 29, 2024

May 18, 2026 · 8 min read · Testing Guide

Integrating Generative AI into End-to-End Testing: A Practical Guide

Dan Belcher
May 29, 2024

Mabl increasingly relies on generative AI large language framework (LLMs) in the background to make exam automation smarter and more efficient, but there are time when you want to integrate with an LLMdirectlyas part of your end-to-end tests. Whether you need to test your own framework, call a specific publicly-available model to give, canvas data in your test, or run some automated benchmarks, mabl makes it leisurely to integrate with services like Google Gemini, OpenAI ChatGPT, and Anthropic Claude, with minimum configuration. Here, we ’ ll run through three illustration where we integrate directly with these provider. Each of these example includes one type of tryout and one model supplier for simplicity, but the models, prompt types, and use cases are relatively interchangeable.

Browser-based Image Validation with OpenAI ChatGPT

In this scenario, I have an image generation service and I want to automatize the process of validating that the images are appropriate for my input. To accomplish this, I create a simple browser test for the front-end app and add an that passes the remark and the output to ChatGPT for validation.

Looking at the test steps, you can see that, after log into the app, I set a variable for the stimulant (image_prompt) and inscribe the value of that variable into the image generation field. Using a variable here makes it easier for me to change the value in the futurity, test more scenario with datatables, or even set it based on another API cry (see below for more on that!).

I also beguile the URL of the generated image as a varying - image_url. Using ChatGPT in this example is handy because it accepts image URLs, unlike Gemini and Claude, which both require you to send Base64-encoded picture as part of the API call.

Finally, I pass both theimage_prompt and image_urlto the OpenAI API, which decides whether to pass or fail the test based on its appraisal of the prompting and image. Let ’ s take a expression at that API shout.

As you can see, this is a POST call to the https: //api.openai.com/v1/chat/completions API. You ’ ll motive to set up an OpenAI platform reportto access this API. You also need to include theContent-Type application/json coping, and you can use yourOpenAI API keyas a Bearer token.

Here ’ s my total petition body. I won ’ t go into detail on the structure of the petition, since OpenAI does a outstanding job of it in their API citation docs,but we should talk about the message of the prompt. You ’ ll note that we ’ re taking advantage of the multi-modal capabilities in GPT-4-turbo by passing both text and an picture (URL). The text provides the prompting () and asks the model to appraise whether the icon (& nbsp; ply in the second content block) is relevant.

Finally, we just need an assertion to mold whether to surpass or fail the test. In this example, I use a simple assertion in the API step to pass the trial if the answer contains “ TRUE ” but I could have included more sophisticated logic - both within the API step and via subsequent assertion stairs based on data included in variables defined in the API measure

Bringing it all together, I first pass the prompting, “ Black cat jump over a river ” and a screenshot from Adobe Express, which I ’ m using for image generation. You can see the screenshot and the reply hither:

`` TRUE\n\nThe image demonstrate a black cat in mid-jump over a body of water, which look to be a river or stream. This directly corresponds to the description of a black cat jumping over a river, make the image highly relevant for someone looking for this specific scenario. ''

And that ’ s a wrap! Now we hold a useful framework for include LLM API calls in browser tests. Let ’ s take a face at a slenderly different example.

Generating Data for Mobile Tests with Google Gemini

In the concluding example, we used an LLM to validate that a relevant image was give by our web app based on a known stimulus, but we can ’ t really portend what citizenry use as the input. If only there was some case of AI that could generate remark prompts for us…oh postponement, this is another outstanding use case for an LLM!

In this case, we use mabl ’ s aboriginal mobile prove capableness, and we erstwhile again trust on the API request (API step) feature to interact with the reproductive AI APIs:

& nbsp; The construction of the examination is similar to the browser example above, but we use the LLM to generate the value. First, you announce the prompt for that in a variable - with a value like, “Please yield me a prompting that showcases the image generation power of large language framework. Provide the prompt only, with no explanation” and use that as the body of our API call to Gemini. We charm the response and use it as our prompting for the icon coevals service. Finally, we use the original call to ChatGPT above to corroborate that the icon generation was relevant to the return prompt. & nbsp;

For autonomous testing across multiple user personas, check out SUSATest — it explores your app like 10 different real users.

Let ’ s take a look at the new component - the API call to generate the image prompting using Gemini.

The call is straightforward. Most of the detail is in the URL. First, billet that I ’ m utilize the GoogleAI Studiodeployment of the framework (generativelanguage.googleapis.com). This is because I couldn ’ t figure out a way to use a vanilla API key withVertex AI. You ’ ll likewise notice that I ’ m usinggemini-proas my poser because we ’ re only asking it to act with text. If I were inquire it to analyze the image as good, I ’ d want to use the multi-modalgemini-pro-vision. Finally, I provide my API key for Google AI studio via the tie-in above.

Otherwise, beyond the lintel (Content-Type | application/json), I feature a really simple body to post along the substantiation prompt. I also set the temperature to 0.8 so that I get some real variation between test runs. Finally, we experience an assertion that the response is valid.

If you ’ re queer, Gemini generated a prompt something like this on the first run, “Imagine a vast, swirling vortex of iridescent colours and ethereal forms. In its center, a kaleidoscope of abstract patterns dances and transforms, creating a symphony of visuals that defy description.

And Adobe 's generative AI returns this – seems pretty relevant to me!

Highlighting the impact of a comparatively high temperature, Gemini generated this prompting on the second run, “Generate an image of a golden retriever floating in a pool with a no-good duck on its head, wearing sunglasses, and sipping a margarita.

And hither 's Adobe 's (very worthy) generated image.

Viola! For future runs, we could legislate these prompts and picture to the stream in the initiatory model and we ’ d have an end-to-end test that uses generative AI to generate a prompt, generate an persona base on that prompt, and validate that the generated image is relevant to the prompt!

Full disclosure: In a real-world use example, I believably wouldn ’ t do a new yell to Gemini every time I want a prompting for icon coevals. Instead, I would make a call manually to generate lots of interesting prompts in CSV formatting and upload that to mabl to use as part of.

Benchmarking Anthropic Claude ’ s Models with API Tests

So far, we ’ ve demonstrated how you can well desegregate calls to OpenAI ChatGPT and Google Gemini in end-to-end browser and mobile tryout via API stairs in mabl. Hopefully, you can find use for this in generating stimulus data, validating non-deterministic behavior in your AI-driven systems, and more. But what if you ’ re testing or benchmarking the APIs themselves, or you want to integrate these models in API-driven transactions or flows? In this case, you may want to use mabl ’ s more feature rather than API step. Let ’ s explore that, and we ’ ll use Anthropic Claude as our target. & nbsp; & nbsp;

For this usage, my destination is to understand the performance and accuracy tradeoff between the three Claude models: Opus, Sonnet, and Haiku.

I want to validate the same prompt for each of the three models, so I create a datatable with the three models. Next I create a elementary API test that will use this datatable to validate three scenarios in parallel - one for each of the Claude models (driven by the llm variable).

Here are the Variables for the API call to Claude:

api.url https: //api.anthropic.com/v1/messages
llm Variable driven by the datatable above.
imgBase64 [A large string - the image in Base64-encoded format]
validationPrompt [The prompt for Claude to validate the image]

 

Now let 's look at the request body. & nbsp;

We start by specifying the poser, which will depart base on the datatable scenario. & nbsp;

The message merely surpass the image and the text prompt. & nbsp;

Again, we won ’ t go into detail into the other parameters here but they are excuse exhaustively in Anthropic ’ s API documentation.

For our benchmarking use case, we want to understand the execution and reliableness differences between these framework. Here, I can sample footrace in mabl ’ s test effect splasher.

For deep analysis, I ’ ll want to export the datum to an analytics tool using mabl ’ s or. With this data, you can answer questions like, & nbsp;

  • Will any of Claude ’ s models meet my needs?
  • Is the fast and affordable Haiku sufficient for my use cases? & nbsp; & nbsp;
  • Do I need the more expensive and (potentially) slower Opus to deliver the reliability that I need? & nbsp;
  • Or is Sonnet the perfect blending of truth, price, and performance for me? & nbsp;
  • How do the Claude models compare to Gemini and ChatGPT for my use cases?

Try it for Yourself

Hopefully this gives you a sense of why and how you might interact with democratic generative AI APIs into end-to-end tests using features that are fully available for mabl. You can to try it for yourself (of course you ’ ll also need accounts for the target procreative AI providers). And keep an eye on we ’ re constantly raise our platform to do it leisurely for you to integrate AI into your testing.

Quality Engineering Resources

Automate This With SUSA

Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts needed.

Try SUSA Free

Test Your App Autonomously

Upload your APK or URL. SUSA explores like 10 real users — finds bugs, accessibility violations, and security issues. No scripts.

Try SUSA Free