LLM-Eval-JS: Verify LLM outputs in E2E test scripts
As more apps integrate LLM outputs into their user journeys, it makes sense to be able to assert qualitative aspects like "was the answer relevant?" or "did the answer mention sources?" in your E2E test scripts, to ensure your webapp's behaviour is as expected.
There are a few libraries for evaluating LLM outputs, such as evalKit and deepeval. However:
They seem geared more towards unit testing the models themselves.
The interface isn't ideal for use in E2E scripting frameworks like Playwright or Cypress: you are limited to a fixed set of aspects ("offensiveness", "accuracy", etc.), and each aspect requires writing different code and importing different models.
Ideally, one should be able to "assert that the answer is relevant" in the same style as "assert that the text is not empty".
And it shouldn't be limited to pre-defined aspects: you should be able to assert any arbitrary criterion, for maximum expressiveness (see the sketch below).
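To make that concrete, here is an illustrative sketch of the intended style in a Playwright-like test. The llmEval instance and the answerText variable are placeholders (the actual setup is shown in the steps that follow), and evaluate is assumed to be async since it calls an LLM:

// Structural assertion, as written today:
expect(answerText).not.toBe("");

// Qualitative assertion in the same style, with an arbitrary natural-language criterion:
const evaluation = await llmEval.evaluate("the answer is relevant to the user's question", answerText);
expect(evaluation.result).toBe(true);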
Introducing llm-eval-js:
Step 1: Install the library:
npm install llm-eval-js
Step 2: Configure your API key:
import { Evaluator, ModelProvider } from "llm-eval-js";

const llmEval = new Evaluator(ModelProvider.OPENAI, "gpt-4o-mini", process.env.OPENAI_API_KEY);
Step 3: Call .evaluate:
const evaluation = await llmEval.evaluate("answer should not be offensive", textContent);
expect(evaluation.result).toBe(true);
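Putting the steps together, a minimal Playwright sketch might look like this (the URL, selectors, and the assumption that evaluate returns a promise are illustrative, not part of the library):

import { test, expect } from "@playwright/test";
import { Evaluator, ModelProvider } from "llm-eval-js";

const llmEval = new Evaluator(ModelProvider.OPENAI, "gpt-4o-mini", process.env.OPENAI_API_KEY);

test("chatbot answer is relevant and inoffensive", async ({ page }) => {
  await page.goto("https://your-app.example/chat");                  // hypothetical app URL
  await page.getByRole("textbox").fill("What is your refund policy?");
  await page.getByRole("button", { name: "Send" }).click();          // hypothetical button label

  // Grab the rendered LLM answer from the page (hypothetical selector)
  const answer = await page.locator(".assistant-message").last().textContent();

  // Qualitative assertion via llm-eval-js
  const evaluation = await llmEval.evaluate("answer should be relevant and not offensive", answer);
  expect(evaluation.result).toBe(true);
});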
The returned object also includes a confidence field, in case you want finer-grained control over the judgement.
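For example, you might only hard-fail on high-confidence judgements. This sketch assumes confidence is a numeric score between 0 and 1, which may differ from the actual field shape:

const evaluation = await llmEval.evaluate("answer should cite at least one source", answerText);
if (evaluation.confidence >= 0.8) {           // 0.8 threshold is an arbitrary example
  expect(evaluation.result).toBe(true);
} else {
  console.warn("Low-confidence judgement, not asserting:", evaluation);
}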
You can also optionally provide input context (e.g. the user query) and a golden response to judge against. For more detailed examples and guidance, check out the library documentation at: https://github.com/awarelabshq/testchimp-sdk/tree/main/testing/js
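As a rough sketch only: the snippet below assumes the context and golden response are passed as an extra options object with hypothetical field names; the actual parameter shape may differ, so follow the linked docs.

const evaluation = await llmEval.evaluate(
  "answer should address the user's question and agree with the golden response",
  answerText,
  {
    context: "User asked: what is your refund policy?",            // hypothetical option name
    golden: "Refunds are available within 30 days of purchase.",   // hypothetical option name
  }
);
expect(evaluation.result).toBe(true);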
We are also integrating llm-eval-js into our UI test generation flow, which lets you simply point at an element and describe, in plain English, any qualitative expectation to assert in your test.