Today, the EvalEval Coalition launched EveryEvalEver: a standardized metadata schema for AI evaluation results, built with input from NIST CAISI, Hugging Face, IBM Research, Noma Security, Trustible, and others. It’s a JSON schema, a set of validation tools, and a growing public dataset designed to make evaluation results interoperable, comparable, and resistant to gaming.

We’re proud to be part of this effort. Here’s why it matters for AI security.

The Evaluation Trust Gap Is a Security Problem

Enterprise security teams are making deployment decisions based on evaluation results every day. Which model is safe enough to put in front of customers? Which agent framework introduces the least risk? Which vendor’s claims hold up under scrutiny?

The honest answer, too often, is that nobody knows. Not because the evaluations don’t exist, but because they can’t be meaningfully compared.

One vendor reports a five-shot score from a custom harness. Another pulls numbers from a leaderboard that doesn’t disclose generation parameters. A third runs the same benchmark but with an optimized prompt template that inflates results by 10+ points. The headline numbers look comparable. They aren’t.

This isn’t a theoretical concern. The EvalEval Coalition’s research found that first-party evaluation reporting is sparse, often superficial, and declining in key areas. When the organizations building AI models are also the ones grading their own homework, and when the grading methodology is opaque, security teams are flying blind.

 

Why Noma Contributed

Our work at Noma centers on giving security teams visibility and control over AI systems they didn’t build and don’t fully understand, as well as continuous red teaming to ensure consistent evaluations. That mission extends naturally to the evaluation infrastructure those systems depend on.

We contributed to EveryEvalEver because standardized evaluation metadata serves the same function as a blast radius map: it makes hidden dependencies and assumptions visible so that decisions can be made on solid ground.

The three aspects of the schema:

Evaluator relationship tracking. The schema explicitly records whether a result is first-party, third-party, or collaborative. This is the kind of provenance data that security teams need to assess credibility. Self-reported scores and independently verified scores should never be treated as equivalent, and now there’s a structured way to distinguish them.

Generation configuration transparency. Temperature, sampling parameters, prompt templates, inference engines: these are the hidden variables that can make the difference between a model that looks safe and one that actually is. Recording them as structured metadata means security teams can ask the right questions instead of accepting numbers at face value.

Instance-level data for failure analysis. Aggregate scores tell you how a model performed overall. Instance-level records tell you where it failed. For security teams evaluating AI agents, the failure distribution matters far more than the average. An agent that scores 95% overall but fails catastrophically on permission-boundary questions is not a 95%-safe agent.

What This Means for AI Agent Security

The evaluation landscape for agentic AI is still nascent, and that’s precisely why getting the infrastructure right now matters. EveryEvalEver already supports multi-turn interactions, tool call metadata, and agentic evaluation traces. As the community builds more rigorous evaluations for agent behavior (excessive capabilities, permissions and autonomy), having a standard format for recording and comparing those results will be essential.

This connects directly to the work we’re doing at Noma around agent discovery and security posture management. You can’t secure what you can’t see, and you can’t evaluate what you can’t compare. Our evaluations in red teaming are following the same transparency principles to ensure that evaluations are standardized and transparent. 

An Invitation

EveryEvalEver is open source and open to contributions. If you’re building AI evaluations, running benchmarks, or making deployment decisions based on evaluation results, we encourage you to explore the schema and consider adopting it.

The AI security community has a shared interest in evaluation integrity. When eval results are trustworthy, security teams can do their jobs. When they aren’t, everyone is guessing.

We’d rather not guess.

Noma Security is a contributor to the EvalEval Coalition’s EveryEvalEver initiative. Learn more and explore the schema and tooling on Github.

5 min read

Category:

Table of Contents

Share this: