There are many tools for doing evals.
I used ell and braintrust together for fun and disaster.
The integration is actually not terrible, though I’m not 100% sure they’re an obvious pair to link together.
It seems ell is striving to build its own eval capabilities as well.