There are many tools for running evals. I used ell and braintrust together, for fun and disaster. The integration is actually not terrible, though I’m not 100% sure they’re an obvious pair to link together. ell also seems to be building out its own eval capabilities.