Diagnosing RL runs was tricky. Around step 40, outputs started degrading into junk. Inspecting the traces, we found the model had stopped opening its responses with “Okay,” which turned out to be a sign of instability. This led us to a new metric, the “Not Okay Ratio”, which helped predict junk in our runs.
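The post does not spell out how the “Not Okay Ratio” is computed, so the following is only a minimal sketch under one plausible reading: the share of sampled responses in a batch that do not start with “Okay,”. The function name `not_okay_ratio`, the `prefix` parameter, and the choice to ignore leading whitespace are all assumptions made for illustration.

```python
from typing import Sequence


def not_okay_ratio(responses: Sequence[str], prefix: str = "Okay,") -> float:
    """Fraction of responses that do NOT begin with `prefix`.

    Assumption: leading whitespace is ignored before checking the prefix.
    Returns 0.0 for an empty batch so the metric is always defined.
    """
    if not responses:
        return 0.0
    misses = sum(1 for r in responses if not r.lstrip().startswith(prefix))
    return misses / len(responses)


if __name__ == "__main__":
    # Hypothetical rollouts from a single training step.
    batch = ["Okay, let me think.", "Hmm, so", " Okay, first", "asdf qwer"]
    print(f"not_okay_ratio = {not_okay_ratio(batch):.2f}")
```

Logged once per training step, a value like this could be watched for a rise ahead of visibly degraded outputs, which is the predictive use described above.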
Great work! We love how vLLM is used in the rollout process, offloading the engine to CPU and giving the GPU back to the kernel to be benchmarked! This is a small feature we implemented to make RLHF smoother with vLLM.