Add automatic Downstream Evaluation to Modalities - Simplified Implementation#455
Add automatic Downstream Evaluation to Modalities - Simplified Implementation#455CYHSM wants to merge 1 commit into
Conversation
le1nux
left a comment
There was a problem hiding this comment.
I think this is a good draft but we need to consider a few architectural / workflow adaptations for large-scale setups.
When we train large scale models the required training GPUs exceed the required eval GPUs by far. Also the training typically utilized most of the GPU memory, leaving no memory for evals potentially.
That's why I would not include the downstream eval into the training / eval loop of modalities and decouple this instead.
HF checkpoint conversion and downstream eval should run on dedicated, non-training nodes as new checkpoints are written out during the training.
So a better way in my opinion would be to have a CP conversion and eval service running in the background, polling e.g., every 60s for new checkpoints including explicit sentinel.
CP conversion and downstream eval can be CMD entry points within modalities.
| mfu_calculator: PydanticMFUCalculatorABCType | None = None | ||
| scheduled_pipeline: PydanticPipelineType | None = None | ||
| device_mesh: PydanticDeviceMeshIFType | None = None | ||
| downstream_evaluator: Optional[PydanticDownstreamEvaluatorType] = None |
There was a problem hiding this comment.
| downstream_evaluator: Optional[PydanticDownstreamEvaluatorType] = None | |
| downstream_evaluator: PydanticDownstreamEvaluatorType | None = None |
| ) | ||
|
|
||
|
|
||
| class DownstreamEvaluator: |
There was a problem hiding this comment.
maybe we should have this implement an EvaluationIF interface
| cmd = self.olmes_command_template.format( | ||
| hf_model_dir=str(hf_model_dir), |
There was a problem hiding this comment.
I think running eval on the same node(s) as the training is unrealistic. Typically especially for larger models GPU memory headroom for evaluation is rather limited.
| logger.error(f"Failed to launch downstream evaluation: {e}") | ||
|
|
||
| def wait_for_evaluations(self) -> None: | ||
| if not hasattr(self, "active_processes") or not self.active_processes: |
There was a problem hiding this comment.
why so complicated? It's an instance variable.
| if not hasattr(self, "active_processes") or not self.active_processes: | |
| if len(self.active_processes) == 0: |
| if getattr(components, "downstream_evaluator", None) is not None: | ||
| print_rank_0("\n" + "="*80) | ||
| print_rank_0("Training loop complete! Waiting for background evaluations to finish...") | ||
| print_rank_0("="*80 + "\n") | ||
| components.downstream_evaluator.wait_for_evaluations() | ||
| print_rank_0("All background evaluations completed successfully!") | ||
|
|
There was a problem hiding this comment.
as written in the overall feedback, I think we should exclude evaluation from the eval loop.
What does this PR do?
This PR is the bare version of #450 which works with a converted HF checkpoint. For this to work we need to first merge the conversion code from #424 and then merge this on top.
In #450 it integrates the conversion + the evaluation but @le1nux suggested to first merge @BlueCrescent PR. For more details see original PR.
Checklist before submitting final PR
python tests/tests.py)CHANGELOG_DEV.md)