Skip to content

Add automatic Downstream Evaluation to Modalities - Simplified Implementation#455

Open
CYHSM wants to merge 1 commit into
Modalities:mainfrom
CYHSM:downstream-evaluator-simple
Open

Add automatic Downstream Evaluation to Modalities - Simplified Implementation#455
CYHSM wants to merge 1 commit into
Modalities:mainfrom
CYHSM:downstream-evaluator-simple

Conversation

@CYHSM

@CYHSM CYHSM commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR is the bare version of #450 which works with a converted HF checkpoint. For this to work we need to first merge the conversion code from #424 and then merge this on top.

In #450 it integrates the conversion + the evaluation but @le1nux suggested to first merge @BlueCrescent PR. For more details see original PR.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good draft but we need to consider a few architectural / workflow adaptations for large-scale setups.
When we train large scale models the required training GPUs exceed the required eval GPUs by far. Also the training typically utilized most of the GPU memory, leaving no memory for evals potentially.

That's why I would not include the downstream eval into the training / eval loop of modalities and decouple this instead.

HF checkpoint conversion and downstream eval should run on dedicated, non-training nodes as new checkpoints are written out during the training.

So a better way in my opinion would be to have a CP conversion and eval service running in the background, polling e.g., every 60s for new checkpoints including explicit sentinel.

CP conversion and downstream eval can be CMD entry points within modalities.

mfu_calculator: PydanticMFUCalculatorABCType | None = None
scheduled_pipeline: PydanticPipelineType | None = None
device_mesh: PydanticDeviceMeshIFType | None = None
downstream_evaluator: Optional[PydanticDownstreamEvaluatorType] = None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
downstream_evaluator: Optional[PydanticDownstreamEvaluatorType] = None
downstream_evaluator: PydanticDownstreamEvaluatorType | None = None

)


class DownstreamEvaluator:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should have this implement an EvaluationIF interface

Comment on lines +250 to +251
cmd = self.olmes_command_template.format(
hf_model_dir=str(hf_model_dir),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think running eval on the same node(s) as the training is unrealistic. Typically especially for larger models GPU memory headroom for evaluation is rather limited.

logger.error(f"Failed to launch downstream evaluation: {e}")

def wait_for_evaluations(self) -> None:
if not hasattr(self, "active_processes") or not self.active_processes:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why so complicated? It's an instance variable.

Suggested change
if not hasattr(self, "active_processes") or not self.active_processes:
if len(self.active_processes) == 0:

Comment thread src/modalities/main.py
Comment on lines +239 to +245
if getattr(components, "downstream_evaluator", None) is not None:
print_rank_0("\n" + "="*80)
print_rank_0("Training loop complete! Waiting for background evaluations to finish...")
print_rank_0("="*80 + "\n")
components.downstream_evaluator.wait_for_evaluations()
print_rank_0("All background evaluations completed successfully!")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as written in the overall feedback, I think we should exclude evaluation from the eval loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants