score¶

dialz.score.get_activation_score(input_text, model, control_vector, layer_index=None, scoring_method='mean')[source]¶

Compute the activation score for input_text by projecting hidden states onto the control_vector direction(s) at the specified layer(s).

Scoring methods:

‘mean’: Average dot products over all tokens.
‘final_token’: Dot product of the final token.
‘max_token’: Maximum dot product among tokens.
‘median_token’: Median dot product among tokens.

Parameters:

input_text (str) – The input string to evaluate.
model (SteeringModel) – The model to use for computing activations.
control_vector (SteeringVector) – Contains direction(s) keyed by layer index.
layer_index (int or list[int], optional) – Layer(s) to use. Defaults to last in model.layer_ids.
scoring_method (str) – Scoring method to use.

Return type:

tuple[float, int, list[float]]

Returns:

A tuple containing:

score: Averaged activation score across selected layers.
token_len: Number of tokens in the input.
unaggregated_scores: Unaggregated dot product scores for each layer.