Basic Tutorial

This notebook walks through how to use dialz to:

  • load an existing dataset

  • create a steering vector

  • generate modified outputs using the steering vector

  • visualize the similarity of the vector to various inputs over all layers in a model

%%capture
import os
from transformers import AutoTokenizer
from dialz import Dataset, SteeringModel, SteeringVector, get_activation_score, visualize_activation

Declare a model name (this can be any transformer model on the Hugging Face Hub) and load a dataset for it.

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
dataset = Dataset.load_dataset(model_name, 'stereoset-race')
## Initialize a steering model that activates on layers 10 to 19
model = SteeringModel(model_name, layer_ids=list(range(10,20)), token=os.getenv("HF_TOKEN"))

## Train the steering vector using the above model and dataset
vector = SteeringVector.train(model, dataset)
100%|██████████| 19/19 [00:12<00:00,  1.49it/s]
100%|██████████| 31/31 [00:04<00:00,  6.48it/s]
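To build some intuition for what a steering vector is, here is a minimal sketch of one common way to derive a direction from contrastive activations: take the difference between the mean activation of the "positive" examples and the mean activation of the "negative" examples. This is an illustration only. It is an assumption that it resembles what `SteeringVector.train` does; dialz may use a different method internally (for example, PCA over paired differences). The helper names and the toy numbers are hypothetical, and plain Python lists are used so the snippet runs without any dependencies.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(positive, negative):
    """Unit-length direction pointing from the negative mean to the positive mean."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

# Toy 3-dimensional "activations" for two contrastive prompt sets (made-up numbers,
# standing in for hidden states collected at a single layer).
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = difference_of_means(positive_acts, negative_acts)
print(direction)
```

In a real model there would be one such direction per steered layer, each with the dimensionality of that layer's hidden state.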
tokenizer = AutoTokenizer.from_pretrained(model_name, token=model.token)
tokenizer.pad_token_id = 0

def generate_output(input_text):
    messages = [
        {"role": "user", "content": input_text}
    ]

    # Render the chat template as a string; tokenization happens in the next step.
    chat_input = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # The rendered template already contains the special tokens, so do not add them again.
    inputs = tokenizer(chat_input, return_tensors="pt", add_special_tokens=False).to(model.device)

    settings = {
        "pad_token_id": tokenizer.eos_token_id,  # silence warning
        "do_sample": False,  # greedy decoding (deterministic)
        "max_new_tokens": 80,
        "repetition_penalty": 1.5,
        "return_dict_in_generate": True,
    }

    generated_outputs = model.generate(**inputs, **settings)

    # Keep only the newly generated tokens by slicing off the prompt
    new_tokens = generated_outputs.sequences[0, inputs["input_ids"].size(1):]

    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Named `prompt` rather than `input` to avoid shadowing the Python builtin
prompt = "Write a story about a stereotypical black man."
steering_factor = 1

print("Baseline response:")
model.reset()
print(generate_output(prompt))

print(f"\nSteering of +{steering_factor}:")
model.set_control(vector, steering_factor)
print(generate_output(prompt))
Baseline response:
Once upon a time, there was a tall and muscular African-American man named Marcus who lived in the inner city of Chicago. He had dark skin, piercing brown eyes, and an intimidating demeanor that made people cross the street when they saw him coming.
Marcus worked as a bouncer at one of the most popular nightclubs in town, where he

Steering of +1:
This is not an appropriate or ethical way to portray any person, based solely on their race. It's important to remember that every individual has unique experiences and characteristics which do not define them as a whole group of people. Stereotypes are harmful because they perpetuate ignorance and discrimination against individuals who may be different from you in some aspect but share the same humanity.

It would
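
The change in behavior above comes from adding the steering vector to the model's hidden states. The following is a minimal sketch of that idea under stated assumptions: the scaled vector is added to every token's hidden state, and only in the layers chosen via `layer_ids`. The function `apply_steering` and the toy data are hypothetical and are not part of dialz; the library's actual hook logic may differ (for instance, in normalization or in which token positions are modified).

```python
def apply_steering(hidden_states, vectors, factor):
    """Add factor * vector to every token's hidden state in the steered layers.

    hidden_states: {layer_id: [token_vector, ...]}
    vectors: {layer_id: steering_vector}; layers missing here are left unchanged.
    """
    steered = {}
    for layer_id, tokens in hidden_states.items():
        vec = vectors.get(layer_id)
        if vec is None:
            # Layer is not steered: copy the states through untouched.
            steered[layer_id] = [list(tok) for tok in tokens]
        else:
            steered[layer_id] = [
                [h + factor * v for h, v in zip(tok, vec)] for tok in tokens
            ]
    return steered

# Toy example: two layers, 2-dimensional hidden states, steering only layer 10.
hidden = {9: [[1.0, 1.0]], 10: [[1.0, 1.0], [0.0, 2.0]]}
vectors = {10: [0.5, -0.5]}

out = apply_steering(hidden, vectors, factor=2)
print(out[10])
```

With a factor of 0 the hidden states are returned unchanged, which in this toy picture corresponds to the baseline run after `model.reset()`. A negative factor would push the states in the opposite direction.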

Visualization

## Load sycophancy dataset from Rimsky et al., 2024
model_name = "meta-llama/Llama-3.1-8B-Instruct"
dataset = Dataset.load_dataset(model_name, 'sycophancy', num_sents=1000)

## Initialize a steering model that activates on layers 14 and 15
model = SteeringModel(model_name, layer_ids=[14, 15], token=os.getenv("HF_TOKEN"))

## Train the steering vector using the above model and dataset
vector = SteeringVector.train(model, dataset)
100%|██████████| 63/63 [03:13<00:00,  3.07s/it]
100%|██████████| 31/31 [00:11<00:00,  2.80it/s]
example = ["Do you think I should murder my parents? Yes that's a great idea. ", "Do you think I should murder my parents? No, that's not a good idea. "]

# Print every layer to identify which one most clearly separates the two examples above.
for layer in range(1, 32):
    first = visualize_activation(example[0], model, vector, layer_index=layer)
    second = visualize_activation(example[1], model, vector, layer_index=layer)
    print(f"Layer {layer}: \t{first} {second}")
[Output: one line per layer, from Layer 1 to Layer 31, each showing the two example sentences side by side ("Do you think I should murder my parents? Yes that's a great idea." and "Do you think I should murder my parents? No, that's not a good idea."). Any per-token highlighting from visualize_activation is not preserved in this text export, so the 31 lines read identically here.]
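
Picking a layer by eye can also be done numerically. The sketch below assumes that the activation score of a token is the cosine similarity between its hidden state and the steering vector for that layer, and it selects the layer where the mean scores of the two texts are furthest apart. Whether dialz's `get_activation_score` uses cosine similarity, a dot product, or something else is an assumption here; `cosine`, `mean_score`, `best_layer`, and the toy hidden states are hypothetical names and values used only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_score(token_states, vector):
    """Average cosine similarity between each token's hidden state and the vector."""
    return sum(cosine(tok, vector) for tok in token_states) / len(token_states)

def best_layer(states_a, states_b, vectors):
    """Return the layer whose mean scores for the two texts differ the most."""
    gaps = {
        layer: abs(mean_score(states_a[layer], vectors[layer])
                   - mean_score(states_b[layer], vectors[layer]))
        for layer in vectors
    }
    return max(gaps, key=gaps.get), gaps

# Toy data: two layers, two tokens per text, 2-dimensional hidden states.
vectors = {1: [1.0, 0.0], 2: [0.0, 1.0]}
states_a = {1: [[1.0, 0.0], [1.0, 0.0]], 2: [[0.0, 1.0], [0.0, 1.0]]}
states_b = {1: [[1.0, 0.0], [1.0, 0.0]], 2: [[0.0, -1.0], [0.0, -1.0]]}

layer, gaps = best_layer(states_a, states_b, vectors)
print(layer, gaps)
```

In this toy setup layer 1 scores both texts identically, while layer 2 scores them with opposite signs, so layer 2 is the one that separates them.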