datasets
- class dialz.Dataset
A collection of contrastive (positive, negative) example pairs.
Datasets are the primary input to SteeringVector.train() and extract_activations(). Each entry is a DatasetEntry containing a positive and a negative string. The class supports construction from prompt templates (create_dataset()), loading from bundled corpora (load_dataset()), and serialization to and from JSON files.
- add_entry(positive, negative)
Adds a new DatasetEntry to the dataset.
- Return type:
None
- Parameters:
positive (str) – The positive example.
negative (str) – The negative example.
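The behaviour described for add_entry() can be pictured with a small stand-in container. This is a minimal sketch, not the dialz implementation: the names PairEntry and PairDataset are hypothetical and only mirror the documented signature (two strings in, None out).

```python
from dataclasses import dataclass, field


@dataclass
class PairEntry:
    """Hypothetical stand-in for DatasetEntry: one contrastive pair."""
    positive: str
    negative: str


@dataclass
class PairDataset:
    """Hypothetical stand-in for Dataset; not the dialz implementation."""
    entries: list = field(default_factory=list)

    def add_entry(self, positive: str, negative: str) -> None:
        # Wrap the two strings in an entry and append it; nothing is returned.
        self.entries.append(PairEntry(positive, negative))


ds = PairDataset()
ds.add_entry("I feel wonderful today.", "I feel terrible today.")
```

Keeping each pair in a single entry object means the positive and negative strings cannot drift out of alignment, which matters when the pairs are later contrasted against each other.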
- add_from_saved(saved_entries)
Adds entries from a pre-saved dataset.
- Return type:
None
- Parameters:
saved_entries (list[dict[str, str]]) – A list of dictionaries, each containing “positive” and “negative” keys.
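The saved_entries shape described above is easy to check before passing it on. The following is a sketch under assumptions: to_pairs is a hypothetical helper (not part of dialz) that validates the two required keys, and the JSON round-trip only imitates what reading a saved file would produce.

```python
import json

# Shape described above: a list of dicts with "positive" and "negative" keys.
saved_entries = [
    {"positive": "I love this.", "negative": "I hate this."},
    {"positive": "What a great day.", "negative": "What an awful day."},
]


def to_pairs(entries):
    """Hypothetical helper: validate entries and return (positive, negative) tuples."""
    pairs = []
    for i, item in enumerate(entries):
        missing = {"positive", "negative"} - set(item)
        if missing:
            raise KeyError(f"entry {i} is missing keys: {sorted(missing)}")
        pairs.append((item["positive"], item["negative"]))
    return pairs


# Round-trip through JSON text, as a saved dataset file would.
restored = json.loads(json.dumps(saved_entries))
pairs = to_pairs(restored)
```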
- classmethod create_dataset(model_name, contrastive_pair, system_role='Act as if you are extremely ', prompt_type='sentence-starters', num_sents=300)
Creates a dataset of positive and negative examples from a model, a contrastive pair, and a set of prompt variations. The model's tokenizer applies its chat template to each prompt variation, producing one positive and one negative example per variation, and each resulting pair is added to the dataset.
- Parameters:
cls – The class itself (passed implicitly, as for any class method).
model_name (str) – The name of the pre-trained model to use for tokenization.
contrastive_pair (list[str]) – A list of two strings: the positive and the negative term of the contrast.
system_role (str, optional) – The system-role prefix used in the chat template. Defaults to 'Act as if you are extremely ' (note the trailing space).
prompt_type (str, optional) – The type of prompt variations to use. Defaults to “sentence-starters”.
num_sents (int, optional) – The number of prompt variations to process. Defaults to 300.
- Returns:
A dataset object containing the generated positive and negative examples.
- Return type:
Dataset
- Raises:
FileNotFoundError – If the specified prompt variations file does not exist.
json.JSONDecodeError – If the prompt variations file is not a valid JSON file.
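The pairing logic described for create_dataset() can be sketched without a tokenizer. Everything here is an assumption made for illustration: build_pairs is a hypothetical function, the bracketed role markers stand in for a real model chat template, and the way system_role is joined to each term is a guess based on the default value's trailing space.

```python
def build_pairs(contrastive_pair, starters,
                system_role="Act as if you are extremely ", num_sents=300):
    """Hypothetical sketch: one (positive, negative) prompt pair per starter."""
    positive_word, negative_word = contrastive_pair
    pairs = []
    for starter in starters[:num_sents]:
        # A real chat template would come from the model's tokenizer;
        # plain string formatting stands in for it here.
        positive = f"[SYSTEM] {system_role}{positive_word}. [ASSISTANT] {starter}"
        negative = f"[SYSTEM] {system_role}{negative_word}. [ASSISTANT] {starter}"
        pairs.append((positive, negative))
    return pairs


# Assumed sample of sentence starters; the bundled file is not reproduced here.
starters = ["I think that", "Today I", "The weather is"]
pairs = build_pairs(["happy", "sad"], starters, num_sents=2)
```

Within each pair the two prompts differ only in the contrastive term, so any difference in downstream model behaviour can be attributed to that term rather than to the surrounding text.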
- classmethod load_dataset(model_name, name, num_sents=300)
Loads a pre-saved corpus bundled with the package, re-applies the chat template to each entry, and truncates the result to at most num_sents entries.
- Parameters:
model_name (str) – The name of the model to use for tokenization.
name (str) – The name of the dataset to load.
num_sents (int, optional) – The maximum number of entries to keep from the dataset. Defaults to 300.
- Returns:
A processed dataset with chat templates applied.
- Return type:
Dataset
- Raises:
FileNotFoundError – If the specified dataset file does not exist.
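The load-then-truncate behaviour and the FileNotFoundError case can be illustrated with the standard library alone. This is a sketch under assumptions: load_pairs is a hypothetical function, the corpus is assumed to be a JSON list of entry dictionaries, and the chat-template step is omitted.

```python
import json
import tempfile
from pathlib import Path


def load_pairs(path, num_sents=300):
    """Hypothetical sketch: read a JSON corpus and keep at most num_sents entries."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"dataset file not found: {path}")
    with path.open(encoding="utf-8") as f:
        entries = json.load(f)
    return entries[:num_sents]


with tempfile.TemporaryDirectory() as tmp:
    # Write a five-entry demo corpus, then load only the first three entries.
    corpus = Path(tmp) / "demo.json"
    demo = [{"positive": f"p{i}", "negative": f"n{i}"} for i in range(5)]
    corpus.write_text(json.dumps(demo), encoding="utf-8")
    loaded = load_pairs(corpus, num_sents=3)

    # A missing file surfaces as FileNotFoundError, as documented above.
    try:
        load_pairs(Path(tmp) / "missing.json")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```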