Strategies
SMI
- class trust.strategies.smi.SMI(labeled_dataset, unlabeled_dataset, query_dataset, net, nclasses, args={})[source]
Bases:
trust.strategies.strategy.Strategy
This strategy implements the Submodular Mutual Information (SMI) selection paradigm discussed in the paper SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios 1. In this selection paradigm, points from the unlabeled dataset are chosen so that the submodular mutual information between the chosen set and a provided query set is maximized. Doing so allows a practitioner to select points from the unlabeled set that are SIMILAR to the points they have provided in an active learning query.
These submodular mutual information functions rely on formulating embeddings for the points in the query set and the unlabeled set. Once these embeddings are formed, one or more similarity kernels (depending on the SMI function used) are formed from these embeddings based on a similarity metric. Once these similarity kernels are formed, they are used in computing the value of each submodular mutual information function. Hence, common techniques for submodular maximization subject to a cardinality constraint can be used, such as the naive greedy algorithm, the lazy greedy algorithm, and so forth.
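The kernel-building step described above can be sketched in plain Python. This is only an illustration of a cosine similarity kernel between two sets of embeddings; the helper names (cosine_similarity, similarity_kernel) and the toy embeddings are hypothetical and are not part of the library, which computes its kernels internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def similarity_kernel(rows, cols):
    """Return K as a list of lists with K[i][j] = similarity of rows[i] and cols[j]."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Toy embeddings (assumed values, for illustration only).
query_embeddings = [[1.0, 0.0], [0.0, 2.0]]
unlabeled_embeddings = [[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

# Query-to-unlabeled kernel (2 x 3) and unlabeled-to-unlabeled kernel (3 x 3).
query_kernel = similarity_kernel(query_embeddings, unlabeled_embeddings)
unlabeled_kernel = similarity_kernel(unlabeled_embeddings, unlabeled_embeddings)
```

Which of the two kernels is needed depends on the SMI function in use; a function that only compares query points to unlabeled points would need only the first.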
In this framework, we set the cardinality constraint to be the active learning selection budget; hence, a list of indices with a total length less than or equal to this cardinality constraint will be returned. Depending on the maximization configuration, one can ensure that the length of this list will be equal to the cardinality constraint.
Currently, five submodular mutual information functions are implemented: fl1mi, fl2mi, gcmi, logdetmi, and com. Each function is obtained by applying the definition of a submodular mutual information function using common submodular functions. Facility Location Mutual Information (fl1mi) models pairwise similarities of points in the query set to points in the unlabeled dataset AND pairwise similarities of points within the unlabeled dataset. Another variant of Facility Location Mutual Information (fl2mi) models pairwise similarities of points in the query set to points in the unlabeled dataset ONLY. Graph Cut Mutual Information (gcmi), Log-Determinant Mutual Information (logdetmi), and Concave-Over-Modular Mutual Information (com) are all obtained by applying the usual submodular function under this definition. For more information-theoretic discussion, consider referring to the paper Submodular Combinatorial Information Measures with Applications in Machine Learning 2.
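As a rough illustration of maximization under a cardinality constraint, the sketch below runs a naive greedy loop on an fl2mi-style objective. The objective form (query coverage plus eta times query relevance, computed from the query-to-unlabeled kernel only) is an assumption based on the description above and on the cited paper, not the library's implementation; fl2mi_value and naive_greedy are hypothetical names.

```python
def fl2mi_value(query_kernel, selected, eta=1.0):
    """Assumed fl2mi-style objective.

    query_kernel[i][j] is the similarity of query point i to unlabeled point j.
    """
    if not selected:
        return 0.0
    # Query coverage: each query point is matched to its most similar selected point.
    coverage = sum(max(row[j] for j in selected) for row in query_kernel)
    # Query relevance: each selected point is matched to its most similar query point.
    relevance = sum(max(row[j] for row in query_kernel) for j in selected)
    return coverage + eta * relevance


def naive_greedy(query_kernel, budget, eta=1.0, stop_if_zero_gain=False):
    """Add the point with the largest marginal gain until the budget is used up."""
    n_unlabeled = len(query_kernel[0])
    selected, current = [], 0.0
    for _ in range(min(budget, n_unlabeled)):
        best_gain, best_index = None, None
        for j in range(n_unlabeled):
            if j in selected:
                continue
            gain = fl2mi_value(query_kernel, selected + [j], eta) - current
            if best_gain is None or gain > best_gain:
                best_gain, best_index = gain, j
        if stop_if_zero_gain and best_gain <= 0.0:
            break  # the returned list can be shorter than the budget
        selected.append(best_index)
        current += best_gain
    return selected


# One query point, three unlabeled points (toy similarities).
chosen = naive_greedy([[0.1, 0.9, 0.5]], budget=2)
# With early stopping enabled, a point that adds nothing is never selected.
chosen_early_stop = naive_greedy([[0.0, 0.9]], budget=2, stop_if_zero_gain=True)
```

The early-stopping case shows why the returned list may be shorter than the budget, while leaving the stopping flags off yields exactly min(budget, number of unlabeled points) indices in this sketch.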
- Parameters
labeled_dataset (torch.utils.data.Dataset) – The labeled dataset to be used in this strategy. For the purposes of selection, the labeled dataset is not used, but it is provided to fit the common framework of the Strategy superclass.
unlabeled_dataset (torch.utils.data.Dataset) – The unlabeled dataset to be used in this strategy. It is used in the selection process as described above. Importantly, the unlabeled dataset must return only a data Tensor; if indexing the unlabeled dataset returns a tuple of more than one component, unexpected behavior will most likely occur.
query_dataset (torch.utils.data.Dataset) – The query dataset to be used in this strategy. It is used in the selection process as described above. Notably, the query dataset should be labeled; hence, indexing the query dataset should return a data/label pair. This is done in this fashion to allow for gradient embeddings.
net (torch.nn.Module) – The neural network model to use for embeddings and predictions. Notably, all embeddings typically come from extracted features from this network or from gradient embeddings based on the loss, which can be based on hypothesized gradients or on true gradients (depending on the availability of the label).
nclasses (int) – The number of classes being predicted by the neural network.
args (dict) –
- A dictionary containing many configurable settings for this strategy. Each key-value pair is described below:
- ’batch_size’: int
The batch size used internally for torch.utils.data.DataLoader objects. Default: 1
- ’device’: string
The device to be used for computation. PyTorch constructs are transferred to this device. Usually one of ‘cuda’ or ‘cpu’. Default: ‘cuda’ if a CUDA-enabled device is available; otherwise, ‘cpu’
- ’loss’: function
The loss function to be used in computations. Default: torch.nn.functional.cross_entropy
- ’smi_function’: string
The submodular mutual information function to use in optimization. Must be one of ‘fl1mi’, ‘fl2mi’, ‘gcmi’, ‘logdetmi’, ‘com’. REQUIRED
- ’optimizer’: string
The optimizer to use for submodular maximization. Can be one of ‘NaiveGreedy’, ‘StochasticGreedy’, ‘LazyGreedy’ and ‘LazierThanLazyGreedy’. Default: ‘NaiveGreedy’
- ’metric’: string
The similarity metric to use for similarity kernel computation. This can be either ‘cosine’ or ‘euclidean’. Default: ‘cosine’
- ’eta’: float
A magnification constant that is used in all but gcmi. It controls the trade-off between query relevance and diversity: increasing eta tends to increase query relevance while reducing query coverage and diversity. Default: 1
- ’embedding_type’: string
The type of embedding to compute for similarity kernel computation. This can be either ‘gradients’ or ‘features’. Default: ‘gradients’
- ’gradType’: string
When ‘embedding_type’ is ‘gradients’, this defines the type of gradient to use. ‘bias’ creates gradients from the loss function with respect to the biases outputted by the model. ‘linear’ creates gradients from the loss function with respect to the last linear layer features. ‘bias_linear’ creates gradients from the loss function using both. Default: ‘bias_linear’
- ’layer_name’: string
When ‘embedding_type’ is ‘features’, this defines the layer within the neural network that is used to extract feature embeddings. Namely, this argument must be the name of a module used in the forward() computation of the model. Default: ‘avgpool’
- ’stopIfZeroGain’: bool
Controls if the optimizer should cease maximization if there is zero gain in the submodular objective. Default: False
- ’stopIfNegativeGain’: bool
Controls if the optimizer should cease maximization if there is negative gain in the submodular objective. Default: False
- ’verbose’: bool
Gives a more verbose output when calling select() when True. Default: False
- select(budget)[source]
Selects a set of points from the unlabeled dataset to label based on this strategy’s methodology.
- Parameters
budget (int) – Number of points to choose from the unlabeled dataset
- Returns
chosen – List of selected data point indices with respect to the unlabeled dataset
- Return type
list
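A minimal usage sketch follows, assuming the constructor signature and args keys documented above. The particular values in smi_args and the wrapper function run_smi_selection are illustrative choices, not recommendations from the library; the datasets and the model are supplied by the caller.

```python
# Settings taken from the keys documented above; the values are examples only.
smi_args = {
    "batch_size": 32,
    "smi_function": "fl2mi",      # REQUIRED: 'fl1mi', 'fl2mi', 'gcmi', 'logdetmi' or 'com'
    "optimizer": "LazyGreedy",
    "metric": "cosine",
    "eta": 1,
    "embedding_type": "gradients",
    "gradType": "bias_linear",
    "stopIfZeroGain": False,
    "stopIfNegativeGain": False,
    "verbose": False,
}


def run_smi_selection(labeled_dataset, unlabeled_dataset, query_dataset,
                      net, nclasses, budget, args=smi_args):
    """Hypothetical wrapper: build the strategy and return the chosen indices."""
    # Imported here so this sketch can be loaded even where the library is not installed.
    from trust.strategies.smi import SMI

    strategy = SMI(labeled_dataset, unlabeled_dataset, query_dataset,
                   net, nclasses, args)
    # select() returns a list of indices into unlabeled_dataset,
    # with length at most budget.
    return strategy.select(budget)
```

The returned indices refer to positions in the unlabeled dataset, so they can be used to move the chosen points into the labeled dataset once their labels are obtained.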
SCG
- class trust.strategies.scg.SCG(labeled_dataset, unlabeled_dataset, private_dataset, net, nclasses, args={})[source]
Bases:
trust.strategies.strategy.Strategy
This strategy implements the Submodular Conditional Gain (SCG) selection paradigm discussed in the paper SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios 1. In this selection paradigm, points from the unlabeled dataset are chosen so that the submodular conditional gain between the chosen set and a provided private set is maximized. Doing so allows a practitioner to select points from the unlabeled set that are dissimilar to the points provided in the private set.
These submodular conditional gain functions rely on formulating embeddings for the points in the unlabeled set and the private set. Once these embeddings are formed, similarity kernels are formed from these embeddings based on a similarity metric. Once these similarity kernels are formed, they are used in computing the value of each submodular conditional gain function. Hence, common techniques for submodular maximization subject to a cardinality constraint can be used, such as the naive greedy algorithm, the lazy greedy algorithm, and so forth.
In this framework, we set the cardinality constraint to be the active learning selection budget; hence, a list of indices with a total length less than or equal to this cardinality constraint will be returned. Depending on the maximization configuration, one can ensure that the length of this list will be equal to the cardinality constraint.
Currently, three submodular conditional gain functions are implemented: ‘flcg’, ‘gccg’, and ‘logdetcg’. Each function is obtained by applying the definition of a submodular conditional gain function using common submodular functions. For more information-theoretic discussion, consider referring to the paper Submodular Combinatorial Information Measures with Applications in Machine Learning 2.
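To make the role of the private set concrete, here is a small sketch of a facility-location conditional gain value. The formula used (for each unlabeled point, the best similarity to the selected set minus nu times the best similarity to the private set, clipped at zero) is an assumption drawn from the cited paper rather than from this library's code, and flcg_value is a hypothetical name.

```python
def flcg_value(unlabeled_kernel, private_kernel, selected, nu=1.0):
    """Assumed flcg-style objective.

    unlabeled_kernel[i][j]: similarity of unlabeled point i to unlabeled point j.
    private_kernel[i][p]:   similarity of unlabeled point i to private point p.
    """
    if not selected:
        return 0.0
    total = 0.0
    for i, row in enumerate(unlabeled_kernel):
        best_selected = max(row[j] for j in selected)
        best_private = max(private_kernel[i])
        # Points already well represented by the private set contribute little or nothing.
        total += max(best_selected - nu * best_private, 0.0)
    return total


# Three unlabeled points and one private point (toy similarities).
unlabeled_kernel = [[1.0, 0.2, 0.0],
                    [0.2, 1.0, 0.0],
                    [0.0, 0.0, 1.0]]
private_kernel = [[0.9], [0.1], [0.0]]  # unlabeled point 0 resembles the private point

near_private = flcg_value(unlabeled_kernel, private_kernel, [0])
far_from_private = flcg_value(unlabeled_kernel, private_kernel, [2])
ignoring_private = flcg_value(unlabeled_kernel, private_kernel, [0], nu=0.0)
```

In this toy case the point that resembles the private set scores lower than the dissimilar one, and setting nu to 0 removes the penalty entirely.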
- Parameters
labeled_dataset (torch.utils.data.Dataset) – The labeled dataset to be used in this strategy. For the purposes of selection, the labeled dataset is not used, but it is provided to fit the common framework of the Strategy superclass.
unlabeled_dataset (torch.utils.data.Dataset) – The unlabeled dataset to be used in this strategy. It is used in the selection process as described above. Importantly, the unlabeled dataset must return only a data Tensor; if indexing the unlabeled dataset returns a tuple of more than one component, unexpected behavior will most likely occur.
private_dataset (torch.utils.data.Dataset) – The private dataset to be used in this strategy. It is used in the selection process as described above. Notably, the private dataset should be labeled; hence, indexing the private dataset should return a data/label pair. This is done in this fashion to allow for gradient embeddings.
net (torch.nn.Module) – The neural network model to use for embeddings and predictions. Notably, all embeddings typically come from extracted features from this network or from gradient embeddings based on the loss, which can be based on hypothesized gradients or on true gradients (depending on the availability of the label).
nclasses (int) – The number of classes being predicted by the neural network.
args (dict) –
- A dictionary containing many configurable settings for this strategy. Each key-value pair is described below:
- ’batch_size’: int
The batch size used internally for torch.utils.data.DataLoader objects. Default: 1
- ’device’: string
The device to be used for computation. PyTorch constructs are transferred to this device. Usually one of ‘cuda’ or ‘cpu’. Default: ‘cuda’ if a CUDA-enabled device is available; otherwise, ‘cpu’
- ’loss’: function
The loss function to be used in computations. Default: torch.nn.functional.cross_entropy
- ’scg_function’: string
The submodular conditional gain function to use in optimization. Must be one of ‘flcg’, ‘gccg’, or ‘logdetcg’. REQUIRED
- ’optimizer’: string
The optimizer to use for submodular maximization. Can be one of ‘NaiveGreedy’, ‘StochasticGreedy’, ‘LazyGreedy’ and ‘LazierThanLazyGreedy’. Default: ‘NaiveGreedy’
- ’metric’: string
The similarity metric to use for similarity kernel computation. This can be either ‘cosine’ or ‘euclidean’. Default: ‘cosine’
- ’nu’: float
A parameter that governs the hardness of the privacy constraint. Default: 1.
- ’embedding_type’: string
The type of embedding to compute for similarity kernel computation. This can be either ‘gradients’ or ‘features’. Default: ‘gradients’
- ’gradType’: string
When ‘embedding_type’ is ‘gradients’, this defines the type of gradient to use. ‘bias’ creates gradients from the loss function with respect to the biases outputted by the model. ‘linear’ creates gradients from the loss function with respect to the last linear layer features. ‘bias_linear’ creates gradients from the loss function using both. Default: ‘bias_linear’
- ’layer_name’: string
When ‘embedding_type’ is ‘features’, this defines the layer within the neural network that is used to extract feature embeddings. Namely, this argument must be the name of a module used in the forward() computation of the model. Default: ‘avgpool’
- ’stopIfZeroGain’: bool
Controls if the optimizer should cease maximization if there is zero gain in the submodular objective. Default: False
- ’stopIfNegativeGain’: bool
Controls if the optimizer should cease maximization if there is negative gain in the submodular objective. Default: False
- ’verbose’: bool
Gives a more verbose output when calling select() when True. Default: False
- select(budget)[source]
Selects a set of points from the unlabeled dataset to label based on this strategy’s methodology.
- Parameters
budget (int) – Number of points to choose from the unlabeled dataset
- Returns
chosen – List of selected data point indices with respect to the unlabeled dataset
- Return type
list
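The requirement that the unlabeled dataset return only a data Tensor, while the private dataset returns data/label pairs, can be met with a thin wrapper. The sketch below is one possible way to do this; DataOnlyView is a hypothetical helper, not something the library provides, and the fallback to object exists only so the sketch loads without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch can be loaded without PyTorch; with PyTorch
    # installed the class below is a regular map-style Dataset.
    Dataset = object


class DataOnlyView(Dataset):
    """Expose only the data component of a dataset that yields (data, label) pairs."""

    def __init__(self, paired_source):
        self.paired_source = paired_source

    def __len__(self):
        return len(self.paired_source)

    def __getitem__(self, index):
        data, _label = self.paired_source[index]  # drop the label
        return data


# Toy stand-in for a labeled pool; real data would be tensors.
pairs = [([0.0, 1.0], 0), ([1.0, 0.0], 1)]
unlabeled_view = DataOnlyView(pairs)
```

A view like this could be passed as unlabeled_dataset, while the private dataset is passed unwrapped so that its labels remain available for gradient embeddings.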
SCMI
- class trust.strategies.scmi.SCMI(labeled_dataset, unlabeled_dataset, query_dataset, private_dataset, net, nclasses, args={})[source]
Bases:
trust.strategies.strategy.Strategy
This strategy implements the Submodular Conditional Mutual Information (SCMI) selection paradigm discussed in the paper SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios 1. In this selection paradigm, points from the unlabeled dataset are chosen so that the submodular conditional mutual information between the chosen set and a provided query set is maximized, conditioned on a private dataset. Doing so allows a practitioner to select points from the unlabeled set that are SIMILAR to the points they have provided in the query set while being dissimilar to the points provided in the private set.
These submodular conditional mutual information functions rely on formulating embeddings for the points in the query set, the unlabeled set, and the private set. Once these embeddings are formed, similarity kernels are formed from these embeddings based on a similarity metric. Once these similarity kernels are formed, they are used in computing the value of each submodular conditional mutual information function. Hence, common techniques for submodular maximization subject to a cardinality constraint can be used, such as the naive greedy algorithm, the lazy greedy algorithm, and so forth.
In this framework, we set the cardinality constraint to be the active learning selection budget; hence, a list of indices with a total length less than or equal to this cardinality constraint will be returned. Depending on the maximization configuration, one can ensure that the length of this list will be equal to the cardinality constraint.
Currently, two submodular conditional mutual information functions are implemented: ‘flcmi’ and ‘logdetcmi’. Each function is obtained by applying the definition of a submodular conditional mutual information function using common submodular functions. For more information-theoretic discussion, consider referring to the paper Submodular Combinatorial Information Measures with Applications in Machine Learning 2.
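The interplay between the query set and the private set can be illustrated with a facility-location style value. The formula below (similarity to the selected set capped by eta times similarity to the query set, minus nu times similarity to the private set, clipped at zero) is an assumption based on the cited paper and should not be read as the library's implementation; flcmi_value is a hypothetical name.

```python
def flcmi_value(unlabeled_kernel, query_kernel, private_kernel, selected,
                eta=1.0, nu=1.0):
    """Assumed flcmi-style objective.

    unlabeled_kernel[i][j]: similarity of unlabeled point i to unlabeled point j.
    query_kernel[i][q]:     similarity of unlabeled point i to query point q.
    private_kernel[i][p]:   similarity of unlabeled point i to private point p.
    """
    if not selected:
        return 0.0
    total = 0.0
    for i, row in enumerate(unlabeled_kernel):
        best_selected = max(row[j] for j in selected)
        best_query = max(query_kernel[i])
        best_private = max(private_kernel[i])
        # Reward query-like points, then subtract the private-set penalty.
        relevant = min(best_selected, eta * best_query)
        total += max(relevant - nu * best_private, 0.0)
    return total


# Three unlabeled points, one query point and one private point (toy similarities).
unlabeled_kernel = [[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]]
query_kernel = [[0.9], [0.8], [0.1]]    # points 0 and 1 resemble the query
private_kernel = [[0.0], [0.7], [0.0]]  # point 1 also resembles the private set

scores = [flcmi_value(unlabeled_kernel, query_kernel, private_kernel, [j])
          for j in range(3)]
```

Here the first point scores highest because it is close to the query and far from the private set, whereas the second is penalized by the private set and the third is simply not query-relevant.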
- Parameters
labeled_dataset (torch.utils.data.Dataset) – The labeled dataset to be used in this strategy. For the purposes of selection, the labeled dataset is not used, but it is provided to fit the common framework of the Strategy superclass.
unlabeled_dataset (torch.utils.data.Dataset) – The unlabeled dataset to be used in this strategy. It is used in the selection process as described above. Importantly, the unlabeled dataset must return only a data Tensor; if indexing the unlabeled dataset returns a tuple of more than one component, unexpected behavior will most likely occur.
query_dataset (torch.utils.data.Dataset) – The query dataset to be used in this strategy. It is used in the selection process as described above. Notably, the query dataset should be labeled; hence, indexing the query dataset should return a data/label pair. This is done in this fashion to allow for gradient embeddings.
private_dataset (torch.utils.data.Dataset) – The private dataset to be used in this strategy. It is used in the selection process as described above. Notably, the private dataset should be labeled; hence, indexing the private dataset should return a data/label pair. This is done in this fashion to allow for gradient embeddings.
net (torch.nn.Module) – The neural network model to use for embeddings and predictions. Notably, all embeddings typically come from extracted features from this network or from gradient embeddings based on the loss, which can be based on hypothesized gradients or on true gradients (depending on the availability of the label).
nclasses (int) – The number of classes being predicted by the neural network.
args (dict) –
- A dictionary containing many configurable settings for this strategy. Each key-value pair is described below:
- ’batch_size’: int
The batch size used internally for torch.utils.data.DataLoader objects. Default: 1
- ’device’: string
The device to be used for computation. PyTorch constructs are transferred to this device. Usually one of ‘cuda’ or ‘cpu’. Default: ‘cuda’ if a CUDA-enabled device is available; otherwise, ‘cpu’
- ’loss’: function
The loss function to be used in computations. Default: torch.nn.functional.cross_entropy
- ’scmi_function’: string
The submodular conditional mutual information function to use in optimization. Must be one of ‘flcmi’ or ‘logdetcmi’. REQUIRED
- ’optimizer’: string
The optimizer to use for submodular maximization. Can be one of ‘NaiveGreedy’, ‘StochasticGreedy’, ‘LazyGreedy’ and ‘LazierThanLazyGreedy’. Default: ‘NaiveGreedy’
- ’metric’: string
The similarity metric to use for similarity kernel computation. This can be either ‘cosine’ or ‘euclidean’. Default: ‘cosine’
- ’eta’: float
A magnification constant that is used in all but gcmi. It controls the trade-off between query relevance and diversity: increasing eta tends to increase query relevance while reducing query coverage and diversity. Default: 1
- ’nu’: float
A parameter that governs the hardness of the privacy constraint. Default: 1.
- ’embedding_type’: string
The type of embedding to compute for similarity kernel computation. This can be either ‘gradients’ or ‘features’. Default: ‘gradients’
- ’gradType’: string
When ‘embedding_type’ is ‘gradients’, this defines the type of gradient to use. ‘bias’ creates gradients from the loss function with respect to the biases outputted by the model. ‘linear’ creates gradients from the loss function with respect to the last linear layer features. ‘bias_linear’ creates gradients from the loss function using both. Default: ‘bias_linear’
- ’layer_name’: string
When ‘embedding_type’ is ‘features’, this defines the layer within the neural network that is used to extract feature embeddings. Namely, this argument must be the name of a module used in the forward() computation of the model. Default: ‘avgpool’
- ’stopIfZeroGain’: bool
Controls if the optimizer should cease maximization if there is zero gain in the submodular objective. Default: False
- ’stopIfNegativeGain’: bool
Controls if the optimizer should cease maximization if there is negative gain in the submodular objective. Default: False
- ’verbose’: bool
Gives a more verbose output when calling select() when True. Default: False
- select(budget)[source]
Selects a set of points from the unlabeled dataset to label based on this strategy’s methodology.
- Parameters
budget (int) – Number of points to choose from the unlabeled dataset
- Returns
chosen – List of selected data point indices with respect to the unlabeled dataset
- Return type
list
References
- 1
Suraj Kothawade, Nathan Beck, Krishnateja Killamsetty, and Rishabh Iyer. SIMILAR: Submodular information measures based active learning in realistic scenarios. arXiv preprint arXiv:2107.00717, 2021.
- 2
Rishabh Iyer, Ninad Khargonkar, Jeff Bilmes, and Himanshu Asnani. Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory, 722–754. PMLR, 2021.