classes module

The classes used to run the Pharmacogenomic ML pipeline.

This module contains two classes. The tuning class defines a hyper parameter search space used to optimize the performance of a method on a validation set. The drug class creates, trains and tests a drug resistance prediction model for a specific drug. It includes the methods needed for preprocessing, normalization, feature selection, domain adaptation, drug resistance prediction and retrieval of results.

class classes.drug(name, ge, dr)

Bases: object

Class used for training a drug resistance model

The drug class allows us to create, train and test a drug resistance prediction model for a specific drug. It includes each of the methods necessary for preprocessing, normalization, feature selection, domain adaptation, drug resistance prediction and retrieval of results.

Parameters

name (str) – the name of the drug for which the model is trained
ge (dict) – a dictionary containing the different domains as keys and the gene expression data from that domain as value
dr (dict) – a dictionary containing the different domains as keys and the drug resistance data from that domain as value

ajive(joint)

Performs Angle-based Joint and Individual Variation Explained (AJIVE), a type of Domain Adaptation

This method performs domain adaptation on the drug data by using the group_ajive() method. It then stores the transformed data as a dataframe on the da variable.

Parameters: joint (int) – determines the rank of the joint space of the domains

combine(metric='AUC_IC50')

Combines drug resistance and gene expression data

This method puts gene expression and drug resistance data together into one dataframe and stores it in the drug’s data object. It does so by using the combine() method.

Parameters: metric (str) – determines the drug resistance measure that will be used.
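
To make the shape of the combined data concrete, here is a minimal sketch of one way such a join could look. It is not the pipeline's actual combine() implementation: the helper combine_domains, the domain names, the gene and cell line labels, and the added domain column are all hypothetical, and the layout (cell lines as rows, one dataframe per domain) is an assumption.

```python
import pandas as pd

# Hypothetical inputs shaped like the ge and dr dictionaries: one frame per domain.
ge = {
    "domain_a": pd.DataFrame({"gene_a": [1.0, 2.0]}, index=["ccl_1", "ccl_2"]),
    "domain_b": pd.DataFrame({"gene_a": [3.0]}, index=["ccl_3"]),
}
dr = {
    "domain_a": pd.DataFrame({"AUC_IC50": [0.4, 0.7]}, index=["ccl_1", "ccl_2"]),
    "domain_b": pd.DataFrame({"AUC_IC50": [0.9]}, index=["ccl_3"]),
}


def combine_domains(ge, dr, metric="AUC_IC50"):
    """Hypothetical helper: join features and one response column per domain."""
    frames = []
    for domain, expression in ge.items():
        # Keep only cell lines that have both expression and response values.
        merged = expression.join(dr[domain][[metric]], how="inner")
        merged["domain"] = domain  # remember which domain each row came from
        frames.append(merged)
    return pd.concat(frames)


data = combine_domains(ge, dr)
```

In this sketch the metric argument simply names the response column to keep, mirroring the documented default of 'AUC_IC50'.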

feda()

Performs Frustratingly Easy Domain Adaptation, a type of Domain Adaptation.

This method performs domain adaptation on the drug data by using the feda() method. It then stores the transformed data as a dataframe on the da variable.
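
The idea behind Frustratingly Easy Domain Adaptation is feature augmentation: every sample gets one shared copy of its features plus one copy per domain, where only the copy for the sample's own domain is non-zero. The sketch below illustrates that idea with NumPy. The helper feda_augment and the domain labels are hypothetical, and the pipeline's own feda() may differ in column order and output type.

```python
import numpy as np


def feda_augment(X, domains):
    """Hypothetical helper: FEDA-style feature augmentation.

    X       : (n_samples, n_features) array of features
    domains : sequence of n_samples domain labels
    Returns an array of shape (n_samples, n_features * (1 + n_domains)).
    """
    X = np.asarray(X, dtype=float)
    domains = np.asarray(domains)
    names = sorted(set(domains.tolist()))  # fixed, reproducible domain order
    n, d = X.shape
    out = np.zeros((n, d * (1 + len(names))))
    out[:, :d] = X  # shared copy, seen by every domain
    for k, name in enumerate(names, start=1):
        rows = domains == name
        out[rows, k * d:(k + 1) * d] = X[rows]  # copy specific to this domain
    return out


X = np.array([[1.0, 2.0], [3.0, 4.0]])
augmented = feda_augment(X, ["domain_a", "domain_b"])
```

A model trained on the augmented features can put weight on the shared block for effects common to all domains and on a domain block for effects specific to that domain.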

fs(model, n=0, tuning=None)

Selects features using the given model

This method selects the best features according to the given model. The features are selected using fs(), and the selected subset of columns is stored on the col variable so that get() returns only those features.

Parameters

model (estimator object) – this is the method that will be used to determine the best features
n (float) – determines what percentage of the features will be selected
tuning (classes.tuning) – defines a hyper parameter search space over which the feature selection method can be optimized.
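
As a rough illustration of model-based feature selection with a target fraction of features, the sketch below uses scikit-learn's SelectFromModel. This is an assumption about how such a step could be done, not a description of the pipeline's fs(). The Lasso estimator, the synthetic data and the reading of n as a fraction between 0 and 1 are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in for gene expression features and drug response labels.
X, y = make_regression(n_samples=80, n_features=10, n_informative=3, random_state=0)

n = 0.3                                  # assumed meaning: fraction of features to keep
k = max(1, int(round(n * X.shape[1])))   # number of features to keep

# threshold=-inf disables the importance cut-off, so exactly the k features
# with the largest absolute coefficients are selected.
selector = SelectFromModel(Lasso(alpha=1.0), threshold=-np.inf, max_features=k)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())  # positions of the kept columns
X_selected = selector.transform(X)
```

The array selected plays the role that the documentation assigns to the stored column subset: it records which columns to keep when features are retrieved later.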

get(data, split)

Returns the data requested

This method returns the requested data, given the split and whether features or labels are needed. It takes into account whether preprocessing, feature selection and domain adaptation have already been performed.

Parameters

data (str) – can be either ‘X’ or ‘y’ which will return features or labels respectively.
split (str) – can be either ‘train’ or ‘test’ which will return either the train or test split

Returns

A pandas dataframe containing the split of features or labels that was requested

metrics(arr)

Calculates the prediction scores as defined on arr

Given the predicted labels and the actual labels it calculates all of the metrics defined on arr. It then stores the resulting scores on the drug object and returns them.

Parameters: arr – the scoring functions to calculate
Returns: A dictionary containing the name of the scoring function as key and the actual score as value.
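
The returned dictionary can be pictured with the sketch below, which assumes that arr holds scoring callables taking the actual and predicted labels, in that order. The helper score_all and the toy labels are hypothetical; r2_score and mean_squared_error are standard scikit-learn metrics chosen for illustration.

```python
from sklearn.metrics import mean_squared_error, r2_score


def score_all(y_true, y_pred, arr):
    """Hypothetical helper: apply each scoring function, keyed by its name."""
    return {fn.__name__: fn(y_true, y_pred) for fn in arr}


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = score_all(y_true, y_pred, [r2_score, mean_squared_error])
```

Here scores has the keys 'r2_score' and 'mean_squared_error', one entry per function passed in.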

norm(model)

Applies normalization to data

Retrieves the gene expression data and normalizes it using the method given by model. Then it stores the normalized data on the gene expression dataframe. For this the method norm() is used.

Parameters: model (sklearn.base.TransformerMixin) – a normalization method on which fit_transform can be called
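
Any transformer that exposes fit_transform fits this description. The sketch below uses scikit-learn's StandardScaler as one example and rebuilds the dataframe afterwards, because fit_transform returns a NumPy array by default. The gene and cell line labels are made up, and whether the pipeline restores the index and columns in this way is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy gene expression data: cell lines as rows, genes as columns.
ge = pd.DataFrame(
    {"gene_a": [1.0, 2.0, 3.0], "gene_b": [10.0, 20.0, 30.0]},
    index=["ccl_1", "ccl_2", "ccl_3"],
)

scaler = StandardScaler()
normalized = pd.DataFrame(
    scaler.fit_transform(ge),  # zero mean and unit variance per gene
    index=ge.index,            # restore the row labels
    columns=ge.columns,        # restore the gene names
)
```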

pre(p=0.01, t=4)

Applies pre-processing to data

Performs pre-processing on the gene expression data and stores it on the data pandas dataframe. To do this it uses the preprocessing method pre().

Parameters

p (float) – a fraction in the range ]0,1] that sets the minimum share of the CCLs in which a gene must be expressed. If the actual share is smaller, that gene is dropped.
t (float) – determines the threshold below which genes are considered to be unexpressed
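
The interaction between the two parameters can be sketched as follows. The helper filter_unexpressed and the toy data are hypothetical, and treating a value exactly equal to t as expressed is an assumption about the boundary case.

```python
import pandas as pd


def filter_unexpressed(ge, p=0.01, t=4):
    """Hypothetical helper: keep genes expressed in at least a share p of samples.

    ge : DataFrame with cell lines as rows and genes as columns
    """
    expressed_share = (ge >= t).mean(axis=0)  # per gene: share of samples at or above t
    return ge.loc[:, expressed_share >= p]


ge = pd.DataFrame(
    {
        "gene_a": [5.0, 6.0, 0.5, 7.0],  # expressed in 3 of 4 cell lines
        "gene_b": [0.1, 0.2, 0.0, 0.3],  # never expressed
        "gene_c": [4.5, 1.0, 1.0, 1.0],  # expressed in 1 of 4 cell lines
    },
    index=["ccl_1", "ccl_2", "ccl_3", "ccl_4"],
)

strict = filter_unexpressed(ge, p=0.5, t=4)  # keeps gene_a only
default = filter_unexpressed(ge)             # keeps gene_a and gene_c
```

With the documented default p=0.01, a gene is dropped only if it is expressed in fewer than 1% of the cell lines, so the filter mainly removes genes that are almost never expressed.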

predict(X=pd.DataFrame())

Predicts the labels of the given samples based on the features using the trained model

Predicts the labels of the samples in X using the trained model. If no X is provided, the previously defined test split is used. The predicted labels are both stored on the drug object and returned.

Parameters: X (DataFrame or numpy array) – samples for which the labels should be calculated
Returns: A pandas dataframe or numpy array containing the predicted labels.

split(test=None)

Splits the data into train and test sets

This method splits the data into train and test sets so methods can be trained on the train set and evaluated on the test set. For this sklearn’s train_test_split() method is used, stratifying on the domain. Then the indices of the elements in the train and test split are stored in two dictionaries so they can be retrieved from the drug object.

Parameters: test (float) – determines what percentage of the data is put into the test split. None represents the sklearn default value which is 0.25
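
A stratified split of this kind can be sketched with scikit-learn directly. The data, the domain and response column names, the fixed random_state and the single indices dictionary are all illustrative assumptions. Only the use of train_test_split() with stratification on the domain comes from the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical combined data: one row per cell line, tagged with its domain.
data = pd.DataFrame(
    {"domain": ["a"] * 8 + ["b"] * 4, "response": [float(i) for i in range(12)]},
    index=[f"ccl_{i}" for i in range(12)],
)

train, test = train_test_split(
    data,
    test_size=None,           # None falls back to sklearn's default of 0.25
    stratify=data["domain"],  # keep the domain proportions in both splits
    random_state=0,           # fixed seed so the sketch is reproducible
)

# Keep only the row labels of each split, so the splits can be looked up later.
indices = {"train": list(train.index), "test": list(test.index)}
```

Stratifying on the domain means each domain contributes to the test set in proportion to its size, which prevents a small domain from ending up entirely in one split.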

train(model, tuning=None)

Trains model on drug data.

Trains the model given by model on the train split of the data by using the drp() method. It then stores the trained model on the drug’s model variable.

Parameters

model (estimator object) – model that will be trained on the data
tuning (classes.tuning) – hyper parameter search space over which the model will be optimized. If not specified default values will be used.

class classes.tuning(space, iterations=100, scoring='r2', cv=3, jobs=-2)

Bases: object

Class used for defining a hyper parameter search space where a model is optimized.

This class describes the parameters needed for a randomized search over a number of hyper parameters.

Parameters

space (dict) – contains the parameters as keys and the possible values for the parameters as values
iterations (int) – the number of iterations to test before determining the optimal hyper parameters
scoring (scoring method) – a scoring function to determine which hyper parameters are best
cv (int) – the number of folds for the cross validation
jobs (int) – the number of CPUs used to perform the search
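
These parameters line up naturally with the arguments of scikit-learn's RandomizedSearchCV. The sketch below shows that correspondence under the assumption that the tuning object is consumed this way, which the documentation does not state explicitly. The Ridge model, the search space and the synthetic data are illustrative, and n_iter and n_jobs are set to small values here only to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for features and drug response.
X, y = make_regression(n_samples=60, n_features=5, noise=0.1, random_state=0)

space = {"alpha": [0.01, 0.1, 1.0, 10.0]}  # hypothetical search space

search = RandomizedSearchCV(
    Ridge(),
    param_distributions=space,  # assumed counterpart of `space`
    n_iter=4,                   # assumed counterpart of `iterations` (documented default: 100)
    scoring="r2",               # assumed counterpart of `scoring`
    cv=3,                       # assumed counterpart of `cv`
    n_jobs=1,                   # assumed counterpart of `jobs` (documented default: -2)
    random_state=0,             # fixed seed so the sketch is reproducible
)
search.fit(X, y)
best = search.best_params_
```

In scikit-learn, n_jobs=-2 means all CPUs but one, which would explain the documented default for jobs if the value is passed through unchanged.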