methods module

The methods used to run the Pharmacogenomic ML pipeline.

It includes the methods needed for preprocessing, normalization, feature selection, domain adaptation, and drug resistance prediction.

methods.combine(ge, dr, drug, metric='AUC_IC50')

Combines drug resistance and gene expression data

This method merges gene expression and drug resistance data into a single dataframe. Drug resistance is incorporated using the defined metric.

Parameters
  • ge (pandas.DataFrame) – gene expression data

  • dr (pandas.DataFrame) – drug resistance data

  • drug (str) – drug name

  • metric (str) – determines the drug resistance measure that will be used. Defaults to 'AUC_IC50'.

Returns

A pandas dataframe containing both gene expression and drug resistance measurements for the given drug.
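As a rough illustration of the kind of merge described above, the following sketch joins a gene expression table with the response values of one drug. It is not the module's implementation: the function name combine_sketch, the column names "cell_line" and "drug", the long-format layout of dr, and the toy values are all assumptions made for the example.

```python
import pandas as pd

def combine_sketch(ge, dr, drug, metric="AUC_IC50"):
    # Assumed layout: dr has one row per (cell line, drug) pair,
    # with hypothetical columns "cell_line" and "drug".
    response = dr.loc[dr["drug"] == drug].set_index("cell_line")[metric]
    # Inner join keeps only the cell lines present in both tables.
    return ge.join(response, how="inner")

# Toy data, invented purely for illustration.
ge = pd.DataFrame(
    {"gene_a": [5.1, 6.3, 4.8], "gene_b": [7.2, 3.9, 6.6]},
    index=["ccl_1", "ccl_2", "ccl_3"],
)
dr = pd.DataFrame(
    {
        "cell_line": ["ccl_1", "ccl_2", "ccl_1"],
        "drug": ["drug_x", "drug_x", "drug_y"],
        "AUC_IC50": [0.81, 0.42, 0.67],
    }
)

combined = combine_sketch(ge, dr, "drug_x")
print(combined)
```

The result has one row per cell line measured for the requested drug, with the gene columns followed by a column named after the metric.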

methods.drp(model, X, y, tuning=None)

Trains model on drug data.

Trains the estimator passed as model on the given data, optimizing it over the tuning search space when one is provided. It then stores the model in the drug’s model variable.

Parameters
  • model (estimator object) – model that will be trained on the data

  • X (pandas DataFrame or numpy array) – features of the samples on which the model is trained

  • y (pandas DataFrame or numpy array) – labels of the samples on which the model is trained

  • tuning (classes.tuning) – hyperparameter search space over which the model will be optimized. If not specified, default values will be used.

Returns

The fitted model

methods.feda(domains)

Performs Frustratingly Easy Domain Adaptation (FEDA), a type of domain adaptation.

This method performs domain adaptation on the drug data using FEDA.

Parameters

domains (pandas DataFrame) – data on which FEDA will be performed. It should contain the domain on level 0 of the keys.

Returns

A DataFrame with the transformed data
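For readers unfamiliar with FEDA, the feature augmentation it relies on can be sketched as follows: every sample keeps a shared copy of its features and gains one domain-specific copy, with zeros in the blocks of all other domains. The sketch below is an illustration, not the module's code. The name feda_sketch, the "shared" block label, and the choice to read the domain from level 0 of the row index are assumptions.

```python
import pandas as pd

def feda_sketch(domains):
    # Domain label of every row, read from level 0 of the row index (assumption).
    labels = domains.index.get_level_values(0)
    values = domains.to_numpy(dtype=float)
    # Shared block: an unchanged copy of every feature.
    blocks = {"shared": domains.astype(float)}
    for d in labels.unique():
        # Domain-specific block: original values for rows of domain d, zeros elsewhere.
        mask = (labels == d)[:, None]
        blocks[d] = pd.DataFrame(values * mask, index=domains.index, columns=domains.columns)
    # Side-by-side concatenation; the dict keys become the outer column level.
    return pd.concat(blocks, axis=1)

# Toy data with two hypothetical domains, "source" and "target".
index = pd.MultiIndex.from_tuples([("source", "s1"), ("source", "s2"), ("target", "t1")])
data = pd.DataFrame({"g1": [1.0, 2.0, 3.0], "g2": [4.0, 5.0, 6.0]}, index=index)

augmented = feda_sketch(data)
print(augmented)
```

With two domains and two features, the output has six columns: the shared block plus one block per domain. A domain literally named "shared" would collide with the shared block in this sketch.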

methods.fs(model, X_train, X_test, y, n=0, tuning=None)

Selects features using the given model

This method selects the best features based on a given model.

Parameters
  • model (estimator object) – the model that will be used to determine the best features.

  • X_train (pandas DataFrame or numpy array) – training data on which the best features will be selected.

  • X_test (pandas DataFrame or numpy array) – test data. It is only transformed by keeping the selected features.

  • y (pandas DataFrame or numpy array) – labels of the train set

  • n (float) – determines what percentage of the features will be selected

  • tuning (classes.tuning) – defines a hyperparameter search space over which the feature selection method can be optimized.

Returns

Two DataFrames containing the train and test sets, respectively, with only the selected features, and a dictionary containing the selected features and their respective weights.
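The general pattern (rank features on the training set only, then apply the same column subset to the test set) can be sketched as below. This is not the module's method: instead of a fitted model, the sketch uses the absolute Pearson correlation with y as a stand-in weight, and it treats n as a fraction between 0 and 1. Both choices, along with the name fs_sketch and the toy data, are assumptions.

```python
import math
import pandas as pd

def fs_sketch(X_train, X_test, y, n=0.5):
    # Stand-in weights: absolute Pearson correlation of each training feature with y.
    # X_train and y must share the same row index.
    weights = X_train.corrwith(y).abs().sort_values(ascending=False)
    # Number of features to keep, at least one (n is treated as a fraction here).
    k = max(1, math.ceil(n * X_train.shape[1]))
    selected = weights.index[:k]
    # The test set is only subset; nothing is fitted on it.
    return X_train[selected], X_test[selected], weights.iloc[:k].to_dict()

# Toy data, invented for illustration.
X_train = pd.DataFrame(
    {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [1.0, -1.0, 1.0, -1.0], "g3": [4.0, 3.0, 2.0, 1.5]}
)
X_test = pd.DataFrame({"g1": [2.5], "g2": [1.0], "g3": [2.2]})
y = pd.Series([1.0, 2.0, 3.0, 4.0])

train_sel, test_sel, weights = fs_sketch(X_train, X_test, y, n=0.5)
print(list(train_sel.columns), weights)
```

Ranking on the training set alone keeps information from the test set out of the selection step.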

methods.group_ajive(data, joint)

Performs Angle-based Joint and Individual Variation Explained (AJIVE), a type of domain adaptation.

This method performs domain adaptation by learning the joint space projections from the train set and then projecting the test set onto the joint space. For this, the methods ajive and ajive_predict are used respectively.

Parameters
  • data (pandas DataFrame) – the data for which domain adaptation is being performed

  • joint (int) – determines the rank of the joint space of the domains

methods.norm(model, ge)

Applies normalization to data

Normalizes the given data using the given model, then returns the normalized data.

Parameters
  • model (sklearn.base.TransformerMixin) – a normalization method on which fit_transform can be called

  • ge (pandas DataFrame or numpy array) – the data that should be normalized

Returns

A pandas DataFrame or numpy array containing the transformed data.
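The contract described above, any object on which fit_transform can be called, can be illustrated with the sketch below. It is not the module's code. ZScore is a toy transformer written for this example (a scikit-learn scaler would be the usual choice), norm_sketch is a hypothetical name, and re-attaching the row and column labels to the output is an assumption.

```python
import numpy as np
import pandas as pd

class ZScore:
    """Toy transformer exposing fit_transform (stand-in for a real scaler)."""

    def fit_transform(self, X):
        X = np.asarray(X, dtype=float)
        # Column-wise standardization: zero mean, unit variance.
        return (X - X.mean(axis=0)) / X.std(axis=0)

def norm_sketch(model, ge):
    transformed = model.fit_transform(ge)
    # Re-attach labels when a DataFrame went in and a bare array came out (assumption).
    if isinstance(ge, pd.DataFrame) and not isinstance(transformed, pd.DataFrame):
        return pd.DataFrame(transformed, index=ge.index, columns=ge.columns)
    return transformed

# Toy data, invented for illustration.
ge = pd.DataFrame(
    {"g1": [1.0, 2.0, 3.0], "g2": [10.0, 20.0, 30.0]},
    index=["ccl_1", "ccl_2", "ccl_3"],
)

normalized = norm_sketch(ZScore(), ge)
print(normalized)
```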

methods.pre(data, p=0.1, t=4)

Applies pre-processing to data

Performs pre-processing on the gene expression data and returns a dataframe containing the selected genes. This pre-processing is used to find unexpressed genes that a microarray could detect as background noise.

Parameters
  • data (pandas.DataFrame) – the data to be pre-processed. Rows should be cancer cell lines (CCLs) and columns should be genes.

  • t (float) – determines the threshold below which genes are considered to be unexpressed.

  • p (float) – a value in the range ]0,1] giving the minimum fraction of the CCLs in which a gene must be expressed. Genes expressed in a smaller fraction of the CCLs are dropped.

Returns

A pandas DataFrame with only the selected genes included.
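The filter described by t and p can be sketched in a few lines. This is an illustration under stated assumptions rather than the module's implementation: the name pre_sketch is hypothetical, a value at or above t is counted as expressed, and the toy values are invented.

```python
import pandas as pd

def pre_sketch(data, p=0.1, t=4):
    # A gene counts as expressed in a cell line when its value is at or above t.
    expressed = data >= t
    # Fraction of cell lines (rows) in which each gene (column) is expressed.
    fraction = expressed.mean(axis=0)
    # Keep genes expressed in at least a fraction p of the cell lines.
    keep = fraction[fraction >= p].index
    return data.loc[:, keep]

# Toy data: rows are cell lines, columns are genes.
data = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 3.5],  # never reaches t=4
        "gene_b": [5.0, 1.0, 2.0, 3.0],  # expressed in 1 of 4 cell lines
        "gene_c": [6.0, 7.0, 8.0, 9.0],  # expressed everywhere
    },
    index=["ccl_1", "ccl_2", "ccl_3", "ccl_4"],
)

filtered_default = pre_sketch(data)         # p=0.1, t=4
filtered_strict = pre_sketch(data, p=0.5)   # stricter fraction
print(list(filtered_default.columns), list(filtered_strict.columns))
```

With the defaults, gene_a is dropped because it is never expressed. Raising p to 0.5 also drops gene_b, which is expressed in only a quarter of the cell lines.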