SNF clustering


Description

Similarity Network Fusion (SNF) is a computational method for integrating genomic data, developed in the lab of Anna Goldenberg. SNF first constructs a patient similarity network for each data type, then iteratively fuses these networks until they converge to a single fused network. SNF can handle a variety of data types, and because the integration happens in the sample space, one can combine them without worrying about differences in scale or noise between the data sets.



Citation

B Wang, A Mezlini, F Demir, M Fiume, Z Tu, M Brudno, B Haibe-Kains, A Goldenberg (2014) Similarity Network Fusion: a fast and effective method to aggregate multiple data types on a genome wide scale. Nature Methods. Online. Jan 26, 2014

Tutorial

File format

The files to be uploaded must be formatted as tables in which rows correspond to samples and columns to the features measured in a specific assay (e.g. gene expression). Sample names must match across the tables; only samples that have measurements in every assay are taken into account.
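The sample-matching rule can be illustrated with a minimal sketch. It assumes each table has already been read into a dictionary that maps a sample name to its list of feature values; the function names `common_samples` and `align` and the example values are hypothetical and are not part of the tool.

```python
def common_samples(tables):
    """Return the sample names present in every table, in the order of the first table."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    return [name for name in tables[0] if name in shared]


def align(tables):
    """Restrict every table to the shared samples, in one common order."""
    keep = common_samples(tables)
    return [{name: table[name] for name in keep} for table in tables]


# Two toy assays: P1 lacks methylation data and P4 lacks expression data.
expression = {"P1": [2.1, 0.3], "P2": [1.8, 0.9], "P3": [0.2, 1.1]}
methylation = {"P2": [0.45], "P3": [0.80], "P4": [0.10]}

aligned = align([expression, methylation])
print(list(aligned[0]))  # ['P2', 'P3']
```

Only P2 and P3 survive here, because they are the only samples measured in both assays.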

Preprocessing of the data

  • One can remove samples and features that have more than a predefined percentage of missing values.
  • Imputation can then be applied to the remaining data (KNN imputation from the impute package). The number of neighbors used in the imputation can be set by the user.
  • Finally, the user can choose to normalize the features so that each has a mean of 0 and a standard deviation of 1.
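The filtering and normalization steps above can be sketched as follows. This is an illustration under stated assumptions, not the tool's own code: missing values are represented as `None`, samples are filtered before features (the order is an assumption), the sample standard deviation (n - 1 denominator) is used, and the function names are hypothetical. KNN imputation is deliberately left out; in the tool it is handled by the impute package and would run between the two functions shown.

```python
import math


def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def filter_missing(rows, max_missing):
    """rows: {sample: [values]}. Drop samples, then features, above the threshold."""
    rows = {s: v for s, v in rows.items() if missing_fraction(v) <= max_missing}
    if not rows:
        return {}
    n_features = len(next(iter(rows.values())))
    keep = [j for j in range(n_features)
            if missing_fraction([v[j] for v in rows.values()]) <= max_missing]
    return {s: [v[j] for j in keep] for s, v in rows.items()}


def standardize(rows):
    """Scale each feature to mean 0 and standard deviation 1, skipping missing values."""
    samples = list(rows)
    out = {s: [] for s in samples}
    for j in range(len(rows[samples[0]])):
        observed = [rows[s][j] for s in samples if rows[s][j] is not None]
        mean = sum(observed) / len(observed)
        sd = 0.0
        if len(observed) > 1:
            sd = math.sqrt(sum((x - mean) ** 2 for x in observed) / (len(observed) - 1))
        for s in samples:
            x = rows[s][j]
            if x is None:
                out[s].append(None)
            else:
                # A constant feature carries no information; map it to 0.
                out[s].append((x - mean) / sd if sd > 0 else 0.0)
    return out


table = {
    "P1": [1.0, None, 3.0],
    "P2": [2.0, None, 5.0],
    "P3": [None, None, None],
    "P4": [3.0, 4.0, 7.0],
}
filtered = filter_missing(table, max_missing=0.5)  # drops P3 and the second feature
scaled = standardize(filtered)
print(scaled)
```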

Choose parameters

  • Number of neighbors in the K-nearest neighbors part of the algorithm, usually between 10 and 30.
  • Hyperparameter used in constructing the similarity network, usually between 0.3 and 0.8.
  • Number of iterations in the diffusion process, usually between 10 and 20.
  • Number of clusters.
  • Percentage of edges to display. The fused similarity matrix produces a fully connected network, so one can choose to display only the top x% of the edges, ranked by similarity weight.
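To show where the first three parameters enter, here is a simplified, dependency-free sketch of the procedure as we understand it from the cited paper. It is not the code the tool runs. The assumptions are: Euclidean distance between samples, the scaled exponential kernel W(i,j) = exp(-d(i,j)^2 / (mu * eps(i,j))), a sample counting as its own nearest neighbor in the local kernel, and at least two data types. All function and variable names are hypothetical.

```python
import math


def affinity(data, k, mu):
    """Scaled exponential similarity between samples of one data type."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]
    # Mean distance from each sample to its k nearest other samples.
    knn_mean = [sum(sorted(d for j, d in enumerate(dist[i]) if j != i)[:k]) / k
                for i in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = (knn_mean[i] + knn_mean[j] + dist[i][j]) / 3.0
            w[i][j] = math.exp(-dist[i][j] ** 2 / (mu * eps))
    return w


def normalize(w):
    """Row-normalize: 1/2 on the diagonal, the other 1/2 shared off the diagonal."""
    n = len(w)
    out = []
    for i in range(n):
        off = sum(w[i][j] for j in range(n) if j != i)
        out.append([0.5 if j == i else w[i][j] / (2.0 * off) for j in range(n)])
    return out


def local_kernel(w, k):
    """Keep only the k most similar samples in each row and normalize over them."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: w[i][j], reverse=True)[:k]
        total = sum(w[i][j] for j in nearest)
        for j in nearest:
            s[i][j] = w[i][j] / total
    return s


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def fuse(affinities, k, iterations):
    """Diffuse each network through the average of the others, then average the results."""
    m, n = len(affinities), len(affinities[0])
    full = [normalize(w) for w in affinities]
    local = [local_kernel(w, k) for w in affinities]
    for _ in range(iterations):
        updated = []
        for v in range(m):
            others = [[sum(full[u][i][j] for u in range(m) if u != v) / (m - 1)
                       for j in range(n)] for i in range(n)]
            s_t = [list(col) for col in zip(*local[v])]
            updated.append(normalize(matmul(matmul(local[v], others), s_t)))
        full = updated
    mean = [[sum(p[i][j] for p in full) / m for j in range(n)] for i in range(n)]
    return [[(mean[i][j] + mean[j][i]) / 2.0 for j in range(n)] for i in range(n)]


# Toy data: six samples, two data types, two obvious groups (0-2 and 3-5).
expr = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
meth = [[1.0], [1.1], [0.9], [3.0], [3.2], [3.1]]
networks = [affinity(expr, k=2, mu=0.5), affinity(meth, k=2, mu=0.5)]
fused = fuse(networks, k=2, iterations=10)
print(fused[0][1] > fused[0][3])  # within-group similarity exceeds between-group
```

The number of clusters is not used here; it only matters in the clustering step that is applied to the fused matrix afterwards.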

Output

Heatmap of the similarity matrix. Samples are ordered by cluster, and a color bar shows the group each sample belongs to.

A table with the estimated number of clusters according to the two heuristics used: 1) eigengap and 2) rotation cost.

Network representation of the similarity matrix. Nodes correspond to samples, and node colors correspond to the clustering groups. Only the top selected interactions (ranked by similarity weight) are displayed.

Download files

The user can download the following files:
  • The fused similarity matrix.
  • The assignment of patients/samples to the different groups.
  • The interactions among samples and their values. This table can be used to generate networks with other tools, e.g. Cytoscape.
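As an illustration of what such an interaction table might contain, the sketch below turns a small similarity matrix into an edge list and keeps only the top x% of edges by weight. The column names `source`, `target`, and `weight`, the function names, and the example values are assumptions made for this example; the file produced by the tool may be laid out differently.

```python
import csv
import io
import math


def top_edges(matrix, names, percent):
    """List each sample pair once and keep the top `percent` of pairs by weight."""
    edges = [(names[i], names[j], matrix[i][j])
             for i in range(len(names)) for j in range(i + 1, len(names))]
    edges.sort(key=lambda edge: edge[2], reverse=True)
    keep = math.ceil(len(edges) * percent / 100.0)
    return edges[:keep]


def edges_to_csv(edges):
    """Serialize the edge list as comma-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)
    return buffer.getvalue()


names = ["P1", "P2", "P3", "P4"]
similarity = [
    [0.50, 0.30, 0.05, 0.02],
    [0.30, 0.50, 0.04, 0.01],
    [0.05, 0.04, 0.50, 0.25],
    [0.02, 0.01, 0.25, 0.50],
]

edges = top_edges(similarity, names, percent=50)  # 3 of the 6 possible edges
print(edges_to_csv(edges))
```

A delimited table with one row per edge is a layout that network tools such as Cytoscape can import.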
