STAGATE – deciphering spatial domains from spatially resolved transcriptomics

Installation (tensorflow1 framework)

Software dependencies

scanpy
tensorflow==1.15.0

The use of the mclust algorithm requires the rpy2 package and the mclust R package. See https://pypi.org/project/rpy2/ and https://cran.r-project.org/web/packages/mclust/index.html for details.
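If you want to verify this setup before running the tutorials, a minimal sketch is shown below. The R_HOME path is only an example and must be adjusted to your own R installation.

import os

# Example path only; point R_HOME to your local R installation
os.environ['R_HOME'] = r'D:\Program Files\R\R-4.0.3'

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# importr raises an error if the mclust R package is not installed
mclust = importr('mclust')
print(robjects.r('packageVersion("mclust")'))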

Installation

Download the STAGATE code from https://github.com/QIFEIDKN/STGATE

cd STAGATE-main
python setup.py build
python setup.py install
import STAGATE

Installation (pyG framework)

Software dependencies

scanpy
pytorch
pyG

The use of the mclust algorithm requires the rpy2 package and the mclust R package. See https://pypi.org/project/rpy2/ and https://cran.r-project.org/web/packages/mclust/index.html for details.

The cell type-aware module is not yet supported by STAGATE_pyG.

Installation

Download the STAGATE_pyG code from https://github.com/QIFEIDKN/STGATE_pyG

cd STAGATE_pyG-main
python setup.py build
python setup.py install
import STAGATE_pyG

Tutorial 1: 10x Visium (DLPFC dataset)

Here we present our re-analysis of sample 151676 of the dorsolateral prefrontal cortex (DLPFC) dataset. Maynard et al. manually annotated the DLPFC layers and white matter (WM) based on morphological features and gene markers.

This tutorial demonstrates how to identify spatial domains on 10x Visium data using STAGATE. The processed data are available at https://github.com/LieberInstitute/spatialLIBD. We downloaded the manual annotation from the spatialLIBD package and provide it at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.

Preparation

[1]:
import warnings
warnings.filterwarnings("ignore")
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
from sklearn.metrics.cluster import adjusted_rand_score
[4]:
import STAGATE
[5]:
# the location of R (used for the mclust clustering)
os.environ['R_HOME'] = r'D:\Program Files\R\R-4.0.3'
os.environ['R_USER'] = r'D:\ProgramData\Anaconda3\Lib\site-packages\rpy2'
[6]:
section_id = '151676'
[7]:
input_dir = os.path.join('Data', section_id)
adata = sc.read_visium(path=input_dir, count_file=section_id+'_filtered_feature_bc_matrix.h5')
adata.var_names_make_unique()
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 3460 × 33538
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
[9]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
[10]:
# read the annotation
Ann_df = pd.read_csv(os.path.join('Data', section_id, section_id+'_truth.txt'), sep='\t', header=None, index_col=0)
Ann_df.columns = ['Ground Truth']
[11]:
adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
[12]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, img_key="hires", color=["Ground Truth"])
... storing 'Ground Truth' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical
_images/T1_DLPFC_14_1.png

Constructing the spatial network

[13]:
STAGATE.Cal_Spatial_Net(adata, rad_cutoff=150)
STAGATE.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 20052 edges, 3460 cells.
5.7954 neighbors per cell on average.
_images/T1_DLPFC_16_1.png

Running STAGATE

[14]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (3460, 3000)
WARNING:tensorflow:From D:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [00:28<00:00, 17.60it/s]
[15]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
adata = STAGATE.mclust_R(adata, used_obsm='STAGATE', num_cluster=7)
[16]:
obs_df = adata.obs.dropna()
ARI = adjusted_rand_score(obs_df['mclust'], obs_df['Ground Truth'])
print('Adjusted rand index = %.2f' %ARI)
Adjusted rand index = 0.60
[17]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color=["mclust", "Ground Truth"], title=['STAGATE (ARI=%.2f)'%ARI, "Ground Truth"])
_images/T1_DLPFC_21_0.png
[18]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, color=["mclust", "Ground Truth"], title=['STAGATE (ARI=%.2f)'%ARI, "Ground Truth"])
_images/T1_DLPFC_22_0.png

Spatial trajectory inference (PAGA)

[19]:
used_adata = adata[adata.obs['Ground Truth']!='nan',]
used_adata
[19]:
View of AnnData object with n_obs × n_vars = 3431 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'Ground Truth', 'mclust'
    var: 'gene_ids', 'feature_types', 'genome', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'spatial', 'hvg', 'log1p', 'Ground Truth_colors', 'Spatial_Net', 'neighbors', 'umap', 'mclust_colors'
    obsm: 'spatial', 'STAGATE', 'X_umap'
    obsp: 'distances', 'connectivities'
[20]:
sc.tl.paga(used_adata, groups='Ground Truth')
Trying to set attribute `.uns` of view, copying.
[21]:
plt.rcParams["figure.figsize"] = (4,3)
sc.pl.paga_compare(used_adata, legend_fontsize=10, frameon=False, size=20,
                   title=section_id+'_STAGATE', legend_fontoutline=2, show=False)
[21]:
[<matplotlib.axes._axes.Axes at 0x183007d1e48>,
 <matplotlib.axes._axes.Axes at 0x18300411f98>]
_images/T1_DLPFC_26_1.png

Tutorial 2: The usage of cell type-aware module for 10x Visium

To better characterize the spatial similarity at the boundaries of spatial domains, STAGATE adopts an attention mechanism to adaptively learn the similarity of neighboring spots, together with an optional cell type-aware module that integrates a pre-clustering of gene expression.

This tutorial demonstrates how to identify spatial domains on 10x Visium data using STAGATE with the cell type-aware module.

In this tutorial, we focus on the Adult Mouse Brain Section 1 (Coronal) dataset from the 10x Genomics website (https://www.10xgenomics.com/resources/datasets/adult-mouse-brain-section-1-coronal-stains-dapi-anti-neu-n-1-standard-1-1-0).

Preparation

[1]:
import warnings
warnings.filterwarnings("ignore")
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
import STAGATE
[4]:
adata = sc.read_visium(path='Data', count_file='V1_Adult_Mouse_Brain_Coronal_Section_1_filtered_feature_bc_matrix.h5')
adata.var_names_make_unique()
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[5]:
adata
[5]:
AnnData object with n_obs × n_vars = 2903 × 32285
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
[6]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
[7]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, img_key="hires")
... storing 'feature_types' as categorical
... storing 'genome' as categorical
_images/T2_MouseBrain_9_1.png

Constructing the spatial network

[8]:
STAGATE.Cal_Spatial_Net(adata, rad_cutoff=300)
STAGATE.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 16982 edges, 2903 cells.
5.8498 neighbors per cell on average.
_images/T2_MouseBrain_11_1.png

Running STAGATE with cell type-aware module

The parameter alpha is the weight of the cell type-aware spatial network, and pre_resolution is the resolution parameter used for pre-clustering.
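Conceptually, the cell type-aware module uses the pre-clustering labels to prune the spatial network before computing attention, which is the pruning step reported in the output below. The following is only a rough sketch of that idea, not the internal train_STAGATE implementation; it assumes the Spatial_Net DataFrame produced by Cal_Spatial_Net has 'Cell1', 'Cell2' and 'Distance' columns and that pre-clustering labels are stored per spot.

import pandas as pd

def prune_spatial_net(spatial_net, pre_labels):
    # Keep only edges whose two endpoints share the same pre-clustering label
    net = spatial_net.copy()
    net['Label1'] = net['Cell1'].map(pre_labels)
    net['Label2'] = net['Cell2'].map(pre_labels)
    pruned = net.loc[net['Label1'] == net['Label2'], ['Cell1', 'Cell2', 'Distance']]
    print('%d edges before pruning, %d after.' % (net.shape[0], pruned.shape[0]))
    return pruned

# usage sketch (after the pre-clustering labels have been stored in adata.obs):
# pruned_net = prune_spatial_net(adata.uns['Spatial_Net'], adata.obs['expression_louvain_label'])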

[9]:
adata = STAGATE.train_STAGATE(adata, alpha=0.5, pre_resolution=0.2,
                              n_epochs=1000, save_attention=True)
Size of Input:  (2903, 3000)
WARNING:tensorflow:From D:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
------Pre-clustering using louvain with resolution=0.20
  0%|                                                                                         | 0/1000 [00:00<?, ?it/s]
------Pruning the graph...
16982 edges before pruning.
15058 edges after pruning.
100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:54<00:00, 18.26it/s]
[10]:
# pre-clustering result
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, img_key="hires", color="expression_louvain_label", size=1.5, title='pre-clustering result')
_images/T2_MouseBrain_15_0.png
[11]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
sc.tl.louvain(adata, resolution=1)
[12]:
adata.uns['louvain_colors']=['#aec7e8', '#9edae5', '#d62728', '#dbdb8d', '#ff9896',
                             '#8c564b', '#696969', '#778899', '#17becf', '#ffbb78',
                             '#e377c2', '#98df8a', '#aa40fc', '#c5b0d5', '#c49c94',
                             '#f7b6d2', '#279e68', '#b5bd61', '#ad494a', '#8c6d31',
                             '#1f77b4', '#ff7f0e']
[13]:
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.spatial(adata, img_key="hires", color="louvain", size=1.5)
_images/T2_MouseBrain_18_0.png
[14]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color="louvain", legend_loc='on data', s=10)
_images/T2_MouseBrain_19_0.png

Visualization of the attention layer

[15]:
import matplotlib as mpl
import networkx as nx
[16]:
att_df = pd.DataFrame(adata.uns['STAGATE_attention'][0].toarray(), index=adata.obs_names, columns=adata.obs_names)
att_df = att_df.values
for it in range(att_df.shape[0]):
    att_df[it, it] = 0
[17]:
G_atten = nx.from_numpy_matrix(att_df)
M = G_atten.number_of_edges()
edge_colors = range(2, M + 2)
[18]:
coor_df = pd.DataFrame(adata.obsm['spatial'].copy(), index=adata.obs_names)
coor_df[1] = -1 * coor_df[1]
image_pos = dict(zip(range(coor_df.shape[0]), [np.array(coor_df.iloc[it,]) for it in range(coor_df.shape[0])]))
[19]:
labels = nx.get_edge_attributes(G_atten,'weight')
[20]:
fig, ax = plt.subplots(figsize=[9,10])
nx.draw_networkx_nodes(G_atten, image_pos, node_size=5, ax=ax)
cmap = plt.cm.plasma
edges = nx.draw_networkx_edges(G_atten, image_pos, edge_color=labels.values(),width=4, ax=ax,
                               edge_cmap=cmap,edge_vmax=0.25,edge_vmin=0.05)
ax = plt.gca()

sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin = 0.05, vmax=0.25))
sm._A = []
plt.colorbar(sm)

ax.set_axis_off()
_images/T2_MouseBrain_26_0.png

Running STAGATE without cell type-aware module (for comparison)

[21]:
adata = STAGATE.train_STAGATE(adata, alpha=0, n_epochs=1000, save_attention=True)
Size of Input:  (2903, 3000)
100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:47<00:00, 20.97it/s]
[22]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
sc.tl.louvain(adata, resolution=1)
[23]:
adata.uns['louvain_colors']=['#aec7e8', '#9edae5', '#d62728', '#dbdb8d', '#ff9896',
                             '#8c564b', '#696969', '#778899', '#17becf', '#ffbb78',
                             '#e377c2', '#98df8a', '#aa40fc', '#c5b0d5', '#c49c94',
                             '#f7b6d2', '#279e68', '#b5bd61', '#ad494a', '#8c6d31',
                             '#1f77b4', '#ff7f0e']
[24]:
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.spatial(adata, img_key="hires", color="louvain", size=1.5)
_images/T2_MouseBrain_31_0.png
[25]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color="louvain", legend_loc='on data', s=10)
_images/T2_MouseBrain_32_0.png

Visualization of the attention layer (without cell type-aware module)

[26]:
att_df = pd.DataFrame(adata.uns['STAGATE_attention'][0].toarray(), index=adata.obs_names, columns=adata.obs_names)
att_df = att_df.values
for it in range(att_df.shape[0]):
    att_df[it, it] = 0
[27]:
G_atten = nx.from_numpy_matrix(att_df)
M = G_atten.number_of_edges()
edge_colors = range(2, M + 2)
[28]:
coor_df = pd.DataFrame(adata.obsm['spatial'].copy(), index=adata.obs_names)
coor_df[1] = -1 * coor_df[1]
image_pos = dict(zip(range(coor_df.shape[0]), [np.array(coor_df.iloc[it,]) for it in range(coor_df.shape[0])]))
[29]:
labels = nx.get_edge_attributes(G_atten,'weight')
[30]:
fig, ax = plt.subplots(figsize=[9,10])
nx.draw_networkx_nodes(G_atten, image_pos, node_size=5, ax=ax)
cmap = plt.cm.plasma
edges = nx.draw_networkx_edges(G_atten, image_pos, edge_color=labels.values(),width=4, ax=ax,
                               edge_cmap=cmap,edge_vmax=0.25,edge_vmin=0.05)
ax = plt.gca()

sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin = 0.05, vmax=0.25))
sm._A = []
plt.colorbar(sm)

ax.set_axis_off()
_images/T2_MouseBrain_38_0.png

Tutorial 3: Slide-seqV2 mouse olfactory bulb

This tutorial demonstrates how to identify spatial domains on Slide-seqV2 data.

In this tutorial, we focus on the Slide-seqV2 mouse olfactory bulb data (Puck_200127_15 from https://singlecell.broadinstitute.org/single_cell/study/SCP815/highly-sensitive-spatial-transcriptomics-at-near-cellular-resolution-with-slide-seqv2#study-summary).

We removed spots outside the main tissue area; the barcodes of the retained spots can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.

Preparation

[1]:
import warnings
warnings.filterwarnings("ignore")
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
import STAGATE
[4]:
input_dir = 'Data'
counts_file = os.path.join(input_dir, 'Puck_200127_15.digital_expression.txt')
coor_file = os.path.join(input_dir, 'Puck_200127_15_bead_locations.csv')
[5]:
counts = pd.read_csv(counts_file, sep='\t', index_col=0)
coor_df = pd.read_csv(coor_file, index_col=0)
print(counts.shape, coor_df.shape)
(21220, 21724) (21724, 2)
[6]:
adata = sc.AnnData(counts.T)
adata.var_names_make_unique()
coor_df = coor_df.loc[adata.obs_names, ['xcoord', 'ycoord']]
adata.obsm["spatial"] = coor_df.to_numpy()
[7]:
sc.pp.calculate_qc_metrics(adata, inplace=True)
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 21724 × 21220
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[9]:
plt.rcParams["figure.figsize"] = (6,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=6, show=False)
plt.title('')
plt.axis('off')
[9]:
(-289.81710000000004, 6312.7151, 173.30850000000004, 5709.8615)
_images/T3_Slide-seqV2_11_1.png
[ ]:
# can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing
used_barcode = pd.read_csv('Data/used_barcodes.txt', sep='\t', header=None)
used_barcode = used_barcode[0]
[11]:
adata = adata[used_barcode,]
[12]:
plt.rcParams["figure.figsize"] = (5,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=10, show=False, title='Removing spots outside the main tissue area')

plt.axis('off')
[12]:
(588.545, 5108.555, 847.6700000000001, 5670.73)
_images/T3_Slide-seqV2_14_1.png
[13]:
sc.pp.filter_genes(adata, min_cells=50)
print('After filtering: ', adata.shape)
Trying to set attribute `.var` of view, copying.
After filtering:  (20139, 11750)
[14]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Constructing the spatial network

[15]:
STAGATE.Cal_Spatial_Net(adata, rad_cutoff=50)
STAGATE.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 228288 edges, 20139 cells.
11.3356 neighbors per cell on average.
_images/T3_Slide-seqV2_18_1.png

Running STAGATE

[16]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (20139, 3000)
WARNING:tensorflow:From D:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [14:09<00:00,  1.70s/it]
[17]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[18]:
sc.tl.louvain(adata, resolution=0.5)
[19]:
adata.obsm["spatial"] = adata.obsm["spatial"] * (-1)
[20]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.embedding(adata, basis="spatial", color="louvain",s=6, show=False, title='STAGATE')
plt.axis('off')
[20]:
(-5108.555, -588.545, -5670.73, -847.6700000000001)
_images/T3_Slide-seqV2_24_1.png
[21]:
sc.pl.umap(adata, color='louvain', title='STAGATE')
_images/T3_Slide-seqV2_25_0.png

SCANPY results (for comparison)

[22]:
sc.pp.pca(adata, n_comps=30)
[23]:
sc.pp.neighbors(adata, use_rep='X_pca')
sc.tl.louvain(adata, resolution=0.5)
sc.tl.umap(adata)
[24]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.embedding(adata, basis="spatial", color="louvain",s=6, show=False, title='SCANPY')
plt.axis('off')
[24]:
(-5108.555, -588.545, -5670.73, -847.6700000000001)
_images/T3_Slide-seqV2_29_1.png
[25]:
sc.pl.umap(adata, color='louvain', title='SCANPY')
_images/T3_Slide-seqV2_30_0.png

Tutorial 4: Stereo-seq mouse olfactory bulb

This tutorial demonstrates how to identify spatial domains on Stereo-seq data.

In this tutorial, we focus on the Stereo-seq mouse olfactory bulb data (https://github.com/JinmiaoChenLab/SEDR_analyses/).

We removed spots outside the main tissue area; the barcodes of the retained spots can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.

Preparation

[1]:
import warnings
warnings.filterwarnings("ignore")
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
import STAGATE
[4]:
counts_file = os.path.join('Data/RNA_counts.tsv')
coor_file = os.path.join('Data/position.tsv')
[5]:
counts = pd.read_csv(counts_file, sep='\t', index_col=0)
coor_df = pd.read_csv(coor_file, sep='\t')
print(counts.shape, coor_df.shape)
(27106, 19527) (19527, 3)
[6]:
counts.columns = ['Spot_'+str(x) for x in counts.columns]
coor_df.index = coor_df['label'].map(lambda x: 'Spot_'+str(x))
coor_df = coor_df.loc[:, ['x','y']]
[7]:
coor_df.head()
[7]:
x y
label
Spot_1 12555.007833 6307.537859
Spot_2 12623.666667 6297.166667
Spot_3 12589.567164 6302.552239
Spot_4 12642.495050 6307.386139
Spot_5 13003.333333 6307.990991
[8]:
adata = sc.AnnData(counts.T)
adata.var_names_make_unique()
[9]:
adata
[9]:
AnnData object with n_obs × n_vars = 19527 × 27106
[10]:
coor_df = coor_df.loc[adata.obs_names, ['y', 'x']]
adata.obsm["spatial"] = coor_df.to_numpy()
sc.pp.calculate_qc_metrics(adata, inplace=True)
[11]:
plt.rcParams["figure.figsize"] = (5,4)
sc.pl.embedding(adata, basis="spatial", color="n_genes_by_counts", show=False)
plt.title("")
plt.axis('off')
[11]:
(6002.432692307693, 12486.580128205129, 9908.545833333334, 15086.093055555555)
_images/T4_Stereo_13_1.png
[12]:
used_barcode = pd.read_csv(os.path.join('Data/used_barcodes.txt'), sep='\t', header=None)
used_barcode = used_barcode[0]
adata = adata[used_barcode,]
[13]:
adata
[13]:
View of AnnData object with n_obs × n_vars = 19109 × 27106
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[14]:
plt.rcParams["figure.figsize"] = (5,4)
sc.pl.embedding(adata, basis="spatial", color="n_genes_by_counts", show=False)
plt.title("")
plt.axis('off')
[14]:
(6005.190789473685, 12428.6600877193, 9986.774763741741, 15062.302776957436)
_images/T4_Stereo_16_1.png
[15]:
sc.pp.filter_genes(adata, min_cells=50)
print('After filtering: ', adata.shape)
Trying to set attribute `.var` of view, copying.
After filtering:  (19109, 14376)
[16]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Constructing the spatial network

[17]:
STAGATE.Cal_Spatial_Net(adata, rad_cutoff=50)
STAGATE.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 144318 edges, 19109 cells.
7.5524 neighbors per cell on average.
_images/T4_Stereo_20_1.png

Running STAGATE

[18]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (19109, 3000)
WARNING:tensorflow:From D:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [12:33<00:00,  1.51s/it]
[19]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[20]:
sc.tl.louvain(adata, resolution=0.8)
[21]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.embedding(adata, basis="spatial", color="louvain",s=6, show=False, title='STAGATE')
plt.axis('off')
[21]:
(6005.190789473685, 12428.6600877193, 9986.774763741741, 15062.302776957436)
_images/T4_Stereo_25_1.png
[22]:
sc.pl.umap(adata, color='louvain', title='STAGATE')
_images/T4_Stereo_26_0.png

SCANPY results (for comparison)

[23]:
sc.pp.pca(adata, n_comps=30)
[24]:
sc.pp.neighbors(adata, use_rep='X_pca')
sc.tl.louvain(adata, resolution=0.8)
sc.tl.umap(adata)
[25]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.embedding(adata, basis="spatial", color="louvain",s=6, show=False, title='SCANPY')
plt.axis('off')
[25]:
(6005.190789473685, 12428.6600877193, 9986.774763741741, 15062.302776957436)
_images/T4_Stereo_30_1.png
[26]:
sc.pl.umap(adata, color='louvain', title='SCANPY')
_images/T4_Stereo_31_0.png
[ ]:

Tutorial 5: 3D spatial domain identification

We extended STAGATE for 3D spatial domain identification by simultaneously considering the 2D SNN within each section and the neighboring spots between adjacent sections. The key idea of the 3D SNN is that biological differences between consecutive sections should be continuous, so enhancing the similarity between adjacent sections helps eliminate discontinuous, section-specific technical noise.

In this tutorial, we apply STAGATE to a pseudo-3D ST dataset constructed by aligning the spots of the “cord-like” structure in seven hippocampus sections profiled by Slide-seq.

The original Slide-seq data can be downloaded from https://portals.broadinstitute.org/single_cell/study/slide-seq-study. We provide the processed data and aligned coordinates at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.
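As a rough illustration of how neighboring spots between two adjacent sections might be linked on the aligned coordinates, a conceptual sketch based on a simple radius cutoff is given below. This is only an illustration of the idea, not the Cal_Spatial_Net_3D implementation used later in this tutorial.

import pandas as pd
from sklearn.neighbors import NearestNeighbors

def link_adjacent_sections(coor_a, coor_b, rad_cutoff=50):
    # coor_a / coor_b: DataFrames with aligned ['X', 'Y'] coordinates,
    # indexed by spot barcodes, from two adjacent sections
    nbrs = NearestNeighbors(radius=rad_cutoff).fit(coor_b[['X', 'Y']].values)
    _, idx = nbrs.radius_neighbors(coor_a[['X', 'Y']].values)
    edges = [(coor_a.index[i], coor_b.index[j])
             for i in range(len(idx)) for j in idx[i]]
    return pd.DataFrame(edges, columns=['Cell1', 'Cell2'])

# usage sketch (after loading the aligned coordinates into adata.obs):
# coor = adata.obs[['X', 'Y', 'Section_id']]
# edges_13_16 = link_adjacent_sections(coor[coor['Section_id'] == 'Puck_180531_13'],
#                                      coor[coor['Section_id'] == 'Puck_180531_16'])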

Preparation

[1]:
import warnings
warnings.filterwarnings("ignore")
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
import STAGATE
[4]:
data = pd.read_csv('Data/3D_Hippo_expression.txt', sep='\t', index_col=0)
Aligned_coor = pd.read_csv('Data/ICP_Align_Coor.txt', sep='\t', index_col=0)
[5]:
Aligned_coor.head()
[5]:
X Y Z Section
AAAAACAACCAAT-13 3946.072727 3599.436364 6000 Puck_180531_13
AAAAGTCCATTAT-13 2401.628788 3550.333333 6000 Puck_180531_13
AAACAACGCGCGA-13 2921.968153 3353.687898 6000 Puck_180531_13
AAACAATTCAATA-13 2711.556338 3365.542254 6000 Puck_180531_13
AAACACGCTGCCC-13 2351.863354 4265.447205 6000 Puck_180531_13
[6]:
adata = sc.AnnData(data)
adata
[6]:
AnnData object with n_obs × n_vars = 10908 × 9420
[7]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
[8]:
# loading metadata and aligned coordinates
adata.obs['X'] = Aligned_coor.loc[adata.obs_names, 'X']
adata.obs['Y'] = Aligned_coor.loc[adata.obs_names, 'Y']
adata.obs['Z'] = Aligned_coor.loc[adata.obs_names, 'Z']
adata.obs['Section_id'] = Aligned_coor.loc[adata.obs_names, 'Section']
[9]:
# loading the spatial locations
adata.obsm['spatial'] = adata.obs.loc[:, ['X', 'Y']].values
[10]:
section_colors = ['#02899A', '#0E994D', '#86C049', '#FBB21F', '#F48022', '#DA5326', '#BA3326']
[11]:
fig = plt.figure(figsize=(4, 4))
ax1 = plt.axes(projection='3d')
for it, label in enumerate(np.unique(adata.obs['Section_id'])):
    temp_Coor = adata.obs.loc[adata.obs['Section_id']==label, :]
    temp_xd = temp_Coor['X']
    temp_yd = temp_Coor['Y']
    temp_zd = temp_Coor['Z']
    ax1.scatter3D(temp_xd, temp_yd, temp_zd, c=section_colors[it],s=0.2, marker="o", label=label)

ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_zlabel('')

ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax1.set_zticklabels([])

plt.legend(bbox_to_anchor=(1,0.8), markerscale=10, frameon=False)

ax1.elev = 45
ax1.azim = -20

plt.show()
_images/T5_3D_13_0.png

Constructing 3D spatial networks

[12]:
section_order = ['Puck_180531_13', 'Puck_180531_16', 'Puck_180531_17',
                 'Puck_180531_18', 'Puck_180531_19', 'Puck_180531_22',
                 'Puck_180531_23']
[13]:
STAGATE.Cal_Spatial_Net_3D(adata, rad_cutoff_2D=50, rad_cutoff_Zaxis=50,
                           key_section='Section_id', section_order = section_order, verbose=True)
Radius used for 2D SNN: 50
Radius used for SNN between sections: 50
------Calculating 2D SNN of section  Puck_180531_13
Trying to set attribute `.uns` of view, copying.
This graph contains 22974 edges, 1692 cells.
13.5780 neighbors per cell on average.
------Calculating 2D SNN of section  Puck_180531_16
Trying to set attribute `.uns` of view, copying.
This graph contains 20546 edges, 1476 cells.
13.9201 neighbors per cell on average.
------Calculating 2D SNN of section  Puck_180531_17
Trying to set attribute `.uns` of view, copying.
This graph contains 11794 edges, 1138 cells.
10.3638 neighbors per cell on average.
------Calculating 2D SNN of section  Puck_180531_18
Trying to set attribute `.uns` of view, copying.
This graph contains 18278 edges, 1363 cells.
13.4101 neighbors per cell on average.
------Calculating 2D SNN of section  Puck_180531_19
Trying to set attribute `.uns` of view, copying.
This graph contains 22344 edges, 1788 cells.
12.4966 neighbors per cell on average.
------Calculating 2D SNN of section  Puck_180531_22
Trying to set attribute `.uns` of view, copying.
This graph contains 24544 edges, 1835 cells.
13.3755 neighbors per cell on average.
------Calculating 2D SNN of section  Puck_180531_23
Trying to set attribute `.uns` of view, copying.
This graph contains 18922 edges, 1616 cells.
11.7092 neighbors per cell on average.
------Calculating SNN between adjacent section Puck_180531_13 and Puck_180531_16.
Trying to set attribute `.uns` of view, copying.
This graph contains 34482 edges, 3168 cells.
10.8845 neighbors per cell on average.
------Calculating SNN between adjacent section Puck_180531_16 and Puck_180531_17.
Trying to set attribute `.uns` of view, copying.
This graph contains 23078 edges, 2614 cells.
8.8286 neighbors per cell on average.
------Calculating SNN between adjacent section Puck_180531_17 and Puck_180531_18.
Trying to set attribute `.uns` of view, copying.
This graph contains 25284 edges, 2501 cells.
10.1096 neighbors per cell on average.
------Calculating SNN between adjacent section Puck_180531_18 and Puck_180531_19.
Trying to set attribute `.uns` of view, copying.
This graph contains 24944 edges, 3151 cells.
7.9162 neighbors per cell on average.
------Calculating SNN between adjacent section Puck_180531_19 and Puck_180531_22.
Trying to set attribute `.uns` of view, copying.
This graph contains 26128 edges, 3623 cells.
7.2117 neighbors per cell on average.
------Calculating SNN between adjacent section Puck_180531_22 and Puck_180531_23.
Trying to set attribute `.uns` of view, copying.
This graph contains 21110 edges, 3451 cells.
6.1171 neighbors per cell on average.
3D SNN contains 294428 edges, 10908 cells.
26.9919 neighbors per cell on average.

Running STAGATE with 3D spatial networks

[14]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (10908, 3000)
WARNING:tensorflow:From /home/xzhou/anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2022-04-29 21:26:31.725355: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2022-04-29 21:26:31.748905: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499910000 Hz
2022-04-29 21:26:31.750571: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55cb33581890 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-04-29 21:26:31.750614: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-04-29 21:26:31.754849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-04-29 21:26:32.055109: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55cb35070bb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-04-29 21:26:32.055162: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2022-04-29 21:26:32.055175: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2022-04-29 21:26:32.057154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:65:00.0
2022-04-29 21:26:32.057729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:b3:00.0
2022-04-29 21:26:32.057850: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.057894: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.057927: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.057959: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.057992: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.058026: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.058057: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2022-04-29 21:26:32.058061: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-04-29 21:26:32.058076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-29 21:26:32.058080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1
2022-04-29 21:26:32.058083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N N
2022-04-29 21:26:32.058086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   N N
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [08:42<00:00,  1.05s/it]
[15]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[16]:
num_cluster = 4
adata = STAGATE.mclust_R(adata, num_cluster, used_obsm='STAGATE')
R[write to console]: Package 'mclust' version 5.4.7
Type 'citation("mclust")' for citing this R package in publications.

[17]:
adata.uns['Section_id_colors'] = ['#02899A', '#0E994D', '#86C049', '#FBB21F', '#F48022', '#DA5326', '#BA3326']
[18]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color=['mclust', 'Section_id'])
... storing 'Section_id' as categorical
_images/T5_3D_22_1.png
[19]:
fig = plt.figure(figsize=(4, 4))
ax1 = plt.axes(projection='3d')
for it, label in enumerate(np.unique(adata.obs['mclust'])):
    temp_Coor = adata.obs.loc[adata.obs['mclust']==label, :]
    temp_xd = temp_Coor['X']
    temp_yd = temp_Coor['Y']
    temp_zd = temp_Coor['Z']
    ax1.scatter3D(temp_xd, temp_yd, temp_zd, c=adata.uns['mclust_colors'][it],s=0.2, marker="o", label=label)

ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_zlabel('')

ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax1.set_zticklabels([])

plt.legend(bbox_to_anchor=(1.2,0.8), markerscale=10, frameon=False)
plt.title('STAGATE-3D')

ax1.elev = 45
ax1.azim = -20

plt.show()
_images/T5_3D_23_0.png

Running STAGATE with 2D spatial networks (for comparison)

The 2D spatial networks are saved in adata.uns['Spatial_Net_2D']. For comparison, we replace the 3D spatial network with the 2D spatial networks and run STAGATE.

[20]:
adata_2D = adata.copy()
[21]:
adata_2D.uns['Spatial_Net'] = adata.uns['Spatial_Net_2D'].copy()
[22]:
adata_2D = STAGATE.train_STAGATE(adata_2D, alpha=0)
Size of Input:  (10908, 3000)
2022-04-29 21:36:00.700666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:65:00.0
2022-04-29 21:36:00.701352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:b3:00.0
2022-04-29 21:36:00.701516: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701577: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701612: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701646: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701681: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701713: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701747: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2022-04-29 21:36:00.701752: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-04-29 21:36:00.701779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-29 21:36:00.701783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1
2022-04-29 21:36:00.701787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N N
2022-04-29 21:36:00.701789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   N N
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [08:15<00:00,  1.01it/s]
[23]:
sc.pp.neighbors(adata_2D, use_rep='STAGATE')
sc.tl.umap(adata_2D)
[24]:
num_cluster = 4
adata_2D = STAGATE.mclust_R(adata_2D, num_cluster, used_obsm='STAGATE')
[25]:
adata_2D.uns['Section_id_colors'] = ['#02899A', '#0E994D', '#86C049', '#FBB21F', '#F48022', '#DA5326', '#BA3326']
[26]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata_2D, color=['mclust', 'Section_id'])
_images/T5_3D_32_0.png
[27]:
fig = plt.figure(figsize=(4, 4))
ax1 = plt.axes(projection='3d')
for it, label in enumerate(np.unique(adata_2D.obs['mclust'])):
    temp_Coor = adata_2D.obs.loc[adata_2D.obs['mclust']==label, :]
    temp_xd = temp_Coor['X']
    temp_yd = temp_Coor['Y']
    temp_zd = temp_Coor['Z']
    ax1.scatter3D(temp_xd, temp_yd, temp_zd, c=adata_2D.uns['mclust_colors'][it],s=0.2, marker="o", label=label)

ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_zlabel('')

ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax1.set_zticklabels([])

plt.legend(bbox_to_anchor=(1.2,0.8), markerscale=10, frameon=False)
plt.title('STAGATE-2D')

ax1.elev = 45
ax1.azim = -20

plt.show()
_images/T5_3D_33_0.png

STAGATE-2D vs STAGATE-3D

[28]:
fig, axs = plt.subplots(2, 2, figsize=(8,8))
sc.pl.umap(adata_2D, color='mclust', title='mclust (STAGATE-2D)', show=False, ax=axs[0,0])
sc.pl.umap(adata_2D, color='Section_id', title='Section_id (STAGATE-2D)', show=False, ax=axs[0,1])
sc.pl.umap(adata, color='mclust', title='mclust (STAGATE-3D)', show=False, ax=axs[1,0])
sc.pl.umap(adata, color='Section_id', title='Section_id (STAGATE-3D)', show=False, ax=axs[1,1])
[28]:
<AxesSubplot:title={'center':'Section_id (STAGATE-3D)'}, xlabel='UMAP1', ylabel='UMAP2'>
_images/T5_3D_35_1.png
[ ]:

Tutorial 6: Denoising

This tutorial demonstrates how to denoise gene expression with STAGATE. The data used here are the same as in Tutorial 1 (sample 151676 of the DLPFC dataset).

Preparation

[1]:
import warnings
warnings.filterwarnings("ignore")
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[3]:
import STAGATE
[4]:
section_id = '151676'
[5]:
input_dir = os.path.join('Data', section_id)
adata = sc.read_visium(path=input_dir, count_file=section_id+'_filtered_feature_bc_matrix.h5')
adata.var_names_make_unique()
sc.pp.calculate_qc_metrics(adata, inplace=True)
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[6]:
adata
[6]:
AnnData object with n_obs × n_vars = 3460 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'spatial'
    obsm: 'spatial'
[7]:
adata = adata[:,adata.var['total_counts']>100]
adata
[7]:
View of AnnData object with n_obs × n_vars = 3460 × 10747
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'spatial'
    obsm: 'spatial'
[8]:
#Normalization
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
[9]:
# read the annotation
Ann_df = pd.read_csv(os.path.join('Data', section_id, section_id+'_truth.txt'), sep='\t', header=None, index_col=0)
Ann_df.columns = ['Ground Truth']
[10]:
adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
[11]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.spatial(adata, img_key="hires", color=["Ground Truth"])
... storing 'Ground Truth' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical
_images/T6_Denoising_13_1.png

Constructing the spatial network

[12]:
STAGATE.Cal_Spatial_Net(adata, rad_cutoff=150)
STAGATE.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 20052 edges, 3460 cells.
5.7954 neighbors per cell on average.
_images/T6_Denoising_15_1.png

Denoising expression using STAGATE

When performing imputation and denoising, set save_reconstrction=True. The reconstructed expressions are saved in adata.layers['STAGATE_ReX'].

[13]:
adata = STAGATE.train_STAGATE(adata, alpha=0, save_reconstrction=True)
Size of Input:  (3460, 10747)
WARNING:tensorflow:From D:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [00:51<00:00,  9.69it/s]
[14]:
adata
[14]:
AnnData object with n_obs × n_vars = 3460 × 10747
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'Ground Truth'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'spatial', 'log1p', 'Ground Truth_colors', 'Spatial_Net'
    obsm: 'spatial', 'STAGATE'
    layers: 'STAGATE_ReX'
[15]:
plot_gene = 'ATP2B4'
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
[15]:
[<AxesSubplot:title={'center':'STAGATE_ATP2B4'}, xlabel='spatial1', ylabel='spatial2'>]
_images/T6_Denoising_20_1.png
[16]:
plot_gene = 'RASGRF2'
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
[16]:
[<AxesSubplot:title={'center':'STAGATE_RASGRF2'}, xlabel='spatial1', ylabel='spatial2'>]
_images/T6_Denoising_21_1.png
[17]:
plot_gene = 'LAMP5'
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
[17]:
[<AxesSubplot:title={'center':'STAGATE_LAMP5'}, xlabel='spatial1', ylabel='spatial2'>]
_images/T6_Denoising_22_1.png
[18]:
plot_gene = 'NEFH'
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
[18]:
[<AxesSubplot:title={'center':'STAGATE_NEFH'}, xlabel='spatial1', ylabel='spatial2'>]
_images/T6_Denoising_23_1.png
[19]:
plot_gene = 'NTNG2'
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
[19]:
[<AxesSubplot:title={'center':'STAGATE_NTNG2'}, xlabel='spatial1', ylabel='spatial2'>]
_images/T6_Denoising_24_1.png
[20]:
plot_gene = 'B3GALT2'
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
[20]:
[<AxesSubplot:title={'center':'STAGATE_B3GALT2'}, xlabel='spatial1', ylabel='spatial2'>]
_images/T6_Denoising_25_1.png

Tutorial 7: Slide-seqV2 mouse olfactory bulb (pyG framework)

This tutorial demonstrates how to identify spatial domains on Slide-seqV2 data using STAGATE based on the pyG (PyTorch Geometric) framework.

Benefiting from the pyG package's optimizations for training graph neural networks, it is more than 10x faster than STAGATE based on the tensorflow1 framework and can use a batch training strategy to handle large-scale data (see Tutorial 8).

The download and installation of the STAGATE package based on the pyG framework can be found at https://github.com/zhanglabtools/STAGATE_pyG.

In this tutorial, we focus on the Slide-seqV2 mouse olfactory bulb data (Puck_200127_15 from https://singlecell.broadinstitute.org/single_cell/study/SCP815/highly-sensitive-spatial-transcriptomics-at-near-cellular-resolution-with-slide-seqv2#study-summary).

We removed spots outside the main tissue area; the barcodes of the retained spots can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.

Preparation

[1]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
[2]:
import torch
import STAGATE_pyG
[3]:
input_dir = 'Data'
counts_file = os.path.join(input_dir, 'Puck_200127_15.digital_expression.txt')
coor_file = os.path.join(input_dir, 'Puck_200127_15_bead_locations.csv')
[4]:
counts = pd.read_csv(counts_file, sep='\t', index_col=0)
coor_df = pd.read_csv(coor_file, index_col=0)
print(counts.shape, coor_df.shape)
(21220, 21724) (21724, 2)
[5]:
adata = sc.AnnData(counts.T)
adata.var_names_make_unique()
coor_df = coor_df.loc[adata.obs_names, ['xcoord', 'ycoord']]
adata.obsm["spatial"] = coor_df.to_numpy()
[6]:
sc.pp.calculate_qc_metrics(adata, inplace=True)
[7]:
adata
[7]:
AnnData object with n_obs × n_vars = 21724 × 21220
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[8]:
plt.rcParams["figure.figsize"] = (6,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=6, show=False)
plt.title('')
plt.axis('off')
[8]:
(-289.81710000000004, 6312.7151, 173.30850000000004, 5709.8615)
_images/T7_pyG_10_1.png
[9]:
# can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing
used_barcode = pd.read_csv('Data/used_barcodes.txt', sep='\t', header=None)
used_barcode = used_barcode[0]
[10]:
adata = adata[used_barcode,]
[11]:
plt.rcParams["figure.figsize"] = (5,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=10, show=False, title='Removing spots outside the main tissue area')

plt.axis('off')
[11]:
(588.545, 5108.555, 847.6700000000001, 5670.73)
_images/T7_pyG_13_1.png
[12]:
sc.pp.filter_genes(adata, min_cells=50)
print('After filtering: ', adata.shape)
Trying to set attribute `.var` of view, copying.
After filtering:  (20139, 11750)
[13]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Constructing the spatial network

[14]:
STAGATE_pyG.Cal_Spatial_Net(adata, rad_cutoff=50)
STAGATE_pyG.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 228300 edges, 20139 cells.
11.3362 neighbors per cell on average.
_images/T7_pyG_17_1.png

Running STAGATE

[15]:
adata = STAGATE_pyG.train_STAGATE(adata, device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'))
Size of Input:  (20139, 3000)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:37<00:00, 26.61it/s]
[16]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[17]:
sc.tl.louvain(adata, resolution=0.5)
[18]:
adata.obsm["spatial"] = adata.obsm["spatial"] * (-1)
[19]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.embedding(adata, basis="spatial", color="louvain",s=6, show=False, title='STAGATE')
plt.axis('off')
[19]:
(-5108.555, -588.545, -5670.73, -847.6700000000001)
_images/T7_pyG_23_1.png
[20]:
sc.pl.umap(adata, color='louvain', title='STAGATE')
_images/T7_pyG_24_0.png

Tutorial 8: Batch training strategy (pyG framework)

This tutorial demonstrates how to train STAGATE in batches on Slide-seqV2 data using the pyG (PyTorch Geometric) framework.

Benefiting from the pyG package's optimizations for training graph neural networks, it is more than 10x faster than STAGATE based on the tensorflow1 framework and can use a batch training strategy to handle large-scale data.

The download and installation of the STAGATE package based on the pyG framework can be found at https://github.com/zhanglabtools/STAGATE_pyG.

In this tutorial, we focus on the Slide-seqV2 mouse olfactory bulb data (Puck_200127_15 from https://singlecell.broadinstitute.org/single_cell/study/SCP815/highly-sensitive-spatial-transcriptomics-at-near-cellular-resolution-with-slide-seqv2#study-summary).

We removed spots outside the main tissue area; the barcodes of the retained spots can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.

Strategies for dividing subgraphs

Because we build the spatial network from spatial locations, the network can be directly divided into subgraphs in the following form.

图片1.png

The picture above shows an example with num_batch_x=3 and num_batch_y=2. Specifically, we divide the spots into subgraphs according to quantiles of each spatial coordinate (see the sketch below).
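A minimal sketch of this quantile-based division is given here for illustration only; the actual splitting in this tutorial is done by STAGATE_pyG.Batch_Data.

import pandas as pd

def assign_grid_batches(coor_df, num_batch_x=3, num_batch_y=2):
    # coor_df: DataFrame with columns ['X', 'Y'] indexed by spot barcodes.
    # Cut each coordinate at its quantiles and combine the bins into a grid label.
    x_bin = pd.qcut(coor_df['X'], q=num_batch_x, labels=False)
    y_bin = pd.qcut(coor_df['Y'], q=num_batch_y, labels=False)
    return x_bin.astype(str) + '_' + y_bin.astype(str)

# usage sketch:
# coor = pd.DataFrame(adata.obsm['spatial'], index=adata.obs_names, columns=['X', 'Y'])
# print(assign_grid_batches(coor).value_counts())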

Preparation

[1]:
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import os
import sys
from tqdm import tqdm
[2]:
import torch
import STAGATE_pyG
import torch.nn.functional as F
[3]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
[4]:
input_dir = 'Data'
counts_file = os.path.join(input_dir, 'Puck_200127_15.digital_expression.txt')
coor_file = os.path.join(input_dir, 'Puck_200127_15_bead_locations.csv')
[5]:
counts = pd.read_csv(counts_file, sep='\t', index_col=0)
coor_df = pd.read_csv(coor_file, index_col=0)
print(counts.shape, coor_df.shape)
(21220, 21724) (21724, 2)
[6]:
adata = sc.AnnData(counts.T)
adata.var_names_make_unique()
coor_df = coor_df.loc[adata.obs_names, ['xcoord', 'ycoord']]
adata.obsm["spatial"] = coor_df.to_numpy()
[7]:
sc.pp.calculate_qc_metrics(adata, inplace=True)
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 21724 × 21220
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[9]:
plt.rcParams["figure.figsize"] = (6,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=6, show=False)
plt.title('')
plt.axis('off')
[9]:
(-289.81710000000004, 6312.7151, 173.30850000000004, 5709.8615)
_images/T8_Batch_15_1.png
[10]:
# can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing
used_barcode = pd.read_csv('Data/used_barcodes.txt', sep='\t', header=None)
used_barcode = used_barcode[0]
[11]:
adata = adata[used_barcode,]
[12]:
plt.rcParams["figure.figsize"] = (5,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=10, show=False, title='Removing spots outside the main tissue area')

plt.axis('off')
[12]:
(588.545, 5108.555, 847.6700000000001, 5670.73)
_images/T8_Batch_18_1.png
[13]:
sc.pp.filter_genes(adata, min_cells=50)
print('After filtering: ', adata.shape)
Trying to set attribute `.var` of view, copying.
After filtering:  (20139, 11750)
[14]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
[15]:
# Different from the STAGATE tensorflow version: subset the AnnData to the highly variable genes
adata = adata[:,adata.var['highly_variable']]
adata.shape
[15]:
(20139, 3000)

Dividing subgraphs

[16]:
adata.obs['X'] = adata.obsm['spatial'][:,0]
adata.obs['Y'] = adata.obsm['spatial'][:,1]
Trying to set attribute `.obs` of view, copying.
[17]:
# grid setting
num_batch_x = 3
num_batch_y = 2
[18]:
Batch_list = STAGATE_pyG.Batch_Data(adata, num_batch_x=num_batch_x, num_batch_y=num_batch_y,
                                    spatial_key=['X', 'Y'], plot_Stats=True)
_images/T8_Batch_25_0.png

Constructing the spatial network

[19]:
# Constructing the network for each batch
for temp_adata in Batch_list:
    STAGATE_pyG.Cal_Spatial_Net(temp_adata, rad_cutoff=50)
    #STAGATE_pyG.Stats_Spatial_Net(temp_adata)
------Calculating spatial graph...
Trying to set attribute `._uns` of view, copying.
The graph contains 41528 edges, 3782 cells.
10.9804 neighbors per cell on average.
------Calculating spatial graph...
Trying to set attribute `._uns` of view, copying.
The graph contains 32236 edges, 2931 cells.
10.9983 neighbors per cell on average.
------Calculating spatial graph...
Trying to set attribute `._uns` of view, copying.
The graph contains 41060 edges, 3590 cells.
11.4373 neighbors per cell on average.
------Calculating spatial graph...
Trying to set attribute `._uns` of view, copying.
The graph contains 33970 edges, 3123 cells.
10.8774 neighbors per cell on average.
------Calculating spatial graph...
Trying to set attribute `._uns` of view, copying.
The graph contains 29512 edges, 2698 cells.
10.9385 neighbors per cell on average.
------Calculating spatial graph...
Trying to set attribute `._uns` of view, copying.
The graph contains 45606 edges, 4017 cells.
11.3532 neighbors per cell on average.
[20]:
data_list = [STAGATE_pyG.Transfer_pytorch_Data(adata) for adata in Batch_list]
for temp in data_list:
    temp.to(device)
[21]:
STAGATE_pyG.Cal_Spatial_Net(adata, rad_cutoff=50)
data = STAGATE_pyG.Transfer_pytorch_Data(adata)
------Calculating spatial graph...
The graph contains 228300 edges, 20139 cells.
11.3362 neighbors per cell on average.

DataLoader for batches

[22]:
from torch_geometric.loader import DataLoader

# batch_size=1 or 2
loader = DataLoader(data_list, batch_size=1, shuffle=True)
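When batch_size is larger than 1, pyG's DataLoader collates several subgraph Data objects into one disconnected graph per iteration, so the reconstruction loss in the training loop below is computed over all spots of the collated subgraphs. An optional check (my addition, not part of the original tutorial) of what one batch contains:

# Illustrative only: inspect a single collated batch
for batch in loader:
    print(batch.num_graphs, 'subgraph(s),', batch.num_nodes, 'spots,',
          batch.edge_index.shape[1], 'edges')
    break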

Running STAGATE

[23]:
# hyper-parameters
num_epoch = 1000
lr=0.001
weight_decay=1e-4
hidden_dims = [512, 30]
[24]:
model = STAGATE_pyG.STAGATE(hidden_dims = [data_list[0].x.shape[1]]+hidden_dims).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
[25]:
for epoch in tqdm(range(1, num_epoch+1)):
    for batch in loader:
        model.train()
        optimizer.zero_grad()
        z, out = model(batch.x, batch.edge_index)
        loss = F.mse_loss(batch.x, out) #F.nll_loss(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
        optimizer.step()
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:55<00:00, 18.14it/s]
[26]:
# The total network
data.to(device)
[26]:
Data(x=[20139, 3000], edge_index=[2, 248439])
[27]:
model.eval()
z, out = model(data.x, data.edge_index)

STAGATE_rep = z.to('cpu').detach().numpy()
adata.obsm['STAGATE'] = STAGATE_rep

Clustering and UMAP

[28]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[29]:
sc.tl.louvain(adata, resolution=0.5)
[30]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.embedding(adata, basis="spatial", color="louvain",s=6, show=False)#, legend_loc=False)
plt.title('')
plt.axis('off')
[30]:
(588.545, 5108.555, 847.6700000000001, 5670.73)
_images/T8_Batch_41_1.png
[31]:
sc.pl.umap(adata, color='louvain')
_images/T8_Batch_42_0.png

Tutorial 9: ISH-based STARmap dataset

Here we present our re-analysis of the ISH-based mouse visual cortex STARmap dataset (See Fig. S16 of the STAGATE paper for details).

The raw data are available at https://www.dropbox.com/sh/f7ebheru1lbz91s/AADm6D54GSEFXB1feRy6OSASa/visual_1020/20180505_BY3_1kgenes?dl=0&subfolder_nav_tracking=1.

The annotation information and the processed SCANPY object are provided at https://drive.google.com/drive/folders/1I1nxheWlc2RXSdiv24dex3YRaEh780my?usp=sharing.

Preparation

[1]:
import pandas as pd
import numpy as np
import scanpy as sc
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
warnings.filterwarnings("ignore")
[2]:
import STAGATE
[3]:
adata = sc.read('STARmap_20180505_BY3_1k.h5ad')
[4]:
adata.obs.head()
[4]:
Total_counts X Y label
Cell_9 310 4980.777664 -49.771247 L6
Cell_10 533 8729.177425 -53.473114 L4
Cell_13 452 11547.566359 -46.315493 L2/3
Cell_15 288 3280.980264 -37.612811 L6
Cell_16 457 7582.793601 -45.350312 L4
[5]:
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Constructing the spatial network

[6]:
STAGATE.Cal_Spatial_Net(adata, rad_cutoff=400)
STAGATE.Stats_Spatial_Net(adata)
------Calculating spatial graph...
The graph contains 7838 edges, 1207 cells.
6.4938 neighbors per cell on average.
_images/T9_STARmap_9_1.png

Running STAGATE

[7]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (1207, 1020)
WARNING:tensorflow:From /home/dkn/anaconda3/envs/STAGATE_tf/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|██████████| 500/500 [00:21<00:00, 22.79it/s]
[8]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[9]:
adata = STAGATE.mclust_R(adata, used_obsm='STAGATE', num_cluster=7)
R[write to console]:                    __           __
   ____ ___  _____/ /_  _______/ /_
  / __ `__ \/ ___/ / / / / ___/ __/
 / / / / / / /__/ / /_/ (__  ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/   version 5.4.10
Type 'citation("mclust")' for citing this R package in publications.

fitting ...
  |======================================================================| 100%
[10]:
sc.pl.umap(adata, color='mclust')
_images/T9_STARmap_15_0.png
[11]:
plt.rcParams["figure.figsize"] = (4, 2)
sc.pl.embedding(adata, basis="spatial", color='mclust', s=20, show=False)#, legend_loc=False)
[11]:
<AxesSubplot:title={'center':'mclust'}, xlabel='spatial1', ylabel='spatial2'>
_images/T9_STARmap_16_1.png
[14]:
# layer annotation
plt.rcParams["figure.figsize"] = (4, 2)
sc.pl.embedding(adata, basis="spatial", color='label', s=20, show=False)#, legend_loc=False)
[14]:
<AxesSubplot:title={'center':'label'}, xlabel='spatial1', ylabel='spatial2'>
_images/T9_STARmap_17_1.png

Calculate ARI

[15]:
from sklearn.metrics.cluster import adjusted_rand_score
[16]:
adjusted_rand_score(adata.obs['label'], adata.obs['mclust'])
[16]:
0.5441835384954468

Additional Tutorial 1: Work for multiple sections (without batch effects)

Here we show STAGATE's ability to identify spatial domains for multiple sections.

The DLPFC samples (151673-151676) are used in this tutorial.

Preparation

[1]:
import pandas as pd
import numpy as np
import scanpy as sc
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
warnings.filterwarnings("ignore")
[2]:
import STAGATE
[3]:
section_list = ['151673', '151674', '151675', '151676']

Load Data

[4]:
adata_list = {}
for section_id in section_list:
    input_dir = os.path.join('../Data', section_id, 'Data')
    temp_adata = sc.read_visium(path=input_dir, count_file=section_id+'_filtered_feature_bc_matrix.h5')
    temp_adata.var_names_make_unique()

    Truth_df = pd.read_csv(os.path.join('../Data', section_id, section_id+'_truth.txt'), sep='\t', header=None, index_col=0)
    temp_adata.obs['Ground Truth'] = Truth_df.loc[temp_adata.obs_names, 1]
    # make the spot name unique
    temp_adata.obs_names = [x+'_'+section_id for x in temp_adata.obs_names]

    adata_list[section_id] = temp_adata.copy()
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[5]:
fig, axs = plt.subplots(1, 4, figsize=(12, 3))
it=0
for section_id in section_list:
    if it == 3:
        sc.pl.spatial(adata_list[section_id], img_key="hires", ax=axs[it],
                      color=["Ground Truth"], title=section_id, show=False)
    else:
        sc.pl.spatial(adata_list[section_id], img_key="hires", ax=axs[it], legend_loc=None,
                      color=["Ground Truth"], title=section_id, show=False)
    it+=1
... storing 'Ground Truth' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical
... storing 'Ground Truth' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical
... storing 'Ground Truth' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical
... storing 'Ground Truth' as categorical
... storing 'feature_types' as categorical
... storing 'genome' as categorical
_images/AT1_8_1.png

Constructing the spatial network for each section

[6]:
for section_id in section_list:
    STAGATE.Cal_Spatial_Net(adata_list[section_id], rad_cutoff=150)
    STAGATE.Stats_Spatial_Net(adata_list[section_id])
------Calculating spatial graph...
The graph contains 21124 edges, 3639 cells.
5.8049 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 21258 edges, 3673 cells.
5.7876 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 20762 edges, 3592 cells.
5.7801 neighbors per cell on average.
------Calculating spatial graph...
The graph contains 20052 edges, 3460 cells.
5.7954 neighbors per cell on average.
_images/AT1_10_1.png
_images/AT1_10_2.png
_images/AT1_10_3.png
_images/AT1_10_4.png

Note that the spatial network of each section is saved in adata.uns['Spatial_Net']; because the spot names were made unique, these tables can be concatenated directly for multiple sections.

[7]:
adata_list['151673'].uns['Spatial_Net']
[7]:
Cell1 Cell2 Distance
0 AAACAAGTATCTCCCA-1_151673 AGAGGCTTCGGAAACC-1_151673 138.061580
2 AAACAAGTATCTCCCA-1_151673 ACTAGTTGCGATCGTC-1_151673 138.003623
3 AAACAAGTATCTCCCA-1_151673 GCATCGGCCGTGTAGG-1_151673 138.061580
4 AAACAAGTATCTCCCA-1_151673 CAGCAGTCCAGACTAT-1_151673 138.003623
5 AAACAAGTATCTCCCA-1_151673 TTCTTATCCGCTGGGT-1_151673 137.927517
... ... ... ...
1 TTGTTTGTGTAAATTC-1_151673 GCCTATTTGCTACACA-1_151673 138.061580
2 TTGTTTGTGTAAATTC-1_151673 GTAGCGCTGTTGTAGT-1_151673 138.798415
4 TTGTTTGTGTAAATTC-1_151673 TTAAACCGGTAGCGAC-1_151673 137.927517
5 TTGTTTGTGTAAATTC-1_151673 GGATCAAAGGACGAGG-1_151673 138.423264
6 TTGTTTGTGTAAATTC-1_151673 CGTAGCGCCGACGTTG-1_151673 137.003650

21124 rows × 3 columns

Concatenate the scanpy objects and spatial networks

[8]:
adata = sc.concat([adata_list[x] for x in section_list], keys=None)
[9]:
adata.uns['Spatial_Net'] = pd.concat([adata_list[x].uns['Spatial_Net'] for x in section_list])
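Since every spot name carries its section ID as a suffix, the concatenated Spatial_Net should contain no edges between sections. A quick optional sanity check (my addition, not part of the original tutorial):

# Optional check: every edge should connect two spots from the same section
net = adata.uns['Spatial_Net']
same_section = net['Cell1'].str.split('_').str[-1] == net['Cell2'].str.split('_').str[-1]
print('All edges stay within one section:', bool(same_section.all()))
print('Total edges:', net.shape[0])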
[10]:
STAGATE.Stats_Spatial_Net(adata)
_images/AT1_16_0.png

Normalization

[11]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Running STAGATE

[12]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (14364, 3000)
WARNING:tensorflow:From /home/dkn/anaconda3/envs/STAGATE_tf/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|██████████| 500/500 [04:57<00:00,  1.68it/s]
[13]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[14]:
adata = STAGATE.mclust_R(adata, used_obsm='STAGATE', num_cluster=7)
R[write to console]:                    __           __
   ____ ___  _____/ /_  _______/ /_
  / __ `__ \/ ___/ / / / / ___/ __/
 / / / / / / /__/ / /_/ (__  ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/   version 5.4.10
Type 'citation("mclust")' for citing this R package in publications.

fitting ...
  |======================================================================| 100%
[15]:
adata.obs['Sample'] = [x.split('_')[-1] for x in adata.obs_names]
[16]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color='Sample')
... storing 'Sample' as categorical
_images/AT1_24_1.png
[17]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color='Ground Truth')
_images/AT1_25_0.png
[18]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color='mclust')
_images/AT1_26_0.png
[19]:
temp_adata
[19]:
AnnData object with n_obs × n_vars = 3460 × 33538
    obs: 'in_tissue', 'array_row', 'array_col', 'Ground Truth'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'
[20]:
fig, axs = plt.subplots(1, 4, figsize=(12, 3))
it=0
for section_id in section_list:
    adata_list[section_id].obs['STAGATE'] = adata.obs.loc[adata_list[section_id].obs_names, 'mclust']
    if it == 3:
        sc.pl.spatial(adata_list[section_id], img_key="hires", ax=axs[it],
                      color=["STAGATE"], title=section_id, show=False)
    else:
        sc.pl.spatial(adata_list[section_id], img_key="hires", ax=axs[it], legend_loc=None,
                      color=["STAGATE"], title=section_id, show=False)
    it+=1
_images/AT1_28_0.png

Calculate ARI

[21]:
from sklearn.metrics.cluster import adjusted_rand_score
[22]:
for section_id in section_list:
    temp_adata = adata[adata.obs['Sample']==section_id]
    temp_obs = temp_adata.obs.dropna()
    temp_ARI = adjusted_rand_score(temp_obs['mclust'], temp_obs['Ground Truth'])
    print('ARI of section ID %s: %.3f' %(section_id, temp_ARI))
ARI of section ID 151673: 0.564
ARI of section ID 151674: 0.650
ARI of section ID 151675: 0.627
ARI of section ID 151676: 0.620

Additional Tutorial 2: Work for multiple sections (STAGATE + Harmony)

Here we combine STAGATE and the Harmony algorithm to remove batch effects.

The mouse olfactory bulb datasets generated by different platforms (Slide-seqV2 and Stereo-seq) are used in this tutorial. The details of these two datasets can be found in Tutorials 3 and 4.

Preparation

[1]:
import pandas as pd
import numpy as np
import scanpy as sc
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
warnings.filterwarnings("ignore")
[2]:
import STAGATE

Load Data

[3]:
adata_list = {}

Slide-seqV2

[4]:
input_dir = 'MOB_Data/Slide-seqV2'
counts_file = os.path.join(input_dir, 'Puck_200127_15.digital_expression.txt')
coor_file = os.path.join(input_dir, 'Puck_200127_15_bead_locations.csv')
[5]:
counts = pd.read_csv(counts_file, sep='\t', index_col=0)
coor_df = pd.read_csv(coor_file, index_col=0)
print(counts.shape, coor_df.shape)
(21220, 21724) (21724, 2)
[6]:
adata = sc.AnnData(counts.T)
adata.var_names_make_unique()
coor_df = coor_df.loc[adata.obs_names, ['xcoord', 'ycoord']]
adata.obsm["spatial"] = coor_df.to_numpy()
[7]:
sc.pp.calculate_qc_metrics(adata, inplace=True)
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 21724 × 21220
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[9]:
plt.rcParams["figure.figsize"] = (6,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=6, show=False)
plt.title('')
plt.axis('off')
[9]:
(-289.81710000000004, 6312.7151, 173.30850000000004, 5709.8615)
_images/AT2_13_1.png
[10]:
# can be downloaded from https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing
used_barcode = pd.read_csv(os.path.join(input_dir, 'used_barcodes.txt'), sep='\t', header=None)
used_barcode = used_barcode[0]
[11]:
adata = adata[used_barcode,]
adata
[11]:
View of AnnData object with n_obs × n_vars = 20139 × 21220
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[12]:
plt.rcParams["figure.figsize"] = (5,5)
sc.pl.embedding(adata, basis="spatial", color="log1p_total_counts",s=10, show=False, title='Removing spots outside the main tissue area')

plt.axis('off')
[12]:
(588.545, 5108.555, 847.6700000000001, 5670.73)
_images/AT2_16_1.png
[13]:
sc.pp.filter_genes(adata, min_cells=50)
print('After filtering: ', adata.shape)
Trying to set attribute `.var` of view, copying.
After filtering:  (20139, 11750)
[14]:
# make spot name unique
adata.obs_names = [x+'_SlideSeqV2' for x in adata.obs_names]
[15]:
adata_list['SlideSeqV2'] = adata.copy()

Stereo-seq

[16]:
input_dir = 'MOB_Data/Stereo-seq'
counts_file = os.path.join(input_dir, 'RNA_counts.tsv')
coor_file = os.path.join(input_dir, 'position.tsv')
[17]:
counts = pd.read_csv(counts_file, sep='\t', index_col=0)
coor_df = pd.read_csv(coor_file, sep='\t')
print(counts.shape, coor_df.shape)
(27106, 19527) (19527, 3)
[18]:
counts.columns = ['Spot_'+str(x) for x in counts.columns]
coor_df.index = coor_df['label'].map(lambda x: 'Spot_'+str(x))
coor_df = coor_df.loc[:, ['x','y']]
[19]:
coor_df.head()
[19]:
x y
label
Spot_1 12555.007833 6307.537859
Spot_2 12623.666667 6297.166667
Spot_3 12589.567164 6302.552239
Spot_4 12642.495050 6307.386139
Spot_5 13003.333333 6307.990991
[20]:
adata = sc.AnnData(counts.T)
adata.var_names_make_unique()
[21]:
adata
[21]:
AnnData object with n_obs × n_vars = 19527 × 27106
[22]:
coor_df = coor_df.loc[adata.obs_names, ['y', 'x']]
adata.obsm["spatial"] = coor_df.to_numpy()
sc.pp.calculate_qc_metrics(adata, inplace=True)
[23]:
plt.rcParams["figure.figsize"] = (5,4)
sc.pl.embedding(adata, basis="spatial", color="n_genes_by_counts", show=False)
plt.title("")
plt.axis('off')
[23]:
(6002.432692307693, 12486.580128205129, 9908.545833333334, 15086.093055555555)
_images/AT2_28_1.png
[24]:
used_barcode = pd.read_csv(os.path.join(input_dir, 'used_barcodes.txt'), sep='\t', header=None)
used_barcode = used_barcode[0]
adata = adata[used_barcode,]
[25]:
adata
[25]:
View of AnnData object with n_obs × n_vars = 19109 × 27106
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    obsm: 'spatial'
[26]:
plt.rcParams["figure.figsize"] = (5,4)
sc.pl.embedding(adata, basis="spatial", color="n_genes_by_counts", show=False)
plt.title("")
plt.axis('off')
[26]:
(6005.190789473685, 12428.6600877193, 9986.774763741741, 15062.302776957436)
_images/AT2_31_1.png
[27]:
sc.pp.filter_genes(adata, min_cells=50)
print('After filtering: ', adata.shape)
Trying to set attribute `.var` of view, copying.
After filtering:  (19109, 14376)
[28]:
# make spot name unique
adata.obs_names = [x+'_StereoSeq' for x in adata.obs_names]
[29]:
adata_list['StereoSeq'] = adata.copy()

Constructing the spatial network for each section

Slide-seqV2

[30]:
STAGATE.Cal_Spatial_Net(adata_list['SlideSeqV2'], rad_cutoff=50)
STAGATE.Stats_Spatial_Net(adata_list['SlideSeqV2'])
------Calculating spatial graph...
The graph contains 228300 edges, 20139 cells.
11.3362 neighbors per cell on average.
_images/AT2_37_1.png
[31]:
STAGATE.Cal_Spatial_Net(adata_list['StereoSeq'], rad_cutoff=50)
STAGATE.Stats_Spatial_Net(adata_list['StereoSeq'])
------Calculating spatial graph...
The graph contains 144318 edges, 19109 cells.
7.5524 neighbors per cell on average.
_images/AT2_38_1.png

Note that the spatial network of each dataset is saved in adata.uns['Spatial_Net']; because the spot names were made unique, these tables can be concatenated directly for multiple sections.

[32]:
adata_list['SlideSeqV2'].uns['Spatial_Net']
[32]:
Cell1 Cell2 Distance
0 AAAAAAACAAAAGG_SlideSeqV2 ATAAGTTGCCCCGT_SlideSeqV2 41.494698
1 AAAAAAACAAAAGG_SlideSeqV2 CTCCGGGCTCTTCA_SlideSeqV2 44.777226
2 AAAAAAACAAAAGG_SlideSeqV2 CCAGCAAAGCTACA_SlideSeqV2 29.429237
3 AAAAAAACAAAAGG_SlideSeqV2 GCCTAAAGCTTTTG_SlideSeqV2 25.927784
5 AAAAAAACAAAAGG_SlideSeqV2 CCTCCTTAACGTTA_SlideSeqV2 33.634060
... ... ... ...
9 TTTTTTTTTTTTAT_SlideSeqV2 CCTATAACAGCCTG_SlideSeqV2 30.802922
10 TTTTTTTTTTTTAT_SlideSeqV2 CTTGGGCATATAAG_SlideSeqV2 37.316216
11 TTTTTTTTTTTTAT_SlideSeqV2 AGTAGTTGCGGCCG_SlideSeqV2 13.688316
12 TTTTTTTTTTTTAT_SlideSeqV2 TGTATTCACTTTGC_SlideSeqV2 18.753666
13 TTTTTTTTTTTTAT_SlideSeqV2 TAAAACGCGCGAGA_SlideSeqV2 25.347584

228300 rows × 3 columns

Concatenate the scanpy objects and spatial networks

[33]:
adata = sc.concat([adata_list['SlideSeqV2'], adata_list['StereoSeq']], keys=None)
[34]:
adata.uns['Spatial_Net'] = pd.concat([adata_list['SlideSeqV2'].uns['Spatial_Net'], adata_list['StereoSeq'].uns['Spatial_Net']])
[35]:
STAGATE.Stats_Spatial_Net(adata)
_images/AT2_44_0.png

Normalization

[36]:
#Normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Running STAGATE

[37]:
adata = STAGATE.train_STAGATE(adata, alpha=0)
Size of Input:  (39248, 3000)
WARNING:tensorflow:From /home/dkn/anaconda3/envs/STAGATE_tf/lib/python3.6/site-packages/tensorflow_core/python/ops/math_grad.py:1375: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
100%|██████████| 500/500 [18:06<00:00,  2.17s/it]
[38]:
sc.pp.neighbors(adata, use_rep='STAGATE')
sc.tl.umap(adata)
[39]:
adata.obs['Tech'] = [x.split('_')[-1] for x in adata.obs_names]
[40]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata, color='Tech', title='Unintegrated')
... storing 'Tech' as categorical
_images/AT2_51_1.png

Run Harmony on the STAGATE representation

[41]:
import harmonypy as hm
[42]:
data_mat = adata.obsm['STAGATE'].copy()
meta_data = adata.obs.copy()
[43]:
# Run Harmony
ho = hm.run_harmony(data_mat, meta_data, ['Tech'])
2022-10-30 18:54:54,390 - harmonypy - INFO - Iteration 1 of 10
2022-10-30 18:55:05,947 - harmonypy - INFO - Iteration 2 of 10
2022-10-30 18:55:20,272 - harmonypy - INFO - Iteration 3 of 10
2022-10-30 18:55:24,721 - harmonypy - INFO - Converged after 3 iterations
[44]:
# Collect the Harmony-corrected embedding (dimensions x cells) as a DataFrame
res = pd.DataFrame(ho.Z_corr)
res.columns = adata.obs_names
[45]:
adata_Harmony = sc.AnnData(res.T)
[46]:
adata_Harmony.obsm['spatial'] = pd.DataFrame(adata.obsm['spatial'], index=adata.obs_names).loc[adata_Harmony.obs_names,].values
adata_Harmony.obs['Tech'] = adata.obs.loc[adata_Harmony.obs_names, 'Tech']
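Building a separate AnnData object from the corrected embedding is convenient for plotting. Equivalently (a shortcut of my own, not from the original tutorial), the corrected embedding can be stored directly in the original object, since ho.Z_corr is a (dimensions × cells) matrix whose columns follow the order of adata.obs_names:

# Alternative: keep the Harmony-corrected embedding in the original adata
adata.obsm['STAGATE_Harmony'] = np.asarray(ho.Z_corr).T   # transpose to cells x dimensions
# sc.pp.neighbors(adata, use_rep='STAGATE_Harmony')
# sc.tl.umap(adata)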
[47]:
sc.pp.neighbors(adata_Harmony)
sc.tl.umap(adata_Harmony)
[48]:
sc.tl.louvain(adata_Harmony, resolution=0.8)
[49]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata_Harmony, color='Tech', title='STAGATE + Harmony')
_images/AT2_61_0.png
[50]:
plt.rcParams["figure.figsize"] = (3, 3)
sc.pl.umap(adata_Harmony, color='louvain')
_images/AT2_62_0.png
[60]:
fig, axs = plt.subplots(1, 2, figsize=(6, 3))
it=0
for temp_tech in ['StereoSeq', 'SlideSeqV2']:
    temp_adata = adata_Harmony[adata_Harmony.obs['Tech']==temp_tech, ]
    if it == 1:
        sc.pl.embedding(temp_adata, basis="spatial", color="louvain",s=6, ax=axs[it],
                        show=False, title=temp_tech)
    else:
        sc.pl.embedding(temp_adata, basis="spatial", color="louvain",s=6, ax=axs[it], legend_loc=None,
                        show=False, title=temp_tech)
    it+=1
_images/AT2_64_0.png
_images/StaGATE_Fig1.png

News

2022.03.05 STAGATE based on the pyG (PyTorch Geometric) framework is available at STAGATE_pyG (https://github.com/QIFEIDKN/STAGATE_pyG).

Benefiting from the pyG package's optimizations for training graph neural networks, it is more than 10x faster than STAGATE based on the tensorflow1 framework and can use a batch training strategy to handle large-scale data (see Tutorials 7 and 8 for details).

The cell type-aware module has not been supported by STAGATE_pyG yet.

Introduction

STAGATE is designed for spatial clustering and denoising expression profiles of spatially resolved transcriptomics (ST) data.

STAGATE learns low-dimensional latent embeddings that integrate spatial information and gene expression via a graph attention auto-encoder. The method adopts an attention mechanism in the middle layers of the encoder and decoder, which adaptively learns the edge weights of the spatial neighbor network and uses them to update each spot's representation by aggregating information from its neighbors. The latent embeddings and the reconstructed expression profiles can be used for downstream tasks such as spatial domain identification, visualization, spatial trajectory inference, data denoising and 3D expression domain extraction.
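To make the architecture concrete, the sketch below shows the general shape of such a graph attention auto-encoder in PyTorch Geometric: an attention-based encoder compresses each spot's expression into a low-dimensional embedding over the spatial neighbor network, and a symmetric decoder is trained to reconstruct the input. This is a simplified illustration of the idea, not the STAGATE implementation itself (which differs in details such as weight sharing between the encoder and decoder and the optional cell type-aware module).

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class SimpleGATE(torch.nn.Module):
    """Minimal graph attention auto-encoder (illustrative only)."""
    def __init__(self, in_dim, hidden_dim=512, latent_dim=30):
        super().__init__()
        # encoder: expression -> latent embedding
        self.enc1 = GATConv(in_dim, hidden_dim, heads=1, concat=False)
        self.enc2 = GATConv(hidden_dim, latent_dim, heads=1, concat=False)
        # decoder: latent embedding -> reconstructed expression
        self.dec1 = GATConv(latent_dim, hidden_dim, heads=1, concat=False)
        self.dec2 = GATConv(hidden_dim, in_dim, heads=1, concat=False)

    def forward(self, x, edge_index):
        h = F.elu(self.enc1(x, edge_index))
        z = self.enc2(h, edge_index)          # latent embedding (used for clustering)
        h = F.elu(self.dec1(z, edge_index))
        x_rec = self.dec2(h, edge_index)      # reconstructed expression
        return z, x_rec

# training sketch: minimize the reconstruction error over the spatial graph
# model = SimpleGATE(data.x.shape[1])
# z, x_rec = model(data.x, data.edge_index)
# loss = F.mse_loss(x_rec, data.x)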

Citation

Dong, Kangning, and Shihua Zhang. “Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder.” Nature Communications 13.1 (2022): 1-12.