Open to Research Collaborations

Computational Systems Biology

AI-Driven Biomarker Identification · Graph Learning for Biological Networks · Bioinformatics · Deep Learning

Building graph-based computational methods that integrate structured biological priors with learned representations — for interpretable biomarker identification, disease network inference, and multi-omics data integration.

Bhuwan Sharma
GNN
GRN
Biomarker Identification
Multi-omics
PPI

About

I'm a computational biologist working at the intersection of graph representation learning, systems biology, and multi-omics integration. My work is driven by a simple conviction: biological systems are not collections of independent parts — they are deeply interconnected networks, and the methods we use to study them should reflect that.

Most of what I build revolves around encoding biological structure directly into machine learning architectures — embedding interaction topology, functional ontologies, and sequence-derived representations into models that preserve mechanistic context rather than treating genes as isolated variables. This means working across graph neural networks, spectral methods, and deep sequence models simultaneously, stitching them into pipelines that can turn high-dimensional molecular data into interpretable biological hypotheses.

I'm particularly interested in problems such as how to make ontological knowledge differentiable, how to combine heterogeneous network types without amplifying noise, and how to evaluate predictions when the ground truth itself is incomplete and biased. These are questions I think about constantly, and they shape every methodological choice I make.

Computational Biology · Graph Neural Networks · Biomarker Identification · Systems Biology · Multi-omics · Explainable AI

Education

2023–2025 · M.Sc. Bioinformatics · University of North Bengal
Computational systems biology · ML for biological data · Graph-based inference

2020–2023 · B.Sc. Zoology (Hons.) · Kalimpong College, NBU
CGPA 7.64 · Molecular biology · Genetics · Biochemistry

Research Interests

  • Graph neural networks for biological network analysis
  • Spectral methods & Laplacian-based network inference
  • Multi-omics data integration & biomarker identification
  • Ontology-driven feature construction
  • Oversmoothing in deep GNNs on biological graphs
  • Causal inference in observational multi-omics data

Research

My research sits at the intersection of spectral graph theory, deep learning on biological networks, and multi-omics integration.

Graph-Based Biomarker Identification

Disease mechanisms rarely arise from single-gene perturbations — they emerge from coordinated disruptions across interacting molecular networks. I model disease-associated molecular data as graphs where nodes carry multi-omics features and edges encode known biological interactions: protein-protein, regulatory, and co-expression.

A GCN-based architecture learns node-level representations that integrate expression features, ontology embeddings, and sequence-level information. GAT-style attention weights indicate which interactions contribute most to each prediction, turning a classification output into a biological hypothesis worth testing.
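As a rough illustration of how attention weights can be read off per edge, here is a minimal pure-Python sketch of GAT-style attention coefficients. It is not the project's model: the feature values, the attention vectors `a_src` and `a_dst`, and the three-gene toy graph are all made up for the example.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_edge_attention(h, edges, a_src, a_dst):
    """Return {(src, dst): alpha}, with softmax taken over each dst's incoming edges.

    h            : dict node -> feature vector (assumed already linearly transformed)
    edges        : list of directed (src, dst) pairs
    a_src, a_dst : the two halves of the attention vector (hypothetical values here)
    """
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    # Unnormalised score e_ij = LeakyReLU(a_src . h_src + a_dst . h_dst)
    scores = {(s, d): leaky_relu(dot(a_src, h[s]) + dot(a_dst, h[d])) for s, d in edges}
    alphas = {}
    for dst in {d for _, d in edges}:
        incoming = [e for e in edges if e[1] == dst]
        m = max(scores[e] for e in incoming)          # subtract max for numerical stability
        exps = {e: math.exp(scores[e] - m) for e in incoming}
        z = sum(exps.values())
        for e in incoming:
            alphas[e] = exps[e] / z
    return alphas

# Toy inputs: gene names are real, feature values are invented for illustration.
h = {"TP53": [0.5, 1.0], "MDM2": [2.0, 0.1], "BRCA1": [0.2, 0.3]}
edges = [("MDM2", "TP53"), ("BRCA1", "TP53"), ("TP53", "MDM2")]
alphas = gat_edge_attention(h, edges, a_src=[1.0, 0.0], a_dst=[0.0, 1.0])
ranked = sorted(alphas, key=alphas.get, reverse=True)
```

Because the softmax is taken per target node, a node with a single incoming edge always gets a weight of 1.0, so comparing attention values across different target nodes needs some care.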

GCN · PPI Networks · Attention · Multi-omics

Spectral Network Inference

Standard correlation-based gene network construction conflates direct regulatory relationships with indirect, transitive associations — producing dense, noisy graphs that are hard to interpret and harder to scale. PCA-regularized approaches project the correlation structure onto lower-dimensional subspaces, discarding technical noise and recovering sparser, more biologically coherent networks.
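One way to write the low-rank idea down, assuming a row-centered expression matrix $X \in \mathbb{R}^{n \times m}$ (genes by samples) and keeping the top $k$ singular components. The notation is introduced here for illustration and is not taken from a specific PCnet implementation.

```latex
X = U \Sigma V^{\top}, \qquad
X_k = U_k \Sigma_k V_k^{\top}, \qquad
C_k = \frac{1}{m-1} X_k X_k^{\top}
    = \frac{1}{m-1} U_k \Sigma_k^{2} U_k^{\top},
\qquad
k = \min \Bigl\{ r : \frac{\sum_{i \le r} \sigma_i^{2}}{\sum_{i} \sigma_i^{2}} \ge \tau \Bigr\}
```

Here $\tau$ is a variance threshold. $C_k$ is a covariance matrix whose diagonal is no longer exactly one, so it would typically be rescaled to a correlation matrix before thresholding.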

SVD · PCnet · Spectral Methods · GPU

Ontology-Structured Representations

Gene Ontology annotations encode decades of curated biological knowledge — but as discrete hierarchical labels, they resist direct integration into gradient-based learning. Converting them into dense, semantically coherent vector representations requires IC-weighted ancestor propagation through the GO DAG, followed by dimensionality reduction that respects the original semantic hierarchy.

Gene Ontology · Embeddings · DAG · IC-weighting

Open Questions I'm Thinking About

  1. How to encode hierarchical ontologies as differentiable graph structures for end-to-end learning
  2. Principled combination of multiple biological network types with different noise profiles
  3. Where oversmoothing in deep GNNs destroys discriminative signal on biological graphs
  4. Moving beyond correlation toward causal inference in observational multi-omics data
  5. Evaluation when ground truth labels are incomplete and biased toward well-studied genes
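The oversmoothing question (item 3) can be seen on a toy example. The sketch below applies repeated mean aggregation on a four-node path graph and tracks how far apart the node features stay. The graph, the features, and the `alpha = 0.2` initial-residual variant are assumptions chosen for illustration, not the setup used in the projects.

```python
def propagate(x, neighbors):
    """One mean-aggregation step with self-loops: x_i <- mean of x over {i} and its neighbours."""
    out = []
    for i, nbrs in enumerate(neighbors):
        idx = [i] + nbrs
        out.append(sum(x[j] for j in idx) / len(idx))
    return out

def spread(x):
    """Gap between the largest and smallest node feature."""
    return max(x) - min(x)

neighbors = [[1], [0, 2], [1, 3], [2]]   # path graph 0-1-2-3
x0 = [0.0, 0.0, 1.0, 1.0]                # two "classes" of nodes

plain = list(x0)       # plain stacked propagation
anchored = list(x0)    # propagation that mixes the input back in at every layer
alpha = 0.2            # hypothetical mixing weight for the initial-residual variant
plain_spreads = []
for _ in range(20):
    plain = propagate(plain, neighbors)
    plain_spreads.append(spread(plain))
    anchored = [(1 - alpha) * p + alpha * s
                for p, s in zip(propagate(anchored, neighbors), x0)]

print(round(plain_spreads[-1], 4), round(spread(anchored), 4))
```

After 20 layers the plain variant has collapsed the two groups almost to the same value, while the variant that re-injects the input features keeps them separated. Whether that separation is still useful on a real biological graph is exactly the open question.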

Projects

In Progress

Graph-Based Biomarker Identification in Disease Networks

Disease-associated molecular data modeled as a graph with multi-omics node features and a GCN architecture for biomarker prediction with attention-based interpretability. Nodes are genes; edges encode known PPI and co-expression relationships.

PyTorch Geometric · GCN · GAT · STRING PPI · TCGA

Approach

Nodes are genes with multi-omics features: expression z-scores, 128-dimensional GO embeddings, and DNABERT promoter embeddings projected to 128 dimensions. Edges come from STRING PPI (combined score ≥ 700) merged with PCnet-inferred co-expression links. A 3-layer GCN with residual connections and GAT-style attention ranks edges by attention weight, used as a proxy for biological importance.
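A minimal sketch of the edge-construction step, assuming STRING rows arrive as `(gene_a, gene_b, score)` tuples on the 0 to 1000 scale and co-expression links as plain gene pairs. The function name `build_edges`, the input layout, and the scores below are hypothetical.

```python
def build_edges(string_rows, coexpr_pairs, min_score=700):
    """Return {frozenset({a, b}): set of evidence types} for an undirected gene graph."""
    edges = {}
    for a, b, score in string_rows:
        if a != b and score >= min_score:            # keep only high-confidence PPI
            edges.setdefault(frozenset((a, b)), set()).add("ppi")
    for a, b in coexpr_pairs:
        if a != b:                                   # drop self-loops
            edges.setdefault(frozenset((a, b)), set()).add("coexpr")
    return edges

# Invented scores, for illustration only.
string_rows = [("TP53", "MDM2", 999), ("TP53", "BRCA1", 650), ("BRCA1", "BARD1", 920)]
coexpr_pairs = [("BRCA1", "TP53"), ("MDM2", "TP53")]
edges = build_edges(string_rows, coexpr_pairs)
```

Keeping the evidence type on each edge, rather than collapsing to a plain union, leaves room to treat PPI and co-expression edges differently later on.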

Current Status

Data integration pipeline complete. Baseline GCN trained with measurable improvement over non-graph baselines. Currently working on attention-based interpretation to surface biologically meaningful subnetworks.

Open Questions

Handling missing modalities gracefully; evaluation bias toward well-studied disease genes; attention scalability on graphs with more than 100K edges.

GPU-Accelerated Principal Component Network

Full-transcriptome gene co-expression network inference using GPU-accelerated SVD — achieving roughly 18× speedup over CPU with biologically interpretable, sparse output networks suitable for downstream GNN input.

CUDA · cuSOLVER · SVD · PCnet · Python

Problem

Gene co-expression network inference scales as O(n²) in candidate edges, and a full decomposition of the n × n gene-gene matrix costs O(n³), making full-transcriptome analysis (≥20K genes) impractical on CPU hardware.
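To put the quadratic term in concrete numbers, a quick back-of-the-envelope calculation for n = 20,000 genes. Storing the dense matrix in float32 is an assumption made for the estimate.

```python
n_genes = 20_000

# Number of unordered gene pairs, i.e. candidate edges.
n_pairs = n_genes * (n_genes - 1) // 2

# Memory for one dense n x n matrix at 4 bytes per entry (float32 assumed).
dense_bytes_fp32 = n_genes * n_genes * 4

print(n_pairs)                  # 199990000
print(dense_bytes_fp32 / 1e9)   # 1.6  (GB, decimal)
```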

Approach

GPU-accelerated SVD via cuSOLVER, retaining top-k components explaining over 95% of variance. Reconstructing the reduced correlation matrix suppresses noise; mutual rank thresholding produces sparse, biologically interpretable networks.
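The steps above can be sketched on the CPU with NumPy standing in for the cuSOLVER call. This is an illustration under stated assumptions, not the project's code: the function names, the mutual-rank cutoff of 2, and the synthetic two-module data are all invented for the example.

```python
import numpy as np

def low_rank_correlation(X, var_threshold=0.95):
    """X: genes x samples. Returns (denoised correlation matrix, components kept)."""
    Xc = X - X.mean(axis=1, keepdims=True)             # centre each gene
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(explained, var_threshold) + 1)
    # Low-rank covariance up to a 1/(m-1) factor, which cancels on normalisation.
    cov = (U[:, :k] * s[:k] ** 2) @ U[:, :k].T
    d = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    return cov / np.outer(d, d), k

def mutual_rank_edges(C, max_mr=2.0):
    """Keep pairs whose mutual rank sqrt(rank_ij * rank_ji) is at most max_mr."""
    A = np.abs(C).copy()
    np.fill_diagonal(A, -np.inf)                       # never rank a gene against itself
    order = np.argsort(-A, axis=1)                     # strongest partner first, per row
    n = A.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    mr = np.sqrt(ranks * ranks.T)
    i, j = np.where(np.triu(mr <= max_mr, k=1))
    return list(zip(i.tolist(), j.tolist()))

# Synthetic data: two hypothetical co-expression modules of three genes each.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 50))
X = np.vstack([f + 0.1 * rng.normal(size=50) for f in (f1, f1, f1, f2, f2, f2)])
C, k = low_rank_correlation(X)
edges = mutual_rank_edges(C)
```

On this toy input the sparse network should contain only within-module pairs, since each gene's two strongest partners are its module mates.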

Results

~18× speedup over CPU SVD on a 15K × 500 matrix. Full human transcriptome network inference in under 4 minutes on an RTX 3090. Recovered known pathway structure (KEGG) with higher precision than standard Pearson correlation networks.

Limitations

GPU memory caps effective matrix size (~25K genes). The linear PCA assumption misses nonlinear regulatory relationships.

Ontology-Structured Feature Engineering

Converting Gene Ontology annotations into dense learnable embeddings using IC-weighted ancestor propagation — yielding 8–12% AUROC improvement over expression-only baselines in gene function prediction tasks.

Gene Ontology · SVD · IC-weighting · Python

Problem

GO annotations (~45K terms) exist as discrete hierarchical labels. Naive one-hot encoding ignores parent-child relationships and produces extremely sparse vectors unsuitable for gradient-based learning.

Approach

Ancestor propagation up the GO DAG captures implicit parent terms. IC-weighted encoding downweights uninformative high-level terms. SVD then produces dense 128-dimensional GO embeddings per gene that preserve semantic relationships.
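The first two steps can be sketched on a toy DAG. The term names, the gene-to-term annotations, and the definition of IC as the negative log of annotation frequency after propagation are assumptions for illustration. The final SVD step is left out.

```python
import math

def ancestors(term, parents):
    """All ancestors of a term in a DAG given as {child: [parents]}."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def propagate_annotations(annotations, parents):
    """Expand each gene's direct annotations with every ancestor term."""
    out = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for t in terms:
            full |= ancestors(t, parents)
        out[gene] = full
    return out

def information_content(propagated):
    """IC(t) = -log(fraction of genes annotated to t after propagation)."""
    n = len(propagated)
    counts = {}
    for terms in propagated.values():
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    return {t: -math.log(c / n) for t, c in counts.items()}

# Hypothetical mini-ontology and annotations.
parents = {"metabolism": ["root"], "signaling": ["root"], "lipid_metabolism": ["metabolism"]}
annotations = {
    "geneA": {"lipid_metabolism"},
    "geneB": {"metabolism"},
    "geneC": {"signaling"},
    "geneD": {"lipid_metabolism", "signaling"},
}
propagated = propagate_annotations(annotations, parents)
ic = information_content(propagated)
terms = sorted(ic)
# IC-weighted encoding: one row per gene, one column per term.
vectors = {g: [ic[t] if t in ts else 0.0 for t in terms] for g, ts in propagated.items()}
```

The root term is annotated to every gene after propagation, so its IC is zero and it drops out of the encoding, which is the intended downweighting of uninformative high-level terms.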

Results

8–12% AUROC improvement over expression-only baselines. UMAP projection shows meaningful clustering by biological process. Nearest-neighbor analysis recovers known functional modules with higher recall than GO term overlap approaches.

Limitations

Annotation quality varies significantly by organism. The SVD approach assumes linear structure — GNN-based methods may better capture full DAG topology.

Statistical Foundations for Biological Data

Reproducible R pipelines for preprocessing biological datasets: distributional analysis, batch correction, variance-stabilizing transformations, and multiple testing frameworks that serve as the foundation for all downstream analysis.

R · DESeq2 · ComBat · limma

Problem & Approach

Biological datasets exhibit zero-inflated count distributions, complex batch structures, and extreme dimensionality. Built reproducible pipelines covering distributional analysis, VST/rlog transformations, PCA-based batch detection and correction, and multiple testing frameworks.
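For the multiple-testing step, the Benjamini-Hochberg adjustment is the usual default in this kind of pipeline. The pipelines themselves are in R, so the Python version below is only a sketch of the procedure with invented p-values, not the code used in the project.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.20]      # invented p-values
padj = benjamini_hochberg(pvals)
```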

Results

Identified and corrected batch effects that had reversed the apparent direction of change for 15% of differentially expressed genes. This preprocessing framework now underpins all subsequent analysis projects.

In Progress

DNABERT Promoter Sequence Embeddings

Leveraging pre-trained DNA language models to generate promoter-region sequence embeddings for each gene, integrated as node features in the GNN biomarker identification pipeline alongside expression and ontology features.

DNABERT · Transformers · HuggingFace · Python · CUDA

Motivation

Gene expression is partly determined by promoter sequence features — transcription factor binding sites, GC content, CpG islands. DNABERT encodes these sequence-level signals into dense representations that complement expression-based and ontological features.

Approach

Upstream promoter sequences (2 kb) are extracted per gene, split into k-mer tokens, encoded with DNABERT, and projected to 128 dimensions for concatenation with expression z-scores and GO embeddings.
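A small sketch of the tokenization step, assuming overlapping 6-mers and a window of 510 tokens with 50% overlap. The window size, the stride, and the placeholder sequence are assumptions: a 2 kb promoter yields far more k-mers than a BERT-style encoder usually accepts in one pass, so some form of chunking is needed, but the exact scheme here is illustrative.

```python
def kmer_tokens(seq, k=6):
    """Overlapping k-mers with stride 1, upper-cased."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def chunk(tokens, max_len=510, stride=255):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):      # this window reached the end
            break
        start += stride
    return chunks

promoter = "ACGT" * 500                  # placeholder 2,000 bp sequence
tokens = kmer_tokens(promoter, k=6)
chunks = chunk(tokens)
print(len(tokens), len(chunks))          # 1995 7
```

Per-window embeddings would then need to be pooled (for example averaged) into one vector per gene before the projection to 128 dimensions.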

Multi-Network Heterogeneous Graph Fusion

Principled fusion of PPI, co-expression, and regulatory network layers into a single heterogeneous graph for downstream GNN training — addressing the challenge of combining networks with different noise profiles and edge semantics.

PyTorch Geometric · NetworkX · Heterogeneous GNN · Python

Motivation

PPI, co-expression, and transcriptional regulatory networks capture complementary aspects of gene function but differ fundamentally in their noise characteristics and biological semantics. Naive edge union conflates these differences.

Approach

Heterogeneous graph construction with typed edges per network layer. Separate message-passing per edge type, with learned aggregation weights allowing the model to up- or down-weight each network type by task.
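The aggregation idea can be sketched without any deep learning framework. Below, each edge type produces its own mean message, and a softmax over per-type logits decides how much each type contributes. In a trained model the logits would be learnable parameters. Here they are fixed numbers, and scalar node features are used to keep the example short. All names and values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def hetero_layer(x, typed_neighbors, type_logits):
    """One message-passing step with a separate mean aggregation per edge type.

    x               : {node: scalar feature}
    typed_neighbors : {edge_type: {node: [neighbours]}}
    type_logits     : {edge_type: float}, softmaxed into aggregation weights
    """
    etypes = sorted(typed_neighbors)
    weights = dict(zip(etypes, softmax([type_logits[t] for t in etypes])))
    out = {}
    for node in x:
        total = 0.0
        for t in etypes:
            nbrs = typed_neighbors[t].get(node, [])
            msg = sum(x[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            total += weights[t] * msg
        out[node] = total
    return out, weights

x = {"A": 1.0, "B": 3.0, "C": 5.0}
typed_neighbors = {"ppi": {"A": ["B"]}, "coexpr": {"A": ["B", "C"]}}

out_equal, w_equal = hetero_layer(x, typed_neighbors, {"ppi": 0.0, "coexpr": 0.0})
out_ppi, w_ppi = hetero_layer(x, typed_neighbors, {"ppi": 2.0, "coexpr": 0.0})
```

With equal logits node A receives the average of the two per-type messages. Raising the PPI logit pulls its update toward the PPI message, which is how the model can down-weight a noisier network layer for a given task.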

Research Pipeline

End-to-end computational pipeline from raw biological data to actionable predictions.

Gene Expression Data

RNA-seq count matrices, microarray profiles, TCGA datasets

DEG Analysis

DESeq2, limma-voom, variance-stabilizing transforms, batch correction

Network Construction

PPI from STRING, GRN inference, PCnet co-expression networks

GO Enrichment

clusterProfiler, IC-weighted embeddings, ontology-aware features

Embedding Generation

GO embeddings, DNABERT promoter features, expression z-scores

Graph Neural Networks

GCN, GAT with residual connections, attention-based edge ranking

Biomarker Prediction

Node classification, subnetwork ranking, disease gene prioritization

Explainability

Attention visualization, subgraph extraction, pathway mapping

Tech Stack

Deep Learning

PyTorch
PyTorch Geometric
DGL
scikit-learn
Transformers (HF)
CUDA / cuSOLVER

Bioinformatics

DESeq2
limma / edgeR
clusterProfiler
biomaRt
Bioconductor
ComBat / SVA

Network & Graph

NetworkX
STRING PPI
Neo4j
Cytoscape
PCnet

Languages

Python
R
Bash / Shell
C
Java

Infrastructure

Linux
Git / GitHub
Docker
Conda
GitHub Actions

Visualization & APIs

Streamlit
Matplotlib / Seaborn
Plotly
FastAPI / Flask
UMAP / t-SNE

Publications & Preprints

Manuscripts in Preparation

Research outputs from current projects are being written up for submission. If you're interested in the methodology or early results, feel free to reach out directly — I'm happy to share what I can.

Google Scholar Profile →

Research Notes

Dec 2024 GNN

Understanding Oversmoothing in Biological GNNs

Why deep graph neural networks lose discriminative power on biological networks, and what residual connections can — and can't — fix.

Nov 2024 Methods

PCnet vs Pearson: When Correlation Lies

A practical comparison of PCA-regularized network inference against standard correlation approaches for gene co-expression analysis.

Oct 2024 Ontology

Making Gene Ontology Learnable

From discrete DAG annotations to dense vector representations: engineering GO features that neural networks can actually use.

bhuwan@research ~
$ cat research_summary.json
{
  "researcher": "Bhuwan Sharma",
  "focus": "Computational Systems Biology",
  "methods": ["GNN", "Spectral Analysis", "Multi-omics"],
  "tools": ["PyTorch", "PyG", "R/Bioconductor"],
  "status": "Building the future of biological AI"
}
$

Contact

Let's Collaborate

I'm always open to research discussions, method questions, and collaboration on problems in computational biology, graph learning for biological discovery, or multi-omics integration. If something I'm working on overlaps with your interests, don't hesitate to reach out.