Portfolio

Multi-Omics Pathway-based Data Integration Using Machine Learning and Large Language Models (LLMs)

Duration: March 01, 2024 - October 01, 2024

Aims

  • Use LLM-based semantic searching to map more metabolite names to IDs, widening the scope of pathway analysis (Figure 1A)
  • Apply novel pathway databases augmented with metabolites using deep learning, facilitating more comprehensive pathway mapping and attaining more robust biological predictions (Figure 1B)
  • Develop an extension to PathIntegrate which will provide an unsupervised utility for pathway-based multivariate analysis that can be benchmarked with synthetic data simulations (Figure 1C)


Figure 1: General Workflow of the project. Please click the title for more detailed information on results and outcomes.

Skills and Tools

Task | Packages/Tools
Data Processing, Analysis and Databases
  • pandas, duckdb SQL, fancyimpute, networkx, missforest, statsmodels, plotly, gseapy, matplotlib, seaborn, scipy
Machine Learning + Web App
  • leidenalg, streamlit (HTML + CSS), elasticsearch, sentence_transformers, HuggingFace, base64, mbpls
  • sklearn: metrics (f1_score, precision_score, recall_score, roc_auc_score, roc_curve, confusion_matrix), model_selection (train_test_split, cross_val_score, GridSearchCV), pipeline (Pipeline), preprocessing (StandardScaler), linear_model (LogisticRegression), decomposition (PCA), manifold (TSNE)
Miscellaneous
  • response, requests, OpenAI, urllib, igraph, json, tqdm, warnings, SLURM
Bioinformatics Tools
  • KEGG API, Reactome API, Cytoscape, ssPA, GSEA, MOFA2, pathintegrate, iPATH3

Elucidating Spatial Cell Composition of Neuroblastoma (Group Project)

Duration: January 01, 2024 - March 01, 2024

Aims

  • Meta-analyse various single-cell studies of neuroblastoma, building a single-cell atlas that spans tumour cell diversity (Figure 2.1)
  • Identify cell populations and tissue heterogeneity in single-cell spatial transcriptomics (SCST) data using transfer learning from the single-cell reference
  • Validate, test and scrutinise available tools for exploratory analysis of novel SCST data, including fine-tuning and adaptation of a novel, multi-modal deep learning clustering approach (SiGra) (Figure 2.2)


Figure 2.1: Initial, Single-cell RNA-Seq Workflow of the project.

Figure 2.2: Secondary, Single-cell Spatial RNA-seq Workflow of the project. Please click the title for more detailed information on results and outcomes.

Skills and Tools

Task | Packages/Tools
Single-Cell RNA-Seq Data Integration + Analysis
  • Python: scanpy, rpy2, anndata2ri, integration (scanorama, scvi-tools), monocle, palantir
  • R: Seurat, integration (harmony, rPCA, CCA, BBKNN), CONICSmat, CellChat
Single-Cell Spatial RNA-Seq Data Analysis
  • squidpy, transfer learning (cell2location, SingleR, Seurat, RCTD, scArches), nichedb
Deep Learning (SiGra Modification)
  • torchvision, matplotlib (v2.1.1), torch, seaborn, tqdm, scikit_learn, torch_geometric, keras, optuna, Weights & Biases (wandb), xgboost
Shell Computing
  • Unix: Git (init, clone, add, commit, status, pull, push, branch, merge), ssh, High Performance Computing (HPC), nohup, rsync, Slurm (sbatch), module, chmod (permissions)

Multi-Omics Data Integration Using Machine Learning and Large Language Models (LLMs)

Duration: January 01, 2023 - October 01, 2023

Aims

  • Integrate existing carotid plaque scRNA-seq datasets with the current best-performing integration methods.
  • Benchmark these integration techniques and select the most suitable method for building a carotid plaque single-cell atlas.
  • Identify subtypes of vascular smooth muscle cells (VSMCs) and their RNA markers.
  • Study the abundances of the corresponding protein biomarkers in provided carotid plaque proteomics data and associate them with patient subgroups and plaque characteristics
  • Return offer (internship): validate findings with bulk data, spatial data, pseudotime and cell-cell communication analyses


Figure 3: General Workflow of the project. Please click the title for more detailed information on results and outcomes.

Skills and Tools

Task | Packages/Tools
Data Processing and Integration
Single-cell RNA Sequencing
  • Scanpy
Multi-omics Integration
  • MOFA
Proteomics
  • MaxQuant

Regulation and Biological Functions of Alternative Splicing in Neurones of the Adult Mouse Visual Cortex

Duration: June 01, 2022 - September 01, 2022

Aims

  • Process four single-cell/single-nucleus datasets from the Allen Brain Institute in R
  • Generate high-throughput splicing data using specialised tools in Unix (e.g. Whippet)
  • Identify evidence to support the mechanism and presence of NMD-containing transcripts in the nuclei of PV interneurons
  • Map different isoform profiles of GABAergic and glutamatergic layers
  • Validate the utility of single-cell data for splicing analysis compared to bulk RNA-seq

Figure 4: Poster displaying the outcome of this project investigating alternative splicing during neurogenesis. Please click the title for more detailed information on results and outcomes.

Skills and Tools

Task | Packages/Tools
Data Analysis
  • Python, R, Unix
Single-cell RNA Sequencing
  • Scanpy
Multi-omics Integration
  • MOFA
Proteomics
  • MaxQuant