Reproducibility

This page explains how to re-run the full pipeline from scratch.

Requirements

  • Python 3.11+
  • API credentials (see .env.example)

# 1. Clone the repository
git clone https://github.com/Zer0Juice/synbio-diversification
cd synbio-diversification

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up credentials
cp .env.example .env
# Edit .env and fill in your API keys

# 4. Run the pipeline by working through the notebooks in notebooks/ in numbered order
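
Before starting the notebooks, it can help to confirm that the requirements above are met. The following is a minimal sketch, not part of the repository: it assumes .env and .env.example use plain KEY=VALUE lines, and the helper names (parse_env, missing_env_keys, python_is_supported) are hypothetical.

```python
import sys
from pathlib import Path


def parse_env(path):
    """Return a dict of KEY -> value from a simple KEY=VALUE file (assumed format)."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_env_keys(example_path=".env.example", env_path=".env"):
    """List keys named in the example file that are absent or empty in .env."""
    expected = parse_env(example_path)
    actual = parse_env(env_path) if Path(env_path).exists() else {}
    return [key for key in expected if not actual.get(key)]


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter meets the Python 3.11+ requirement."""
    return tuple(version_info[:2]) >= (3, 11)
```

Called from the repository root, missing_env_keys() returns the names of any credentials still to be filled in. It only detects missing or empty values, so a placeholder copied unchanged from .env.example would not be flagged.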

Step-by-step notebooks

Notebook                          What it does
01_ingest_papers.ipynb            Download papers from OpenAlex
02_ingest_patents.ipynb           Download patents from Lens.org
03_ingest_projects.ipynb          Load iGEM project and parts data
04_embed.ipynb                    Generate embeddings for all artifacts
05_cluster.ipynb                  UMAP reduction and HDBSCAN clustering
06_visualize.ipynb                Export data for the website visualizations
walkthrough_carbon_capture.ipynb  End-to-end walkthrough for the case study
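
The numbered notebooks can also be run non-interactively, one after another. The sketch below is one possible way to do this and is not shipped with the repository: it assumes the notebooks live in notebooks/, that Jupyter with nbconvert is installed (this may or may not be covered by requirements.txt), and that each notebook runs without manual input. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Numbered notebooks from the table above, in pipeline order.
NOTEBOOKS = [
    "01_ingest_papers.ipynb",
    "02_ingest_patents.ipynb",
    "03_ingest_projects.ipynb",
    "04_embed.ipynb",
    "05_cluster.ipynb",
    "06_visualize.ipynb",
]


def build_command(notebook, notebook_dir="notebooks"):
    """Build a `jupyter nbconvert` command that executes one notebook in place."""
    path = Path(notebook_dir) / notebook
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(path),
    ]


def run_pipeline(notebooks=NOTEBOOKS, notebook_dir="notebooks", dry_run=False):
    """Execute the notebooks in order, stopping at the first failure.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = [build_command(nb, notebook_dir) for nb in notebooks]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if a notebook fails to run.
            subprocess.run(command, check=True)
    return commands
```

Calling run_pipeline(dry_run=True) lists the commands without running them, which is a cheap way to check the paths before committing to a full run.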

Configuration

All major parameters (keywords, model name, clustering settings) are set in config/settings.yaml. Edit that file to change how the pipeline behaves.
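
To inspect the current settings from a script or notebook, something like the following could be used. This is a sketch under assumptions: it relies on PyYAML being installed, and the key names in the usage note are invented for illustration, since the actual layout of config/settings.yaml is not documented on this page.

```python
from pathlib import Path


def load_settings(path="config/settings.yaml"):
    """Read the YAML settings file into a plain dict."""
    import yaml  # PyYAML; assumed to be installed, imported only when needed

    with Path(path).open(encoding="utf-8") as handle:
        # safe_load returns None for an empty file, so fall back to {}.
        return yaml.safe_load(handle) or {}


def get_setting(settings, dotted_key, default=None):
    """Look up a nested value by a dotted path, returning default if absent."""
    node = settings
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

For example, get_setting(load_settings(), "clustering.min_cluster_size") would return that value if the file happened to contain such a key, and None otherwise. Check the file itself for the real key names.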

Data

Raw data files are too large to commit to the repository. The processed CSVs in data/processed/ are committed, so you can skip the ingestion notebooks (01 to 03) and work from those files instead.
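
A quick way to see whether the processed files are present in a checkout is sketched below. The helper names are hypothetical, and because the exact CSV file names are not listed on this page, the check only looks for any .csv file in data/processed/.

```python
from pathlib import Path


def processed_csvs(processed_dir="data/processed"):
    """Return the CSV files found in the processed-data directory, sorted by name."""
    directory = Path(processed_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.csv"))


def can_skip_ingestion(processed_dir="data/processed"):
    """True when at least one processed CSV is available."""
    return bool(processed_csvs(processed_dir))
```

If can_skip_ingestion() returns False, the processed directory is missing or empty and the ingestion notebooks need to be run first.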

Rendering the website locally

# Install Quarto: https://quarto.org/docs/get-started/
quarto preview website/