Reproducibility

This page explains how to re-run the full pipeline from scratch.

Requirements

Python 3.11+
API credentials (see .env.example)

# 1. Clone the repository
git clone https://github.com/Zer0Juice/synbio-diversification
cd synbio-diversification

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up credentials
cp .env.example .env
# Edit .env and fill in your API keys

# 4. Run the pipeline by working through the notebooks in notebooks/ in numbered order (01 to 06); each one is a step-by-step walkthrough
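
Before running anything that calls an external API, it can help to confirm that step 3 left no credential empty. The following is a minimal sketch, not part of the repository: it assumes `.env.example` and `.env` use plain `KEY=VALUE` lines, and the helper names `parse_env_keys` and `missing_credentials` are hypothetical.

```python
from pathlib import Path


def parse_env_keys(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_credentials(example_text: str, env_text: str) -> list[str]:
    """Return keys named in the example file that are absent or empty in .env."""
    expected = parse_env_keys(example_text)
    actual = parse_env_keys(env_text)
    return sorted(key for key in expected if not actual.get(key))


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    example, env = Path(".env.example"), Path(".env")
    if example.exists() and env.exists():
        missing = missing_credentials(example.read_text(), env.read_text())
        print("Missing or empty keys:", missing or "none")
    else:
        print("Run this from the repository root after creating .env")
```

The check only detects empty or absent keys. A placeholder value copied unchanged from `.env.example` would still count as filled in.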

Step-by-step notebooks

| Notebook | What it does |
| --- | --- |
| `01_ingest_papers.ipynb` | Download papers from OpenAlex |
| `02_ingest_patents.ipynb` | Download patents from Lens.org |
| `03_ingest_projects.ipynb` | Load iGEM project and parts data |
| `04_embed.ipynb` | Generate embeddings for all artifacts |
| `05_cluster.ipynb` | UMAP reduction and HDBSCAN clustering |
| `06_visualize.ipynb` | Export data for the website visualizations |
| `walkthrough_carbon_capture.ipynb` | End-to-end walkthrough for the case study |
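
The numbered notebooks can also be executed without opening them, one after another. The sketch below is one possible way to do that, under several assumptions: the notebooks sit directly in `notebooks/`, the two-digit prefix reflects the intended run order, and `jupyter` with `nbconvert` is installed (it may not be listed in `requirements.txt`). The function names are hypothetical. The unnumbered case-study walkthrough is deliberately left out.

```python
import re
import subprocess
from pathlib import Path


def ordered_notebooks(notebook_dir: Path) -> list[Path]:
    """Return notebooks named like '01_*.ipynb', sorted by their numeric prefix."""
    if not notebook_dir.is_dir():
        return []
    numbered = [
        path for path in notebook_dir.glob("*.ipynb")
        if re.match(r"\d{2}_", path.name)
    ]
    # Zero-padded prefixes sort correctly as plain strings.
    return sorted(numbered, key=lambda path: path.name)


def nbconvert_command(notebook: Path) -> list[str]:
    """Build the command that executes a notebook and saves the outputs in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook),
    ]


if __name__ == "__main__":
    # Assumption: run from the repository root, with credentials already in .env.
    for notebook in ordered_notebooks(Path("notebooks")):
        print("Running", notebook.name)
        subprocess.run(nbconvert_command(notebook), check=True)
```

Because `check=True` stops at the first failing notebook, later steps never run on incomplete inputs. Note that the ingestion notebooks call external services, so a full run depends on valid credentials and network access.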

Configuration

All major parameters (keywords, model name, clustering settings) are in config/settings.yaml. Edit that file to change the pipeline behaviour.

Data

Raw data files are too large to commit to the repository. The processed CSVs in data/processed/ are committed, so you can skip the ingestion step (notebooks 01 to 03) and start from those files instead.
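
To decide whether ingestion can be skipped on a given checkout, a quick check of `data/processed/` may be enough. This is a minimal sketch with hypothetical helper names. It only tests that non-empty CSV files are present and does not know which specific files the later notebooks expect.

```python
from pathlib import Path


def processed_csvs(processed_dir: Path) -> list[Path]:
    """List the CSV files directly inside the processed-data directory."""
    if not processed_dir.is_dir():
        return []
    return sorted(processed_dir.glob("*.csv"))


def can_skip_ingestion(processed_dir: Path) -> bool:
    """Return True if at least one non-empty CSV is available."""
    return any(path.stat().st_size > 0 for path in processed_csvs(processed_dir))


if __name__ == "__main__":
    # Assumption: run from the repository root.
    directory = Path("data/processed")
    for path in processed_csvs(directory):
        print(f"{path.name}: {path.stat().st_size} bytes")
    print("Ingestion can be skipped:", can_skip_ingestion(directory))
```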

Rendering the website locally

# Install Quarto: https://quarto.org/docs/get-started/
quarto preview website/