Methods
This page describes how the corpus was built, how artifacts were embedded and clustered, and how the carbon-capture case study was identified.
Data sources
| Artifact type | Source | Access method |
|---|---|---|
| Academic papers | OpenAlex | Free REST API |
| Patents | Lens.org | Free API (token required) |
| iGEM projects | iGEM community data | CSV download |
| iGEM parts | iGEM Parts Registry | CSV download / API |
All artifacts are normalized to a shared schema before processing. See src/utils/schema.py for field definitions.
Corpus construction
Papers and patents were retrieved by searching for synthetic biology keywords in titles and abstracts. The seed keywords were:
- synthetic biology
- metabolic engineering
- genetic engineering
Citation expansion was applied to papers: we also retrieved works that cite highly-cited seed papers, to capture related work that uses different terminology.
The keyword list and year range can be adjusted in config/settings.yaml.
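The two retrieval steps above can be sketched as OpenAlex request URLs. This is only an illustration, not the project's actual retrieval code: the helper names and the work ID `W123` are hypothetical, and it assumes the OpenAlex `title_and_abstract.search` and `cites` filters on the `/works` endpoint. The threshold for "highly-cited" is not stated here, so it is left out.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"
SEED_KEYWORDS = ["synthetic biology", "metabolic engineering", "genetic engineering"]


def keyword_search_url(keyword, per_page=200):
    """URL for works whose title or abstract matches `keyword` (hypothetical helper)."""
    params = {"filter": f"title_and_abstract.search:{keyword}", "per-page": per_page}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def citing_works_url(seed_work_id, per_page=200):
    """URL for works that cite one seed paper (the citation-expansion step)."""
    params = {"filter": f"cites:{seed_work_id}", "per-page": per_page}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


for kw in SEED_KEYWORDS:
    print(keyword_search_url(kw))

# "W123" is a placeholder ID, not a real seed paper.
print(citing_works_url("W123"))
```

Building the URLs separately from sending them keeps the query logic easy to inspect before any network request is made.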
Case study identification
Artifacts were tagged as belonging to the carbon capture case study if their title or abstract contained at least one of these keywords:
- carbon capture, carbon sequestration, CO2 fixation, carbon dioxide reduction, carbon neutral, biofuel, carbon cycle, autotrophic, RuBisCO, Calvin cycle, carboxylase
Each artifact also receives a confidence score between 0 and 1, equal to the fraction of the listed keywords that appear in its title or abstract. A single keyword match is enough to set case_study_flag = True.
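A minimal sketch of this tagging rule follows. It assumes case-insensitive substring matching, which this page does not specify, and the function name `tag_case_study` is hypothetical.

```python
CASE_STUDY_KEYWORDS = [
    "carbon capture", "carbon sequestration", "CO2 fixation",
    "carbon dioxide reduction", "carbon neutral", "biofuel",
    "carbon cycle", "autotrophic", "RuBisCO", "Calvin cycle", "carboxylase",
]


def tag_case_study(title, abstract, keywords=CASE_STUDY_KEYWORDS):
    """Return (case_study_flag, confidence) for one artifact.

    Assumption: matching is a case-insensitive substring test on the
    concatenated title and abstract.
    """
    text = f"{title or ''} {abstract or ''}".lower()
    matched = [kw for kw in keywords if kw.lower() in text]
    confidence = len(matched) / len(keywords)  # fraction of keywords matched
    return bool(matched), confidence


flag, score = tag_case_study(
    "Engineering RuBisCO for improved CO2 fixation",
    "We enhance the Calvin cycle in an autotrophic host.",
)
print(flag, round(score, 3))  # True 0.364 (4 of 11 keywords matched)
```

Because the flag is set by any single match, the confidence score is the more useful field for filtering out artifacts that mention only one broad term such as biofuel.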
Embeddings
Each artifact’s title and abstract were concatenated and encoded with a pre-trained sentence-transformer model (all-MiniLM-L6-v2). This model encodes text into a 384-dimensional vector that captures semantic meaning.
Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. arXiv:1908.10084
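The encoding step might look like the sketch below, assuming the `sentence-transformers` package. The helper names and the single-space separator between title and abstract are assumptions; the project's own code may join the fields differently.

```python
def build_text(title, abstract, sep=" "):
    """Concatenate title and abstract, skipping missing or blank fields."""
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    return sep.join(parts)


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode a list of strings into dense vectors (384-dimensional for this model)."""
    # Imported lazily so build_text can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)  # NumPy array of shape (len(texts), 384)


docs = [
    build_text("Engineering RuBisCO", "We improve carbon fixation in E. coli."),
    build_text("A title with no abstract", None),
]
print(docs)
# vectors = embed_texts(docs)  # downloads the model on first use
```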
Dimensionality reduction (UMAP)
Embeddings were projected to 2D with UMAP for visualization. UMAP was chosen over t-SNE because it tends to preserve more of the global structure while still retaining local neighborhoods.
McInnes et al. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
Key parameters (adjustable in config/settings.yaml):
| Parameter | Value | Meaning |
|---|---|---|
| n_neighbors | 15 | Local neighborhood size |
| min_dist | 0.1 | Minimum spread in projected space |
| metric | cosine | Distance metric for embeddings |
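As a rough illustration, the parameters above map directly onto the `umap-learn` constructor. The function name `project_2d` and the fixed `random_state` are assumptions added for reproducibility of the sketch; they are not taken from config/settings.yaml.

```python
UMAP_PARAMS = {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}


def project_2d(embeddings, params=UMAP_PARAMS, random_state=42):
    """Project high-dimensional embeddings to 2D for plotting."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_components=2, random_state=random_state, **params)
    return reducer.fit_transform(embeddings)  # array of shape (n_samples, 2)


# coords = project_2d(vectors)  # `vectors` would be the sentence embeddings
```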
Clustering (HDBSCAN)
Clusters were identified with HDBSCAN, a density-based algorithm that does not require specifying the number of clusters in advance. Points that do not belong to any cluster are labeled as noise (label = -1).
Campello et al. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. ECML PKDD 2013.
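The sketch below shows one way to run the clustering and summarize its labels, assuming the `hdbscan` package. The `min_cluster_size` value is a placeholder, and this page does not say whether clustering ran on the full embeddings or on the 2D projection, so the input is left generic.

```python
from collections import Counter


def cluster_points(points, min_cluster_size=10):
    """Cluster points with HDBSCAN; noise points receive the label -1."""
    import hdbscan  # assumption: the standalone hdbscan package

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(points)


def summarize_labels(labels):
    """Count clusters and noise points in an HDBSCAN label array."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # label -1 marks noise, not a cluster
    total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / total if total else 0.0,
    }


print(summarize_labels([0, 0, 1, -1, 1, -1, 2, 0]))
# {'n_clusters': 3, 'n_noise': 2, 'noise_fraction': 0.25}
```

Reporting the noise fraction alongside the cluster count helps show how much of the corpus was left unassigned.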
Geocoding
Institution locations from OpenAlex and Lens.org were converted to latitude/longitude using Nominatim (OpenStreetMap), with results cached locally. Analysis is at the city level.
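A cached lookup of this kind could be sketched as follows, assuming the `geopy` client for Nominatim and a JSON file as the local cache. The function names, cache path, and `user_agent` string are hypothetical. The public Nominatim service also asks clients to stay at or below one request per second, which this sketch does not enforce.

```python
import json
from pathlib import Path


def make_nominatim_lookup(user_agent="synbio-corpus-example"):
    """Build a lookup(query) -> (lat, lon) or None backed by Nominatim."""
    from geopy.geocoders import Nominatim  # imported lazily; needs geopy

    geolocator = Nominatim(user_agent=user_agent)

    def lookup(query):
        location = geolocator.geocode(query)
        if location is None:
            return None
        return (location.latitude, location.longitude)

    return lookup


def geocode_cached(query, lookup, cache_path="geocode_cache.json"):
    """Return (lat, lon) for `query`, calling `lookup` only on a cache miss."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if query not in cache:
        result = lookup(query)
        cache[query] = list(result) if result else None  # misses are cached too
        path.write_text(json.dumps(cache))
    return tuple(cache[query]) if cache[query] else None


# lookup = make_nominatim_lookup()
# print(geocode_cached("Cambridge, United Kingdom", lookup))
```

Passing the lookup function in as an argument keeps the caching logic testable without network access.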