Results

Corpus overview

The pipeline currently covers papers and iGEM projects. Patents and parts will be added in a later pipeline run.

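| Artifact type | Count | Year range | Geocoded to a city |
| --- | --- | --- | --- |
| Papers | 9,449 | 2000–2026 | 8,720 (92%) |
| iGEM projects | 4,615 | 2009–2025 | 4,615 (100%) |
| Patents | pending | pending | pending |
| iGEM parts | pending | pending | pending |
| Total | 14,064 | 2000–2026 | 13,335 (95%) |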

Papers were retrieved from OpenAlex using core synthetic biology keywords (“synthetic biology”, “synthetic genomics”, “BioBrick”) and subfield keywords (“repressilator”, “minimal genome”, “xenobiology”, etc.). iGEM projects were geocoded using a three-tier pipeline: OpenAlex institution lookup → Claude Haiku city extraction → Nominatim coordinates. Paper coordinates were obtained by fetching full institution objects from OpenAlex (which include city and lat/lon) using institution IDs stored during ingestion.
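The three-tier geocoding order described above can be sketched as a fallback chain. This is a minimal illustration, not the pipeline's actual code: it assumes the institution lookup is tried first and that city extraction plus coordinate lookup run only when it fails. The function `geocode_project` and its three callable parameters are hypothetical stand-ins for the OpenAlex, Claude Haiku, and Nominatim calls, so no real API signatures are implied.

```python
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


def geocode_project(
    institution: str,
    description: str,
    lookup_institution: Callable[[str], Optional[Coords]],
    extract_city: Callable[[str], Optional[str]],
    geocode_city: Callable[[str], Optional[Coords]],
) -> Tuple[Optional[Coords], str]:
    """Return (coordinates, tier) for one iGEM project.

    The three callables are hypothetical placeholders for the real services;
    each returns None when it cannot resolve its input.
    """
    # Tier 1: institution lookup (OpenAlex in the pipeline described above).
    coords = lookup_institution(institution)
    if coords is not None:
        return coords, "institution"

    # Tier 2: extract a city name from free text (an LLM call in the pipeline).
    city = extract_city(description)
    if city is None:
        return None, "unresolved"

    # Tier 3: convert the city name to coordinates (Nominatim in the pipeline).
    coords = geocode_city(city)
    if coords is not None:
        return coords, "city"
    return None, "unresolved"
```

Returning the tier alongside the coordinates makes it possible to report how many projects each tier resolved, which is useful when auditing a 100% geocoding rate.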

Semantic space

All artifacts were embedded with SPECTER (Cohan et al., 2020), a transformer model pre-trained on scientific text that produces a 768-dimensional vector from each artifact's title and abstract. The embeddings were then reduced to 2D with UMAP for visualization.

In the shared projection, papers and iGEM projects occupy overlapping but distinguishable regions of semantic space. The overlap is consistent with the hypothesis that student projects and academic research share the same knowledge domain; the separation suggests that the two artifact types emphasize different parts of it.

See the Explorer for the interactive version.

Clustering

HDBSCAN was applied to the full 768-dimensional embedding space.

| Metric | Count |
| --- | --- |
| Total artifacts | 14,064 |
| Clusters found | 256 |
| Artifacts assigned to a cluster | 9,169 (65%) |
| Noise (unassigned) | 4,895 (35%) |

A 35% noise rate is typical for HDBSCAN on heterogeneous text corpora. Here it most likely reflects genuine semantic diversity within the synthetic biology literature rather than a pipeline failure.
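The cluster and noise counts reported here can be derived directly from the label vector. The sketch below assumes the common HDBSCAN convention that unassigned points carry the label -1; `summarize_labels` is a hypothetical helper name, not a function from the pipeline.

```python
from collections import Counter
from typing import Dict, Sequence, Union


def summarize_labels(labels: Sequence[int]) -> Dict[str, Union[int, float]]:
    """Summarize a clustering label vector where -1 marks noise points."""
    counts = Counter(labels)
    total = len(labels)
    noise = counts.get(-1, 0)
    # Every distinct label except -1 is a cluster.
    clusters = len(counts) - (1 if -1 in counts else 0)
    return {
        "total": total,
        "clusters": clusters,
        "assigned": total - noise,
        "noise": noise,
        "noise_share": noise / total if total else 0.0,
    }
```

Applied to the full label vector, the `noise_share` field should reproduce the reported 4,895 / 14,064 ≈ 35%.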

City-level patterns

Top 10 cities by total artifact count (papers + projects):

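| Rank | City | Papers | Projects | Total |
| --- | --- | --- | --- | --- |
| 1 | Beijing | 290 | 258 | 548 |
| 2 | Cambridge | 294 | 40 | 334 |
| 3 | Shanghai | 108 | 216 | 324 |
| 4 | London | 236 | 29 | 265 |
| 5 | Tokyo | 103 | 56 | 159 |
| 6 | Wuhan | 22 | 134 | 156 |
| 7 | Shenzhen | 10 | 145 | 155 |
| 8 | Paris | 117 | 37 | 154 |
| 9 | New York | 98 | 35 | 133 |
| 10 | San Diego | 127 | 0 | 127 |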

The high counts for Beijing and Shanghai reflect the rapid growth of Chinese synthetic biology research and iGEM participation since 2012. Cambridge, which here combines Cambridge, MA and Cambridge, UK, has the most academic papers of any city (294), anchored by MIT, Harvard, and the Wellcome Sanger Institute.
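A ranking of this kind can be produced by counting artifacts per city and type, then sorting by the combined total. The sketch below is illustrative only: it assumes one `(city, artifact_type)` record per geocoded artifact, with types named `"paper"` and `"project"`, and the helper `rank_cities` is hypothetical. How the pipeline handles papers with authors in several cities is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CityRow = Tuple[str, int, int, int]  # (city, papers, projects, total)


def rank_cities(records: Iterable[Tuple[str, str]], top_n: int = 10) -> List[CityRow]:
    """Rank cities by papers + projects, breaking ties alphabetically."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: {"paper": 0, "project": 0})
    for city, kind in records:
        if kind in ("paper", "project"):  # ignore any other artifact types
            counts[city][kind] += 1
    rows = [
        (city, c["paper"], c["project"], c["paper"] + c["project"])
        for city, c in counts.items()
    ]
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:top_n]
```

Note that city names alone do not disambiguate places such as the two Cambridges; keying on a (city, country) pair would keep them separate.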

Carbon capture subset

267 papers (2.8% of all papers) and 141 iGEM projects (3.1% of all projects) were tagged as carbon-capture related. Thirty cities have at least one carbon-capture artifact of each type, providing the basis for city-level cross-type analysis.
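The set of cities eligible for cross-type analysis is the intersection of cities with a tagged paper and cities with a tagged project. A minimal sketch, assuming one `(city, artifact_type)` record per carbon-capture artifact; `cities_with_both_types` is a hypothetical name, not the pipeline's own function.

```python
from typing import Iterable, Set, Tuple


def cities_with_both_types(records: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return cities that have at least one paper and at least one project."""
    paper_cities: Set[str] = set()
    project_cities: Set[str] = set()
    for city, kind in records:
        if kind == "paper":
            paper_cities.add(city)
        elif kind == "project":
            project_cities.add(city)
    # A city qualifies only if it appears in both sets.
    return paper_cities & project_cities
```

For example, `cities_with_both_types([("A", "paper"), ("A", "project"), ("B", "paper")])` returns `{"A"}`, since city B has no project.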

The full analysis is on the Case Study: Carbon Capture page.