Scientific Informatics: Live Integrations and Code
Built for Kymera's Principal Scientific Data Engineer role. All data on this page is fetched live from ChEMBL, UniProt, AACT PostgreSQL, and PubMed. Code samples cover Python, R/Shiny, SQL, Airflow, Docker, and AWS.
Skills Coverage
Kymera JD requirements and where each is demonstrated
Languages
Pipelines, APIs, automation, ETL
Interactive scientific dashboards
AACT live DB, DMTA schemas, views
CLI automation, cron, scripting
Scientific Platforms
Data model + REST API integration
Assay data pipeline patterns
DMTA design data flows
Live bioactivity integration (this page)
Live protein data (this page)
REST + AACT PostgreSQL
Infrastructure
CDK IaC, Fargate, lifecycle rules
Containerized R + Python stack
DMTA workflow orchestration DAG
Automated test + deploy
Data Engineering
DMTA compound/assay/results schema
Validated, paginated, tested
Outlier detection, curation flags
UniProt integration, protein data
Scoped at RCH + Vertex + Pfizer
Target Proteins
STAT6, IRAK4, and IRF5 from UniProt REST API
Signal transducer and activator of transcription 6
Carries out a dual function: signal transduction and activation of transcription. Involved in IL4/interleukin-4- and IL3/interleukin-3-mediated signaling
Interleukin-1 receptor-associated kinase 4
Serine/threonine-protein kinase that plays a critical role in initiating innate immune response against foreign pathogens. Involved in Toll-like receptor (TLR) and IL-1R signaling pathways (PubMed:17878374). Is rapidly recruited by MYD88 to the receptor-signaling complex upon TLR activation to form the Myddosome together with IRAK2. Phosphorylates initially IRAK1, thus stimulating the kinase activ
Interferon regulatory factor 5
Transcription factor that plays a critical role in innate immunity by activating expression of type I interferon (IFN) IFNA and INFB and inflammatory cytokines downstream of endolysosomal toll-like receptors TLR7, TLR8 and TLR9 (PubMed:11303025, PubMed:15695821, PubMed:22412986, PubMed:25326418, PubMed:32433612). Regulates the transcription of type I IFN genes (IFN-alpha and IFN-beta) and IFN-stim
https://rest.uniprot.org/uniprotkb/{P42226,Q9NWZ3,Q13568}.jsonNo key required, 24h cacheChEMBL Bioactivity Data
31,876 total measurements across STAT6, IRAK4, IRF5
| Molecule | Type | Value (nM) | pChEMBL | Assay | Year |
|---|---|---|---|---|---|
| CHEMBL487451 | IC50 | 0.7 | 9.15 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL593171 | IC50 | 0.7 | 9.15 | Inhibition of STAT6 in IL4-stimulated human FW4 re… | 2009 |
| CHEMBL470963 | IC50 | 1.4 | 8.85 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL488814 | IC50 | 1.6 | 8.80 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL519249 | IC50 | 1.8 | 8.74 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL512874 | IC50 | 1.8 | 8.74 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| Molecule | Type | Value (nM) | pChEMBL | Assay | Year |
|---|---|---|---|---|---|
| CHEMBL4443947 | IC50 | 0.022 | 10.66 | Binding affinity to human IRK4 using myelin basic … | 2019 |
| CHEMBL4556091 | IC50 | 0.026 | 10.59 | Inhibition of IRAK4 in human whole blood assessed … | 2020 |
| CHEMBL4566431 | IC50 | 0.078 | 10.11 | Inhibition of IRAK4 in human whole blood assessed … | 2020 |
| CHEMBL4545898 | IC50 | 0.081 | 10.09 | Inhibition of IRAK4 in human whole blood assessed … | 2020 |
| CHEMBL6030179 | IC50 | 0.1 | 10.00 | IRAK4 Enzymatic DELFIA Assay, Protocol A: This is … | 2023 |
| CHEMBL4066705 | IC50 | 0.1 | 10.00 | Inhibition of full-length IRAK4 (unknown origin) i… | 2024 |
| Molecule | Type | Value (nM) | pChEMBL | Assay | Year |
|---|---|---|---|---|---|
| CHEMBL463914 | IC50 | 3,400 | 5.47 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL3960997 | IC50 | 5,600 | 5.25 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL463914 | IC50 | 5,600 | 5.25 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL1014 | IC50 | 5,600 | 5.25 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL1917196 | IC50 | 6,800 | 5.17 | Inhibition of C-terminal His6-tagged human DNA pol… | 2011 |
| CHEMBL1917198 | IC50 | 8,100 | 5.09 | Inhibition of C-terminal His6-tagged human DNA pol… | 2011 |
https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id=CHEMBL5401&assay_type=B&pchembl_value__isnull=false&order_by=-pchembl_valueAACT PostgreSQL
Live queries against aact-db.ctti-clinicaltrials.org, the PostgreSQL replica of ClinicalTrials.gov
SELECT DISTINCT ON (s.nct_id)
s.nct_id, s.brief_title, s.overall_status, s.phase,
sp.name AS sponsor, s.enrollment, s.start_date
FROM studies s
JOIN interventions i ON i.nct_id = s.nct_id
JOIN sponsors sp ON sp.nct_id = s.nct_id AND sp.lead_or_collaborator = 'lead'
WHERE (
LOWER(i.name) LIKE '%protac%' -- TPD small molecules
OR LOWER(i.name) LIKE '%degrader%'
OR LOWER(s.brief_title) LIKE '%degrader%'
)
AND s.overall_status IN ('RECRUITING', 'ACTIVE_NOT_RECRUITING', 'NOT_YET_RECRUITING')
ORDER BY s.nct_id, s.start_date DESC NULLS LAST
LIMIT 8;Active TPD / degrader trials
By trial phase
Query results loading…
Top sponsors
Loading…
DMTA Cycle: Design, Make, Test, Analyze
Tools, data inputs, and outputs at each stage of drug discovery
Tools
Data in
Target structures, docking scores, ADMET predictions
Output
Virtual compound library (SMILES + predicted properties)
Tools
Data in
Synthesis routes, reaction yields, batch purity (HPLC/NMR)
Output
Registered compound batches with analytical certificates
Tools
Data in
IC50, DC50, Dmax, selectivity panels, DMPK assays
Output
Validated bioactivity dataset loaded to research DB
Tools
Data in
SAR trends, multiparameter optimization, statistical models
Output
Go/no-go decisions, next design hypotheses
Integration Architecture
Instruments and APIs to storage to scientist-facing tools
Data Sources
Ingestion & Processing
Storage & Platform
Applications
Code Samples
Python ETL, R Shiny app, PostgreSQL DMTA schema, Airflow DAG, Docker and AWS CDK
ETL pipeline that fetches compound bioactivity data from ChEMBL, validates records, flags outliers and duplicates, and writes a curated Parquet file. Mirrors the CDD Vault / D360 ingestion pattern.
#!/usr/bin/env python3
"""
ChEMBL → Scientific Database ETL Pipeline
Fetches bioactivity data for Kymera target proteins,
validates and loads into research data store.
Pattern mirrors CDD Vault / D360 integration workflows.
"""
import numpy as np
import pandas as pd
import requests
import logging
from dataclasses import asdict, dataclass, field
from typing import Iterator
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"
# Kymera priority targets
TARGETS = {
"STAT6": "CHEMBL5401",
"IRAK4": "CHEMBL3778",
"IRF5": "CHEMBL5365",
}
@dataclass
class ActivityRecord:
target: str
molecule_id: str
smiles: str | None
assay_type: str
standard_type: str # IC50, Ki, Kd, etc.
value_nm: float | None
pchembl: float | None
assay_description: str
journal: str | None
year: int | None
flags: list[str] = field(default_factory=list)
def fetch_activities(target_name: str, chembl_id: str) -> Iterator[dict]:
"""Paginate ChEMBL activity endpoint for a given target."""
offset, limit = 0, 100
while True:
url = (
f"{CHEMBL_BASE}/activity.json"
f"?target_chembl_id={chembl_id}"
f"&assay_type=B&pchembl_value__isnull=false"
f"&limit={limit}&offset={offset}"
)
r = requests.get(url, timeout=30)
r.raise_for_status()
data = r.json()
rows = data.get("activities", [])
log.info(f"{target_name}: fetched {len(rows)} records (offset={offset})")
yield from rows
if not data["page_meta"]["next"]:
break
offset += limit
def parse_record(target: str, raw: dict) -> ActivityRecord:
"""Parse and validate a raw ChEMBL activity dict."""
try:
value = float(raw["standard_value"]) if raw.get("standard_value") else None
except (ValueError, TypeError):
value = None
try:
pchembl = float(raw["pchembl_value"]) if raw.get("pchembl_value") else None
except (ValueError, TypeError):
pchembl = None
flags = []
if raw.get("potential_duplicate"):
flags.append("POTENTIAL_DUPLICATE")
if value is not None and value < 0:
flags.append("NEGATIVE_VALUE")
if raw.get("data_validity_comment"):
flags.append(f"DVC:{raw['data_validity_comment']}")
return ActivityRecord(
target=target,
molecule_id=raw.get("molecule_chembl_id", ""),
smiles=raw.get("canonical_smiles"),
assay_type=raw.get("assay_type", ""),
standard_type=raw.get("standard_type", ""),
value_nm=value,
pchembl=pchembl,
assay_description=raw.get("assay_description", ""),
journal=raw.get("document_journal"),
year=raw.get("document_year"),
flags=flags,
)
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
"""
Curate and validate bioactivity data:
- Remove duplicates and flagged records
- Filter to accepted activity types
- Enforce value range (1 pM – 100 µM = 0.001–100000 nM)
"""
initial = len(df)
df = df[~df["flags"].apply(lambda f: "POTENTIAL_DUPLICATE" in f)]
df = df[df["standard_type"].isin(["IC50", "Ki", "Kd", "EC50", "GI50"])]
df = df[df["value_nm"].between(0.001, 100_000)]
df = df.dropna(subset=["smiles", "value_nm"])
log.info(f"Validation: {initial} → {len(df)} records ({initial - len(df)} removed)")
return df
def run_pipeline(output_path: str = "kymera_bioactivity.parquet") -> pd.DataFrame:
"""
Full ETL: Extract → Transform → Validate → Load
Output: curated Parquet file ready for downstream analysis
or CDD Vault / D360 import.
"""
all_records = []
for target_name, chembl_id in TARGETS.items():
for raw in fetch_activities(target_name, chembl_id):
all_records.append(asdict(parse_record(target_name, raw)))
df = pd.DataFrame(all_records)
df = validate_dataframe(df)
df["pIC50"] = df["pchembl"].fillna(9 - np.log10(df["value_nm"]))
df.to_parquet(output_path, index=False)
log.info(f"Loaded {len(df)} curated records → {output_path}")
return df
if __name__ == "__main__":
df = run_pipeline()
print(df.groupby("target")[["value_nm", "pchembl"]].describe().round(2))