Portfolio by Johnny RiceInformatics demo
ClariTrial
ClariTrial/Scientific Informatics Demo
Informatics DemoBuilt for Kymera TherapeuticsAll data live

Scientific Informatics: Live Integrations and Code

Built for Kymera's Principal Scientific Data Engineer role. All data on this page is fetched live from ChEMBL, UniProt, AACT PostgreSQL, and PubMed. Code samples cover Python, R/Shiny, SQL, Airflow, Docker, and AWS.

ChEMBL activities
31,876
STAT6, IRAK4, IRF5
UniProt proteins
3
Swiss-Prot reviewed
TPD trials (live SQL)
AACT PostgreSQL
Code samples
5
Python, R, SQL, Docker, AWS

Skills Coverage

Kymera JD requirements and where each is demonstrated

Languages

Python

Pipelines, APIs, automation, ETL

R / R Shiny

Interactive scientific dashboards

SQL (PostgreSQL)

AACT live DB, DMTA schemas, views

Linux / Bash

CLI automation, cron, scripting

Scientific Platforms

CDD Vault

Data model + REST API integration

D360 / Dotmatics

Assay data pipeline patterns

LiveDesign

DMTA design data flows

ChEMBL API

Live bioactivity integration (this page)

UniProt API

Live protein data (this page)

ClinicalTrials.gov

REST + AACT PostgreSQL

Infrastructure

AWS (EC2/ECS/S3/RDS)

CDK IaC, Fargate, lifecycle rules

Docker / Compose

Containerized R + Python stack

Airflow / Prefect

DMTA workflow orchestration DAG

CI/CD (GitHub Actions)

Automated test + deploy

Data Engineering

Data Modeling

DMTA compound/assay/results schema

ETL / ELT Pipelines

Validated, paginated, tested

Data Validation / QC

Outlier detection, curation flags

Proteomics workflows

UniProt integration, protein data

Vendor management

Scoped at RCH + Vertex + Pfizer

Target Proteins

STAT6, IRAK4, and IRF5 from UniProt REST API

Live data
STAT6P42226

Signal transducer and activator of transcription 6

Carries out a dual function: signal transduction and activation of transcription. Involved in IL4/interleukin-4- and IL3/interleukin-3-mediated signaling

847 aaSwiss-Prot ✓CytoplasmNucleus
IRAK4Q9NWZ3

Interleukin-1 receptor-associated kinase 4

Serine/threonine-protein kinase that plays a critical role in initiating innate immune response against foreign pathogens. Involved in Toll-like receptor (TLR) and IL-1R signaling pathways (PubMed:17878374). Is rapidly recruited by MYD88 to the receptor-signaling complex upon TLR activation to form the Myddosome together with IRAK2. Phosphorylates initially IRAK1, thus stimulating the kinase activ

460 aaSwiss-Prot ✓Cytoplasm
IRF5Q13568

Interferon regulatory factor 5

Transcription factor that plays a critical role in innate immunity by activating expression of type I interferon (IFN) IFNA and INFB and inflammatory cytokines downstream of endolysosomal toll-like receptors TLR7, TLR8 and TLR9 (PubMed:11303025, PubMed:15695821, PubMed:22412986, PubMed:25326418, PubMed:32433612). Regulates the transcription of type I IFN genes (IFN-alpha and IFN-beta) and IFN-stim

498 aaSwiss-Prot ✓CytoplasmNucleus
GEThttps://rest.uniprot.org/uniprotkb/{P42226,Q9NWZ3,Q13568}.jsonNo key required, 24h cache

ChEMBL Bioactivity Data

31,876 total measurements across STAT6, IRAK4, IRF5

Live data
STAT6top hits by pChEMBL
224 total
MoleculeTypeValue (nM)pChEMBLAssayYear
CHEMBL487451IC500.79.15Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL593171IC500.79.15Inhibition of STAT6 in IL4-stimulated human FW4 re2009
CHEMBL470963IC501.48.85Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL488814IC501.68.80Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL519249IC501.88.74Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL512874IC501.88.74Inhibition of STAT6 activation in FW4 reporter cel2008
IRAK4top hits by pChEMBL
31,578 total
MoleculeTypeValue (nM)pChEMBLAssayYear
CHEMBL4443947IC500.02210.66Binding affinity to human IRK4 using myelin basic 2019
CHEMBL4556091IC500.02610.59Inhibition of IRAK4 in human whole blood assessed 2020
CHEMBL4566431IC500.07810.11Inhibition of IRAK4 in human whole blood assessed 2020
CHEMBL4545898IC500.08110.09Inhibition of IRAK4 in human whole blood assessed 2020
CHEMBL6030179IC500.110.00IRAK4 Enzymatic DELFIA Assay, Protocol A: This is 2023
CHEMBL4066705IC500.110.00Inhibition of full-length IRAK4 (unknown origin) i2024
IRF5top hits by pChEMBL
74 total
MoleculeTypeValue (nM)pChEMBLAssayYear
CHEMBL463914IC503,4005.47Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL3960997IC505,6005.25Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL463914IC505,6005.25Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL1014IC505,6005.25Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL1917196IC506,8005.17Inhibition of C-terminal His6-tagged human DNA pol2011
CHEMBL1917198IC508,1005.09Inhibition of C-terminal His6-tagged human DNA pol2011
GEThttps://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id=CHEMBL5401&assay_type=B&pchembl_value__isnull=false&order_by=-pchembl_value

AACT PostgreSQL

Live queries against aact-db.ctti-clinicaltrials.org, the PostgreSQL replica of ClinicalTrials.gov

Live data
aact-db.ctti-clinicaltrials.org · 5432 / aactLive query executed server-side
SELECT DISTINCT ON (s.nct_id)
  s.nct_id, s.brief_title, s.overall_status, s.phase,
  sp.name AS sponsor, s.enrollment, s.start_date
FROM studies s
JOIN interventions i  ON i.nct_id = s.nct_id
JOIN sponsors sp      ON sp.nct_id = s.nct_id AND sp.lead_or_collaborator = 'lead'
WHERE (
  LOWER(i.name) LIKE '%protac%'      -- TPD small molecules
  OR LOWER(i.name) LIKE '%degrader%'
  OR LOWER(s.brief_title) LIKE '%degrader%'
)
AND s.overall_status IN ('RECRUITING', 'ACTIVE_NOT_RECRUITING', 'NOT_YET_RECRUITING')
ORDER BY s.nct_id, s.start_date DESC NULLS LAST
LIMIT 8;

Active TPD / degrader trials

AACT database connection — data loading

By trial phase

Query results loading…

Top sponsors

Loading…

DMTA Cycle: Design, Make, Test, Analyze

Tools, data inputs, and outputs at each stage of drug discovery

Step 1Design

Tools

SchrödingerLiveDesignRDKitPython

Data in

Target structures, docking scores, ADMET predictions

Output

Virtual compound library (SMILES + predicted properties)

Step 2Make

Tools

ELN (Signals)CDD VaultLIMSBarcode scanning

Data in

Synthesis routes, reaction yields, batch purity (HPLC/NMR)

Output

Registered compound batches with analytical certificates

Step 3Test

Tools

D360 / Dotmaticsplate readersAirflow DAG

Data in

IC50, DC50, Dmax, selectivity panels, DMPK assays

Output

Validated bioactivity dataset loaded to research DB

Step 4Analyze

Tools

R / ShinyPython (pandas, scipy)SpotfireSQL

Data in

SAR trends, multiparameter optimization, statistical models

Output

Go/no-go decisions, next design hypotheses

Integration Architecture

Instruments and APIs to storage to scientist-facing tools

Data Sources

ChEMBLBioactivity
UniProtProteins
ClinicalTrials.govTrials
PubMedLiterature
Instrument exportsAssays

Ingestion & Processing

Python ETLAirflow DAG
SQL CurationPostgreSQL
R scriptsAnalysis
QC PipelineValidation

Storage & Platform

Amazon RDSPostgreSQL
Amazon S3Raw data
CDD Vault / D360Scientific

Applications

R ShinyScientists
REST APIsIntegrations
DashboardsLeadership

Code Samples

Python ETL, R Shiny app, PostgreSQL DMTA schema, Airflow DAG, Docker and AWS CDK

PythonR / ShinySQLApache AirflowDockerAWS CDK

ETL pipeline that fetches compound bioactivity data from ChEMBL, validates records, flags outliers and duplicates, and writes a curated Parquet file. Mirrors the CDD Vault / D360 ingestion pattern.

chembl_etl.py
#!/usr/bin/env python3
"""
ChEMBL → Scientific Database ETL Pipeline
Fetches bioactivity data for Kymera target proteins,
validates and loads into research data store.

Pattern mirrors CDD Vault / D360 integration workflows.
"""

import numpy as np
import pandas as pd
import requests
import logging
from dataclasses import asdict, dataclass, field
from typing import Iterator

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"

# Kymera priority targets
TARGETS = {
    "STAT6":  "CHEMBL5401",
    "IRAK4":  "CHEMBL3778",
    "IRF5":   "CHEMBL5365",
}

@dataclass
class ActivityRecord:
    target: str
    molecule_id: str
    smiles: str | None
    assay_type: str
    standard_type: str           # IC50, Ki, Kd, etc.
    value_nm: float | None
    pchembl: float | None
    assay_description: str
    journal: str | None
    year: int | None
    flags: list[str] = field(default_factory=list)

def fetch_activities(target_name: str, chembl_id: str) -> Iterator[dict]:
    """Paginate ChEMBL activity endpoint for a given target."""
    offset, limit = 0, 100
    while True:
        url = (
            f"{CHEMBL_BASE}/activity.json"
            f"?target_chembl_id={chembl_id}"
            f"&assay_type=B&pchembl_value__isnull=false"
            f"&limit={limit}&offset={offset}"
        )
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        data = r.json()
        rows = data.get("activities", [])
        log.info(f"{target_name}: fetched {len(rows)} records (offset={offset})")
        yield from rows
        if not data["page_meta"]["next"]:
            break
        offset += limit

def parse_record(target: str, raw: dict) -> ActivityRecord:
    """Parse and validate a raw ChEMBL activity dict."""
    try:
        value = float(raw["standard_value"]) if raw.get("standard_value") else None
    except (ValueError, TypeError):
        value = None

    try:
        pchembl = float(raw["pchembl_value"]) if raw.get("pchembl_value") else None
    except (ValueError, TypeError):
        pchembl = None

    flags = []
    if raw.get("potential_duplicate"):
        flags.append("POTENTIAL_DUPLICATE")
    if value is not None and value < 0:
        flags.append("NEGATIVE_VALUE")
    if raw.get("data_validity_comment"):
        flags.append(f"DVC:{raw['data_validity_comment']}")

    return ActivityRecord(
        target=target,
        molecule_id=raw.get("molecule_chembl_id", ""),
        smiles=raw.get("canonical_smiles"),
        assay_type=raw.get("assay_type", ""),
        standard_type=raw.get("standard_type", ""),
        value_nm=value,
        pchembl=pchembl,
        assay_description=raw.get("assay_description", ""),
        journal=raw.get("document_journal"),
        year=raw.get("document_year"),
        flags=flags,
    )

def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Curate and validate bioactivity data:
    - Remove duplicates and flagged records
    - Filter to accepted activity types
    - Enforce value range (1 pM – 100 µM = 0.001–100000 nM)
    """
    initial = len(df)
    df = df[~df["flags"].apply(lambda f: "POTENTIAL_DUPLICATE" in f)]
    df = df[df["standard_type"].isin(["IC50", "Ki", "Kd", "EC50", "GI50"])]
    df = df[df["value_nm"].between(0.001, 100_000)]
    df = df.dropna(subset=["smiles", "value_nm"])
    log.info(f"Validation: {initial} → {len(df)} records ({initial - len(df)} removed)")
    return df

def run_pipeline(output_path: str = "kymera_bioactivity.parquet") -> pd.DataFrame:
    """
    Full ETL: Extract → Transform → Validate → Load
    Output: curated Parquet file ready for downstream analysis
    or CDD Vault / D360 import.
    """
    all_records = []
    for target_name, chembl_id in TARGETS.items():
        for raw in fetch_activities(target_name, chembl_id):
            all_records.append(asdict(parse_record(target_name, raw)))

    df = pd.DataFrame(all_records)
    df = validate_dataframe(df)
    df["pIC50"] = df["pchembl"].fillna(9 - np.log10(df["value_nm"]))
    df.to_parquet(output_path, index=False)
    log.info(f"Loaded {len(df)} curated records → {output_path}")
    return df

if __name__ == "__main__":
    df = run_pipeline()
    print(df.groupby("target")[["value_nm", "pchembl"]].describe().round(2))