Skip to main content

Seed Dataset Generator Plugin

The Seed Dataset Generator is a built-in Schema Builder translator plugin that converts your active JSON-LD schema into realistic SQL INSERT statements — ready to load into DuckDB or PostgreSQL for analytics dashboards, integration testing, or Superset demos.


The Full Pipeline

JSON-LD Schema (active in Schema Builder)
↓ Mapper View → MappingRow[]
↓ sql-duckdb plugin → schema.sql (CREATE TABLE + FK constraints)
↓ fake-data-sql plugin → inserts.sql (N rows per table, faker.js)
↓ duckdb clinical.duckdb < schema.sql
↓ duckdb clinical.duckdb < inserts.sql
↓ Apache Superset → Live Dashboard

Quick Start

1. Open your schema and map nodes

  1. Open a .jsonld file in Schema Builder.
  2. Open the Mapper View (toolbar → map icon).
  3. Nodes auto-load; review each row's mapTo (table/column name) and targetType.
  4. Select all rows you want to include (checkbox column).

2. Generate the DDL schema

In Target Format, choose SQL DDL (DuckDB) and click Export → save as schema.sql.

3. Generate seed data

In Target Format, choose Seed Data — SQL INSERTs (faker.js) and click Export → save as inserts.sql.

4. Load into DuckDB

Using the Docker sidecar from the SureCentric stack:

# Start the stack
docker compose up -d postgres redis superset duckdb

# Apply schema
docker compose run --rm duckdb sh -c \
"duckdb /data/clinical.duckdb < /generated/schema.sql"

# Load seed rows
docker compose run --rm duckdb sh -c \
"duckdb /data/clinical.duckdb < /generated/inserts.sql"

Or with a local DuckDB binary:

duckdb clinical.duckdb < schema.sql
duckdb clinical.duckdb < inserts.sql

5. Connect Apache Superset

  1. Open Superset at http://localhost:8088 (admin / admin).
  2. + Database → DuckDB — connection string:
    duckdb:////data/clinical.duckdb
  3. Add a dataset for each table.
  4. Build charts: count by status, enrollment timeline, diagnosis distribution.

What Gets Generated

Row count

Each owl:Class table defaults to 50 rows. To override, enter a number in the notes field of the TABLE mapping row (e.g. 100 generates 100 rows).

IRI primary key

Every row gets a stable IRI primary key:

https://<sourceUri>/patient/1
https://<sourceUri>/patient/2
...

Column value heuristics

The plugin uses a two-tier value strategy:

1. Column name heuristics (clinical-domain aware)

Checked first — highest priority. Column names are matched against patterns:

Column name patternGenerated value
patient_name, subject_name, full_namefaker.person.fullName()
first_name, given_namefaker.person.firstName()
last_name, family_name, surnamefaker.person.lastName()
emailfaker.internet.email()
phone, telephonefaker.phone.number()
site, center, hospital, facility, institutionfaker.company.name()
countryfaker.location.country()
country_codefaker.location.countryCode()
cityfaker.location.city()
state, province, regionfaker.location.state()
zip, postalfaker.location.zipCode()
address, streetfaker.location.streetAddress()
description, comment, note, remarkfaker.lorem.sentence()
title, label, namefaker.lorem.words(2)
statusOne of: active, inactive, pending, completed, withdrawn
gender, sexOne of: M, F, U
race, ethnicityStandard clinical race/ethnicity values
dose, dosage, amountFloat 0.5 – 1000
unitOne of: mg, mL, mg/kg, IU, mcg
code6-char alphanumeric e.g. A7X3K9
ageInteger 18 – 90
weightFloat 40 – 150
heightFloat 140 – 210
score, value, result, measureFloat 0 – 100

2. targetType fallback

If no name heuristic matches:

targetTypeGenerated value
INTEGER, INT, BIGINTfaker.number.int({min:1, max:9999})
FLOAT, DOUBLE, DECIMALfaker.number.float({fractionDigits:2})
BOOLEANTRUE or FALSE
DATEPast date up to 5 years ago: '2023-04-15'
DATETIME, TIMESTAMPPast datetime: '2023-04-15 09:32:11.000'
UUIDfaker.string.uuid()
URI, IRI, REFERENCEfaker.internet.url() — or FK reference (see below)
TEXT, STRING, VARCHAR (default)faker.lorem.words(3)

Foreign key resolution

When a column is detected as a foreign key (via inferForeignKey() — same logic as the SQL DDL plugins), the generated value is a valid IRI from the referenced table, picked round-robin across its generated rows.

This preserves referential integrity — every FK column references a row that actually exists.

Example: if diagnosis has a patient column typed as URI or IRI, and patient is a known table, the patient column will contain values like:

'https://example.com/schema/patient/1'
'https://example.com/schema/patient/2'
...

Output Format

The generated inserts.sql looks like:

-- Auto-generated seed data from: https://example.com/eclinical
-- Schema Builder — Seed Dataset Generator (faker.js)
-- Generated: 2026-04-09T10:48:57Z
-- Rows per table: configured in notes field (default 50)

-- patient (50 rows)
INSERT INTO patient (iri, patient_name, site, status, age, gender, race) VALUES
('https://example.com/eclinical/patient/1', 'Margaret O''Brien', 'Acme Medical Center', 'active', 34, 'F', 'White'),
('https://example.com/eclinical/patient/2', 'James Thornton', 'Harborview Hospital', 'active', 51, 'M', 'Hispanic'),
...;

-- diagnosis (50 rows)
INSERT INTO diagnosis (iri, patient, diagnosis_code, description, onset_date) VALUES
('https://example.com/eclinical/diagnosis/1', 'https://example.com/eclinical/patient/1', 'D3K7A2', 'At the house.', '2024-01-22'),
...;

Reproducibility

The generator uses faker.seed(42) — every export of the same mapping produces identical data. This is useful for:

  • Repeatable integration tests
  • Consistent demo datasets committed to source control
  • QA scripts that rely on known row content

To get different data, pass the output through sed 's/faker.seed(42)/faker.seed(999)/' or modify the plugin source.


Plugin Identity

FieldValue
idcom.sureclinical.seed-data-generator
pluginTypetranslator
formatIdseed-data-sql
version1.0.0
authorSureClinical
Output extension.inserts.sql

Advanced: Building a Superset Dashboard

After loading schema.sql + inserts.sql into DuckDB, here is a quick dashboard recipe:

Chart typeDatasetMetricDimension
Big NumberpatientCOUNT(*)
Bar ChartdiagnosisCOUNT(*)diagnosis_code
Time-seriespatientCOUNT(*)onset_date (truncate to month)
Pie ChartpatientCOUNT(*)status
Tablediagnosis* columns

Save as dashboard "eClinical Model — Live" and share the URL as the integration demo artifact.


See Also