Seed Dataset Generator Plugin
The Seed Dataset Generator is a built-in Schema Builder translator plugin that converts your active JSON-LD schema into realistic SQL INSERT statements — ready to load into DuckDB or PostgreSQL for analytics dashboards, integration testing, or Superset demos.
The Full Pipeline
JSON-LD Schema (active in Schema Builder)
↓ Mapper View → MappingRow[]
↓ sql-duckdb plugin → schema.sql (CREATE TABLE + FK constraints)
↓ fake-data-sql plugin → inserts.sql (N rows per table, faker.js)
↓ duckdb clinical.duckdb < schema.sql
↓ duckdb clinical.duckdb < inserts.sql
↓ Apache Superset → Live Dashboard
Quick Start
1. Open your schema and map nodes
- Open a
.jsonldfile in Schema Builder. - Open the Mapper View (toolbar → map icon).
- Nodes auto-load; review each row's
mapTo(table/column name) andtargetType. - Select all rows you want to include (checkbox column).
2. Generate the DDL schema
In Target Format, choose SQL DDL (DuckDB) and click Export → save as schema.sql.
3. Generate seed data
In Target Format, choose Seed Data — SQL INSERTs (faker.js) and click Export → save as inserts.sql.
4. Load into DuckDB
Using the Docker sidecar from the SureCentric stack:
# Start the stack
docker compose up -d postgres redis superset duckdb
# Apply schema
docker compose run --rm duckdb sh -c \
"duckdb /data/clinical.duckdb < /generated/schema.sql"
# Load seed rows
docker compose run --rm duckdb sh -c \
"duckdb /data/clinical.duckdb < /generated/inserts.sql"
Or with a local DuckDB binary:
duckdb clinical.duckdb < schema.sql
duckdb clinical.duckdb < inserts.sql
5. Connect Apache Superset
- Open Superset at
http://localhost:8088(admin / admin). - + Database → DuckDB — connection string:
duckdb:////data/clinical.duckdb - Add a dataset for each table.
- Build charts: count by status, enrollment timeline, diagnosis distribution.
What Gets Generated
Row count
Each owl:Class table defaults to 50 rows. To override, enter a number in the notes field of the TABLE mapping row (e.g. 100 generates 100 rows).
IRI primary key
Every row gets a stable IRI primary key:
https://<sourceUri>/patient/1
https://<sourceUri>/patient/2
...
Column value heuristics
The plugin uses a two-tier value strategy:
1. Column name heuristics (clinical-domain aware)
Checked first — highest priority. Column names are matched against patterns:
| Column name pattern | Generated value |
|---|---|
patient_name, subject_name, full_name | faker.person.fullName() |
first_name, given_name | faker.person.firstName() |
last_name, family_name, surname | faker.person.lastName() |
email | faker.internet.email() |
phone, telephone | faker.phone.number() |
site, center, hospital, facility, institution | faker.company.name() |
country | faker.location.country() |
country_code | faker.location.countryCode() |
city | faker.location.city() |
state, province, region | faker.location.state() |
zip, postal | faker.location.zipCode() |
address, street | faker.location.streetAddress() |
description, comment, note, remark | faker.lorem.sentence() |
title, label, name | faker.lorem.words(2) |
status | One of: active, inactive, pending, completed, withdrawn |
gender, sex | One of: M, F, U |
race, ethnicity | Standard clinical race/ethnicity values |
dose, dosage, amount | Float 0.5 – 1000 |
unit | One of: mg, mL, mg/kg, IU, mcg |
code | 6-char alphanumeric e.g. A7X3K9 |
age | Integer 18 – 90 |
weight | Float 40 – 150 |
height | Float 140 – 210 |
score, value, result, measure | Float 0 – 100 |
2. targetType fallback
If no name heuristic matches:
targetType | Generated value |
|---|---|
INTEGER, INT, BIGINT | faker.number.int({min:1, max:9999}) |
FLOAT, DOUBLE, DECIMAL | faker.number.float({fractionDigits:2}) |
BOOLEAN | TRUE or FALSE |
DATE | Past date up to 5 years ago: '2023-04-15' |
DATETIME, TIMESTAMP | Past datetime: '2023-04-15 09:32:11.000' |
UUID | faker.string.uuid() |
URI, IRI, REFERENCE | faker.internet.url() — or FK reference (see below) |
TEXT, STRING, VARCHAR (default) | faker.lorem.words(3) |
Foreign key resolution
When a column is detected as a foreign key (via inferForeignKey() — same logic as the SQL DDL plugins), the generated value is a valid IRI from the referenced table, picked round-robin across its generated rows.
This preserves referential integrity — every FK column references a row that actually exists.
Example: if diagnosis has a patient column typed as URI or IRI, and patient is a known table, the patient column will contain values like:
'https://example.com/schema/patient/1'
'https://example.com/schema/patient/2'
...
Output Format
The generated inserts.sql looks like:
-- Auto-generated seed data from: https://example.com/eclinical
-- Schema Builder — Seed Dataset Generator (faker.js)
-- Generated: 2026-04-09T10:48:57Z
-- Rows per table: configured in notes field (default 50)
-- patient (50 rows)
INSERT INTO patient (iri, patient_name, site, status, age, gender, race) VALUES
('https://example.com/eclinical/patient/1', 'Margaret O''Brien', 'Acme Medical Center', 'active', 34, 'F', 'White'),
('https://example.com/eclinical/patient/2', 'James Thornton', 'Harborview Hospital', 'active', 51, 'M', 'Hispanic'),
...;
-- diagnosis (50 rows)
INSERT INTO diagnosis (iri, patient, diagnosis_code, description, onset_date) VALUES
('https://example.com/eclinical/diagnosis/1', 'https://example.com/eclinical/patient/1', 'D3K7A2', 'At the house.', '2024-01-22'),
...;
Reproducibility
The generator uses faker.seed(42) — every export of the same mapping produces identical data. This is useful for:
- Repeatable integration tests
- Consistent demo datasets committed to source control
- QA scripts that rely on known row content
To get different data, pass the output through sed 's/faker.seed(42)/faker.seed(999)/' or modify the plugin source.
Plugin Identity
| Field | Value |
|---|---|
id | com.sureclinical.seed-data-generator |
pluginType | translator |
formatId | seed-data-sql |
version | 1.0.0 |
author | SureClinical |
| Output extension | .inserts.sql |
Advanced: Building a Superset Dashboard
After loading schema.sql + inserts.sql into DuckDB, here is a quick dashboard recipe:
| Chart type | Dataset | Metric | Dimension |
|---|---|---|---|
| Big Number | patient | COUNT(*) | — |
| Bar Chart | diagnosis | COUNT(*) | diagnosis_code |
| Time-series | patient | COUNT(*) | onset_date (truncate to month) |
| Pie Chart | patient | COUNT(*) | status |
| Table | diagnosis | * columns | — |
Save as dashboard "eClinical Model — Live" and share the URL as the integration demo artifact.
See Also
- Developer Reference — full Schema Builder architecture
- Plugin Authoring Guide — how to build your own translator plugin
- CARD System — provision seed data directly into Superset via CARD Profiles