Seed Dataset Generator Plugin

The Seed Dataset Generator is a built-in Schema Builder translator plugin that converts your active JSON-LD schema into realistic SQL INSERT statements — ready to load into DuckDB or PostgreSQL for analytics dashboards, integration testing, or Superset demos.

The Full Pipeline

JSON-LD Schema (active in Schema Builder)
    ↓  Mapper View  →  MappingRow[]
    ↓  sql-duckdb plugin  →  schema.sql     (CREATE TABLE + FK constraints)
    ↓  fake-data-sql plugin  →  inserts.sql  (N rows per table, faker.js)
    ↓  duckdb clinical.duckdb < schema.sql
    ↓  duckdb clinical.duckdb < inserts.sql
    ↓  Apache Superset  →  Live Dashboard

Quick Start

1. Open your schema and map nodes

Open a .jsonld file in Schema Builder.
Open the Mapper View (toolbar → map icon).
Nodes auto-load; review each row's mapTo (table/column name) and targetType.
Select all rows you want to include (checkbox column).

2. Generate the DDL schema

In Target Format, choose SQL DDL (DuckDB) and click Export → save as schema.sql.

3. Generate seed data

In Target Format, choose Seed Data — SQL INSERTs (faker.js) and click Export → save as inserts.sql.

4. Load into DuckDB

Using the Docker sidecar from the SureCentric stack:

# Start the stack
docker compose up -d postgres redis superset duckdb

# Apply schema
docker compose run --rm duckdb sh -c \
  "duckdb /data/clinical.duckdb < /generated/schema.sql"

# Load seed rows
docker compose run --rm duckdb sh -c \
  "duckdb /data/clinical.duckdb < /generated/inserts.sql"

Or with a local DuckDB binary:

duckdb clinical.duckdb < schema.sql
duckdb clinical.duckdb < inserts.sql

5. Connect Apache Superset

Open Superset at http://localhost:8088 (admin / admin).
+ Database → DuckDB — connection string:
```
duckdb:////data/clinical.duckdb
```
Add a dataset for each table.
Build charts: count by status, enrollment timeline, diagnosis distribution.

What Gets Generated

Row count

Each owl:Class table defaults to 50 rows. To override, enter a number in the notes field of the TABLE mapping row (e.g. 100 generates 100 rows).

IRI primary key

Every row gets a stable IRI primary key:

https://<sourceUri>/patient/1
https://<sourceUri>/patient/2
...

Column value heuristics

The plugin uses a two-tier value strategy:

1. Column name heuristics (clinical-domain aware)

Checked first — highest priority. Column names are matched against patterns:

Column name pattern	Generated value
`patient_name`, `subject_name`, `full_name`	`faker.person.fullName()`
`first_name`, `given_name`	`faker.person.firstName()`
`last_name`, `family_name`, `surname`	`faker.person.lastName()`
`email`	`faker.internet.email()`
`phone`, `telephone`	`faker.phone.number()`
`site`, `center`, `hospital`, `facility`, `institution`	`faker.company.name()`
`country`	`faker.location.country()`
`country_code`	`faker.location.countryCode()`
`city`	`faker.location.city()`
`state`, `province`, `region`	`faker.location.state()`
`zip`, `postal`	`faker.location.zipCode()`
`address`, `street`	`faker.location.streetAddress()`
`description`, `comment`, `note`, `remark`	`faker.lorem.sentence()`
`title`, `label`, `name`	`faker.lorem.words(2)`
`status`	One of: `active`, `inactive`, `pending`, `completed`, `withdrawn`
`gender`, `sex`	One of: `M`, `F`, `U`
`race`, `ethnicity`	Standard clinical race/ethnicity values
`dose`, `dosage`, `amount`	Float 0.5 – 1000
`unit`	One of: `mg`, `mL`, `mg/kg`, `IU`, `mcg`
`code`	6-char alphanumeric e.g. `A7X3K9`
`age`	Integer 18 – 90
`weight`	Float 40 – 150
`height`	Float 140 – 210
`score`, `value`, `result`, `measure`	Float 0 – 100

2. `targetType` fallback

If no name heuristic matches:

`targetType`	Generated value
`INTEGER`, `INT`, `BIGINT`	`faker.number.int({min:1, max:9999})`
`FLOAT`, `DOUBLE`, `DECIMAL`	`faker.number.float({fractionDigits:2})`
`BOOLEAN`	`TRUE` or `FALSE`
`DATE`	Past date up to 5 years ago: `'2023-04-15'`
`DATETIME`, `TIMESTAMP`	Past datetime: `'2023-04-15 09:32:11.000'`
`UUID`	`faker.string.uuid()`
`URI`, `IRI`, `REFERENCE`	`faker.internet.url()` — or FK reference (see below)
`TEXT`, `STRING`, `VARCHAR` (default)	`faker.lorem.words(3)`

Foreign key resolution

When a column is detected as a foreign key (via inferForeignKey() — same logic as the SQL DDL plugins), the generated value is a valid IRI from the referenced table, picked round-robin across its generated rows.

This preserves referential integrity — every FK column references a row that actually exists.

Example: if diagnosis has a patient column typed as URI or IRI, and patient is a known table, the patient column will contain values like:

'https://example.com/schema/patient/1'
'https://example.com/schema/patient/2'
...

Output Format

The generated inserts.sql looks like:

-- Auto-generated seed data from: https://example.com/eclinical
-- Schema Builder — Seed Dataset Generator (faker.js)
-- Generated: 2026-04-09T10:48:57Z
-- Rows per table: configured in notes field (default 50)

-- patient (50 rows)
INSERT INTO patient (iri, patient_name, site, status, age, gender, race) VALUES
  ('https://example.com/eclinical/patient/1', 'Margaret O''Brien', 'Acme Medical Center', 'active', 34, 'F', 'White'),
  ('https://example.com/eclinical/patient/2', 'James Thornton', 'Harborview Hospital', 'active', 51, 'M', 'Hispanic'),
  ...;

-- diagnosis (50 rows)
INSERT INTO diagnosis (iri, patient, diagnosis_code, description, onset_date) VALUES
  ('https://example.com/eclinical/diagnosis/1', 'https://example.com/eclinical/patient/1', 'D3K7A2', 'At the house.', '2024-01-22'),
  ...;

Reproducibility

The generator uses faker.seed(42) — every export of the same mapping produces identical data. This is useful for:

Repeatable integration tests
Consistent demo datasets committed to source control
QA scripts that rely on known row content

To get different data, pass the output through sed 's/faker.seed(42)/faker.seed(999)/' or modify the plugin source.

Plugin Identity

Field	Value
`id`	`com.sureclinical.seed-data-generator`
`pluginType`	`translator`
`formatId`	`seed-data-sql`
`version`	`1.0.0`
`author`	SureClinical
Output extension	`.inserts.sql`

Advanced: Building a Superset Dashboard

After loading schema.sql + inserts.sql into DuckDB, here is a quick dashboard recipe:

Chart type	Dataset	Metric	Dimension
Big Number	`patient`	COUNT(*)	—
Bar Chart	`diagnosis`	COUNT(*)	`diagnosis_code`
Time-series	`patient`	COUNT(*)	`onset_date` (truncate to month)
Pie Chart	`patient`	COUNT(*)	`status`
Table	`diagnosis`	* columns	—

Save as dashboard "eClinical Model — Live" and share the URL as the integration demo artifact.

The Full Pipeline​

Quick Start​

1. Open your schema and map nodes​

2. Generate the DDL schema​

3. Generate seed data​

4. Load into DuckDB​

5. Connect Apache Superset​

What Gets Generated​

Row count​

IRI primary key​

Column value heuristics​

1. Column name heuristics (clinical-domain aware)​

2. targetType fallback​

Foreign key resolution​

Output Format​

Reproducibility​

Plugin Identity​

Advanced: Building a Superset Dashboard​

See Also​