feat(identifier): denormalize base_type onto identifier table#35164

Open
wezell wants to merge 11 commits into main from feature/identifier-base-type-data-model

Conversation

@wezell
Member

@wezell wezell commented Mar 31, 2026

Summary

  • Adds base_type INT4 column + idx_identifier_base_type index to the identifier table, mirroring structure.structuretype
  • Identifier bean, IdentifierFactoryImpl, and IdentifierTransformer updated to read/write base_type on every save
  • Task260331AddBaseTypeColumnToIdentifier adds the column and index at startup (DDL only — fast), then schedules the backfill job
  • PopulateIdentifierBaseTypeJob / PopulateIdentifierBaseTypeUtil backfill existing rows in batches of 500 (250 ms sleep between batches), fire immediately at startup, and self-unschedule when complete
  • postgres.sql updated for fresh installs
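The batch-and-sleep loop described above can be sketched in a few lines. This is an illustrative driver, not the actual PopulateIdentifierBaseTypeUtil code; the batch supplier stands in for the real SQL UPDATE:

```java
import java.util.function.IntSupplier;

/** Minimal sketch of the batched backfill loop (names are illustrative,
 *  not the dotCMS implementation). */
public class BackfillSketch {
    static final int BATCH_SIZE = 500;
    static final long SLEEP_MS = 250;

    /** Runs batches until one returns 0 updated rows; returns total rows updated. */
    static int populate(IntSupplier processBatch) throws InterruptedException {
        int total = 0;
        while (true) {
            int updated = processBatch.getAsInt(); // one limited UPDATE batch
            if (updated == 0) return total;        // nothing left: migration done
            total += updated;
            Thread.sleep(SLEEP_MS);                // yield to reduce lock pressure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] remaining = {1200};                  // pretend 1,200 rows need backfill
        int total = populate(() -> {
            int n = Math.min(BATCH_SIZE, remaining[0]);
            remaining[0] -= n;
            return n;
        });
        System.out.println(total);                 // 1200
    }
}
```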

Test plan

  • Task260331AddBaseTypeColumnToIdentifierTest — column/index created, idempotent, correct metadata
  • PopulateIdentifierBaseTypeUtilTest — sets correct base type per content type, ignores folders (null asset_subtype), returns 0 on second run
  • PopulateIdentifierBaseTypeJobTest — schedules, skips when no work pending, removes itself
  • IdentifierFactoryTest (5 new methods) — base_type persisted for CONTENT and FILEASSET, survives re-save, transformer maps it, folders get null

fixes #35162

…ry performance

Adds a base_type (INT4) column to the identifier table that mirrors
structure.structuretype, eliminating structure joins from hot-path queries.

- Identifier bean, factory, transformer updated to read/write base_type
- base_type set automatically on createNewIdentifier for all contentlet types
- Task260331AddBaseTypeColumnToIdentifier: adds column + index at startup
- PopulateIdentifierBaseTypeJob/Util: backfills existing rows in batches of
  500 with 250ms sleep between batches to avoid lock pressure on large tables;
  fires immediately at startup, self-unschedules when migration is complete
- postgres.sql updated with column and index for fresh installs
- Integration tests for startup task, Quartz job, util, and IdentifierFactory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Contributor

claude bot commented Mar 31, 2026

Claude finished @wezell's task in 3m 11s


PR Review: feat(identifier): denormalize base_type onto identifier table


3 issues worth a second look:

1. saveIdentifier mutates the caller's Identifier object as a side effect

IdentifierFactoryImpl.java:537

id.setBaseType(resolvedBaseType);

The stale-cache guard calls id.setBaseType() on the Identifier reference passed into saveIdentifier(). Callers don't expect saveIdentifier to modify their object's baseType. If the same instance is held elsewhere (e.g., in a local variable the caller reads after the save), it will silently have a different value. The mutation should either be documented on the method signature or avoided by returning the resolved value without touching the argument.
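One way to keep the guard side-effect free is to resolve the value into a local and bind that to the SQL, never touching the caller's bean. A minimal sketch of that shape; the Identifier record here is a stand-in, not the dotCMS class:

```java
import java.util.function.Supplier;

/** Sketch: resolve base_type into a local for the SQL bind instead of
 *  mutating the caller's Identifier. Names are illustrative. */
public class NoMutationSketch {
    record Identifier(String id, Integer baseType) {}

    static Integer baseTypeForSave(Identifier id, Supplier<Integer> contentTypeLookup) {
        // stale-cache guard: only consult the content type when the bean has no value
        return id.baseType() != null ? id.baseType() : contentTypeLookup.get();
    }

    public static void main(String[] args) {
        Identifier stale = new Identifier("abc", null);
        Integer bound = baseTypeForSave(stale, () -> 1); // 1 = CONTENT, illustratively
        System.out.println(bound);                       // 1
        System.out.println(stale.baseType());            // still null: caller unaffected
    }
}
```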



2. populate() blocks its Quartz thread until the entire migration completes

PopulateIdentifierBaseTypeUtil.java:80–103

populate() loops until zero rows remain, sleeping 250 ms between 500-row batches. For a customer with 5M content identifiers, that's ~10,000 batches × ~250 ms sleep = 41+ minutes holding one Quartz worker thread. If the thread pool is small and this job fires at startup alongside other jobs, it can starve them.

The HOURS_INTERVAL config (default 4 h) only matters for interrupted runs; a complete run never yields the thread mid-way. Either document this clearly as a known constraint, or split the loop so each Quartz execution processes one batch and reschedules itself, letting the scheduler manage the cadence.
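The one-batch-per-firing alternative can be sketched like this; the names are illustrative, not Quartz APIs, and the scheduler's trigger is simulated by the loop in main:

```java
import java.util.function.IntSupplier;

/** Sketch of the alternative design: each job execution processes a single
 *  batch and the scheduler drives the cadence, so no worker thread is held
 *  for the whole migration. */
public class OneBatchPerFireSketch {
    /** Returns true when the job should stay scheduled (work remains). */
    static boolean executeOnce(IntSupplier processBatch) {
        int updated = processBatch.getAsInt(); // single 500-row batch
        return updated > 0;                    // false: the job can unschedule itself
    }

    public static void main(String[] args) {
        int[] remaining = {1100};
        int fires = 0;
        // the real scheduler would call executeOnce on each trigger; simulated here
        while (executeOnce(() -> {
            int n = Math.min(500, remaining[0]);
            remaining[0] -= n;
            return n;
        })) {
            fires++;
        }
        System.out.println(fires); // 3 non-empty batches before the empty one
    }
}
```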


3. processBatch() acquires a raw connection without explicit rollback on exception

PopulateIdentifierBaseTypeUtil.java:116–124

try (Connection conn = DbConnectionFactory.getDataSource().getConnection()) {
    conn.setAutoCommit(false);
    int result = new DotConnect().executeUpdate(conn, UPDATE_BATCH);
    conn.commit();
    return result;
} catch (final Exception e) {
    throw new DotRuntimeException(...);
}

On exception, the try-with-resources closes the connection without explicitly rolling back. Most JDBC drivers and pools will roll back uncommitted transactions on close, but it's not guaranteed by the spec and behavior varies across pool implementations. An explicit conn.rollback() in the catch before re-throwing makes the intent clear and is defensive against pool misbehavior.
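The defensive shape the review asks for looks roughly like this. It is a sketch against plain JDBC, not the actual DotConnect call; the Runnable stands in for the batched UPDATE:

```java
import java.sql.Connection;

/** Sketch: explicit rollback before rethrowing, rather than relying on
 *  close() to discard the uncommitted transaction. */
public class ExplicitRollbackSketch {
    static void runBatch(Connection conn, Runnable update) throws Exception {
        try {
            conn.setAutoCommit(false);
            update.run();        // the batched UPDATE would execute here
            conn.commit();
        } catch (Exception e) {
            conn.rollback();     // make the intent explicit; don't trust close()
            throw new RuntimeException("batch failed", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // a tracking proxy stands in for a pooled connection
        boolean[] rolledBack = {false};
        Connection conn = (Connection) java.lang.reflect.Proxy.newProxyInstance(
            Connection.class.getClassLoader(), new Class<?>[]{Connection.class},
            (p, m, a) -> { if (m.getName().equals("rollback")) rolledBack[0] = true; return null; });
        try {
            runBatch(conn, () -> { throw new IllegalStateException("boom"); });
        } catch (RuntimeException expected) { /* rethrown after rollback */ }
        System.out.println(rolledBack[0]); // true
    }
}
```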



Minor observations (not blocking):

  • BATCH_SIZE is a static final initialized at class load from Config.getIntProperty(). If the config system isn't fully bootstrapped when the class is first loaded, you'll silently get the default 500. Consider lazy initialization (Lazy.of(...)) consistent with how HOURS_INTERVAL is handled in PopulateIdentifierBaseTypeJob.
  • The test_populate_correctlyMapsMultipleBaseTypes test calls DbConnectionFactory.getConnection().setAutoCommit(true) directly. This can interfere with other tests' transaction management if they share the same connection. Check whether this is actually needed or can be removed.
  • hasPendingRows() is called twice in a single run() execution (lines 55 and 67). Each is a DB round-trip. On busy clusters this is fine, but it's worth noting the result between the two calls is not guaranteed consistent (a concurrent save could fill the last rows between them). The existing comment on line 66 implicitly acknowledges this — just confirming the logic is still correct under that race.
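On the first observation: dotCMS's Lazy.of comes from vavr, but the memoization it provides can be sketched without the dependency. The Config lookup is stood in by the supplier body; everything here is illustrative:

```java
import java.util.function.Supplier;

/** Sketch of lazy config resolution: the property is read on first use,
 *  after the config system has bootstrapped, not at class load. */
public class LazyConfigSketch {
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            T value; boolean done;
            public synchronized T get() {
                if (!done) { value = delegate.get(); done = true; }
                return value;
            }
        };
    }

    // stands in for Config.getIntProperty("...", 500), resolved lazily
    static final Supplier<Integer> BATCH_SIZE = memoize(() -> 500);

    public static void main(String[] args) {
        System.out.println(BATCH_SIZE.get()); // 500
    }
}
```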

Overall the design is sound — the DDL-only startup task, batched backfill job with self-unscheduling, and stale-cache guard are all solid patterns for this kind of denormalization.

wezell and others added 3 commits March 31, 2026 12:34
…-migration

Previously forceRun() returned false once idx_identifier_base_type existed,
permanently abandoning millions of NULL base_type rows if the server restarted
mid-backfill.

Fix: also return true when hasPendingRows() detects identifier rows with
null base_type and non-null asset_subtype. DDL uses IF NOT EXISTS so
re-running executeUpgrade() is always safe.

Tests: replace incorrect false-when-index-exists assertion with two cases:
- returns false only when index exists AND no pending rows remain
- returns true when index exists but backfill is incomplete (restart scenario)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n is confirmed complete

Previously removeJob() was called unconditionally after populate(), which caused
the job to unschedule itself even when interrupted mid-run with rows still pending.
Now hasPendingRows() is checked after populate() and removeJob() is only called
when the backfill is confirmed finished.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…startup errors

1. hasPendingRows() now joins structure so orphaned identifiers (asset_subtype with
   no matching structure entry) cannot keep the backfill job running indefinitely.
   The populate loop already skips them via inner join; the sentinel must match.

2. forceRun() now returns true on DotDataException instead of false so a transient
   DB error during startup does not permanently skip the migration. executeUpgrade()
   is idempotent (IF NOT EXISTS) so re-running is always safe.

3. Added integration test covering the orphaned-row case for hasPendingRows().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
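The forceRun() decision this commit describes reduces to a small predicate: run unless the index exists and the backfill is confirmed complete, and bias toward running on any DB error since executeUpgrade() is idempotent. A sketch with illustrative names:

```java
import java.util.concurrent.Callable;

/** Sketch of the restart-safe forceRun() logic described in the commit. */
public class ForceRunSketch {
    static boolean forceRun(Callable<Boolean> indexExists, Callable<Boolean> pendingRows) {
        try {
            // skip only when the index exists AND no pending rows remain
            return !indexExists.call() || pendingRows.call();
        } catch (Exception e) {
            return true; // transient DB error: safe to re-run, never skip forever
        }
    }

    public static void main(String[] args) {
        System.out.println(forceRun(() -> true, () -> false)); // false: migration done
        System.out.println(forceRun(() -> true, () -> true));  // true: restarted mid-backfill
        System.out.println(forceRun(() -> { throw new RuntimeException(); }, () -> false)); // true
    }
}
```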
dotCMS deleted a comment from github-actions bot Mar 31, 2026
wezell and others added 7 commits March 31, 2026 14:34
- Make hasPendingRows() public — called from Task260331 in a different package
- Fix DotConnect.executeUpdate() arg order: connection comes first, not second
- Remove unused @WrapInTransaction import (replaced by manual connection management)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entifier()

The ContentTypeAPI lookup that resolves a missing base_type was running on every
saveIdentifier() call regardless of whether it was an INSERT or UPDATE. INSERT
callers always provide base_type (set in createNewIdentifier), so the guard was
only ever needed for UPDATE — where a stale cached Identifier loaded before the
backfill ran could write null back over a freshly populated value.

Moving the guard inside the isIdentifier==true branch eliminates the API+DB call
on all INSERT paths, making bulk operations (imports, publishing pipelines) that
create many new identifiers unaffected during the migration window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
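The INSERT-vs-UPDATE split this commit describes can be sketched as a small decision function; types and names are stand-ins, not the IdentifierFactoryImpl code:

```java
import java.util.function.Supplier;

/** Sketch: the content-type lookup only runs on the UPDATE path, since
 *  INSERTs always arrive with base_type set by createNewIdentifier. */
public class UpdateGuardSketch {
    static Integer baseTypeToWrite(boolean isUpdate, Integer beanValue, Supplier<Integer> lookup) {
        if (isUpdate && beanValue == null) {
            return lookup.get(); // stale cached bean: re-resolve instead of writing null
        }
        return beanValue;        // INSERT path never pays the API+DB lookup cost
    }

    public static void main(String[] args) {
        // the lookup throwing proves the INSERT path never touches it
        System.out.println(baseTypeToWrite(false, 1,
            () -> { throw new AssertionError("no lookup on insert"); })); // 1
        System.out.println(baseTypeToWrite(true, null, () -> 4));         // 4
    }
}
```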
…up tasks

Task260403SetPermissionReferenceUnlogged:
- Converts permission_reference to UNLOGGED via ALTER TABLE SET UNLOGGED
- Table is a regenerable permission cache; crash-safety is not required
- Eliminates WAL write cost on every permission rebuild INSERT/DELETE
- forceRun() checks pg_class.relpersistence to skip if already unlogged

Task260403SetLz4CompressionOnTextColumns:
- Sets LZ4 compression on all text/bytea/jsonb/json columns in public schema
- Queries pg_attribute dynamically — no hardcoded column list, covers plugins
- Skips columns already using LZ4 (attcompression='l') for idempotency
- Continues on per-column failure so one bad column does not block all others
- Gracefully skips on PostgreSQL < 14 (attcompression column absent)
- LZ4 decompresses 3-5x faster than pglz; affects future writes only

postgres.sql:
- permission_reference CREATE TABLE now uses UNLOGGED
- LZ4 SET COMPRESSION block added for all eligible columns (fresh installs)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
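The per-column statement the LZ4 task would emit uses the PostgreSQL 14+ SET COMPRESSION syntax; the builder and the table/column names below are illustrative, not the task's actual code:

```java
/** Sketch of the statement the LZ4 startup task issues per eligible column. */
public class Lz4Sketch {
    static String alterCompression(String table, String column) {
        // PostgreSQL 14+ only; affects future writes, existing rows keep pglz
        return "ALTER TABLE " + table + " ALTER COLUMN " + column + " SET COMPRESSION lz4";
    }

    public static void main(String[] args) {
        System.out.println(alterCompression("identifier", "asset_name"));
    }
}
```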

Labels

AI: Safe To Rollback · Area: Backend (PR changes Java/Maven backend code)
