feat(identifier): denormalize base_type onto identifier table#35164

Open
wezell wants to merge 11 commits into main from feature/identifier-base-type-data-model

Conversation

@wezell
Member

@wezell wezell commented Mar 31, 2026

Summary

  • Adds base_type INT4 column + idx_identifier_base_type index to the identifier table, mirroring structure.structuretype
  • Identifier bean, IdentifierFactoryImpl, and IdentifierTransformer updated to read/write base_type on every save
  • Task260331AddBaseTypeColumnToIdentifier adds the column and index at startup (DDL only — fast), then schedules the backfill job
  • PopulateIdentifierBaseTypeJob / PopulateIdentifierBaseTypeUtil backfill existing rows in batches of 500 (250 ms sleep between batches), fire immediately at startup, and self-unschedule when complete
  • postgres.sql updated for fresh installs
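The batch-and-sleep loop described above can be sketched in a few lines. This is an illustrative driver, not the actual PopulateIdentifierBaseTypeUtil code; the batch supplier stands in for the real SQL UPDATE:

```java
import java.util.function.IntSupplier;

/** Minimal sketch of the batched backfill loop (names are illustrative,
 *  not the dotCMS implementation). */
public class BackfillSketch {
    static final int BATCH_SIZE = 500;
    static final long SLEEP_MS = 250;

    /** Runs batches until one returns 0 updated rows; returns total rows updated. */
    static int populate(IntSupplier processBatch) throws InterruptedException {
        int total = 0;
        while (true) {
            int updated = processBatch.getAsInt(); // one limited UPDATE batch
            if (updated == 0) return total;        // nothing left: migration done
            total += updated;
            Thread.sleep(SLEEP_MS);                // yield to reduce lock pressure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] remaining = {1200};                  // pretend 1,200 rows need backfill
        int total = populate(() -> {
            int n = Math.min(BATCH_SIZE, remaining[0]);
            remaining[0] -= n;
            return n;
        });
        System.out.println(total);                 // 1200
    }
}
```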

Test plan

  • Task260331AddBaseTypeColumnToIdentifierTest — column/index created, idempotent, correct metadata
  • PopulateIdentifierBaseTypeUtilTest — sets correct base type per content type, ignores folders (null asset_subtype), returns 0 on second run
  • PopulateIdentifierBaseTypeJobTest — schedules, skips when no work pending, removes itself
  • IdentifierFactoryTest (5 new methods) — base_type persisted for CONTENT and FILEASSET, survives re-save, transformer maps it, folders get null

fixes #35162

…ry performance

Adds a base_type (INT4) column to the identifier table that mirrors
structure.structuretype, eliminating structure joins from hot-path queries.

- Identifier bean, factory, transformer updated to read/write base_type
- base_type set automatically on createNewIdentifier for all contentlet types
- Task260331AddBaseTypeColumnToIdentifier: adds column + index at startup
- PopulateIdentifierBaseTypeJob/Util: backfills existing rows in batches of
  500 with 250ms sleep between batches to avoid lock pressure on large tables;
  fires immediately at startup, self-unschedules when migration is complete
- postgres.sql updated with column and index for fresh installs
- Integration tests for startup task, Quartz job, util, and IdentifierFactory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Contributor

claude bot commented Mar 31, 2026

Claude finished @wezell's task in 3m 11s


PR Review: feat(identifier): denormalize base_type onto identifier table


3 issues worth a second look:

1. saveIdentifier mutates the caller's Identifier object as a side effect

IdentifierFactoryImpl.java:537

id.setBaseType(resolvedBaseType);

The stale-cache guard calls id.setBaseType() on the Identifier reference passed into saveIdentifier(). Callers don't expect saveIdentifier to modify their object's baseType. If the same instance is held elsewhere (e.g., in a local variable the caller reads after the save), it will silently have a different value. The mutation should either be documented on the method signature or avoided by returning the resolved value without touching the argument.
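One way to keep the guard side-effect free is to resolve the value into a local and bind that to the SQL, never touching the caller's bean. A minimal sketch of that shape; the Identifier record here is a stand-in, not the dotCMS class:

```java
import java.util.function.Supplier;

/** Sketch: resolve base_type into a local for the SQL bind instead of
 *  mutating the caller's Identifier. Names are illustrative. */
public class NoMutationSketch {
    record Identifier(String id, Integer baseType) {}

    static Integer baseTypeForSave(Identifier id, Supplier<Integer> contentTypeLookup) {
        // stale-cache guard: only consult the content type when the bean has no value
        return id.baseType() != null ? id.baseType() : contentTypeLookup.get();
    }

    public static void main(String[] args) {
        Identifier stale = new Identifier("abc", null);
        Integer bound = baseTypeForSave(stale, () -> 1); // 1 = CONTENT, illustratively
        System.out.println(bound);                       // 1
        System.out.println(stale.baseType());            // still null: caller unaffected
    }
}
```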



2. populate() blocks its Quartz thread until the entire migration completes

PopulateIdentifierBaseTypeUtil.java:80–103

populate() loops until zero rows remain, sleeping 250 ms between 500-row batches. For a customer with 5M content identifiers, that's ~10,000 batches × ~250 ms sleep = 41+ minutes holding one Quartz worker thread. If the thread pool is small and this job fires at startup alongside other jobs, it can starve them.

The HOURS_INTERVAL config (default 4 h) only matters for interrupted runs; a complete run never yields the thread mid-way. Either document this clearly as a known constraint, or split the loop so each Quartz execution processes one batch and reschedules itself, letting the scheduler manage the cadence.
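The one-batch-per-firing alternative can be sketched like this; the names are illustrative, not Quartz APIs, and the scheduler's trigger is simulated by the loop in main:

```java
import java.util.function.IntSupplier;

/** Sketch of the alternative design: each job execution processes a single
 *  batch and the scheduler drives the cadence, so no worker thread is held
 *  for the whole migration. */
public class OneBatchPerFireSketch {
    /** Returns true when the job should stay scheduled (work remains). */
    static boolean executeOnce(IntSupplier processBatch) {
        int updated = processBatch.getAsInt(); // single 500-row batch
        return updated > 0;                    // false: the job can unschedule itself
    }

    public static void main(String[] args) {
        int[] remaining = {1100};
        int fires = 0;
        // the real scheduler would call executeOnce on each trigger; simulated here
        while (executeOnce(() -> {
            int n = Math.min(500, remaining[0]);
            remaining[0] -= n;
            return n;
        })) {
            fires++;
        }
        System.out.println(fires); // 3 non-empty batches before the empty one
    }
}
```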


3. processBatch() acquires a raw connection without explicit rollback on exception

PopulateIdentifierBaseTypeUtil.java:116–124

try (Connection conn = DbConnectionFactory.getDataSource().getConnection()) {
    conn.setAutoCommit(false);
    int result = new DotConnect().executeUpdate(conn, UPDATE_BATCH);
    conn.commit();
    return result;
} catch (final Exception e) {
    throw new DotRuntimeException(...);
}

On exception, the try-with-resources closes the connection without explicitly rolling back. Most JDBC drivers and pools will roll back uncommitted transactions on close, but it's not guaranteed by the spec and behavior varies across pool implementations. An explicit conn.rollback() in the catch before re-throwing makes the intent clear and is defensive against pool misbehavior.
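The defensive shape the review asks for looks roughly like this. It is a sketch against plain JDBC, not the actual DotConnect call; the Runnable stands in for the batched UPDATE:

```java
import java.sql.Connection;

/** Sketch: explicit rollback before rethrowing, rather than relying on
 *  close() to discard the uncommitted transaction. */
public class ExplicitRollbackSketch {
    static void runBatch(Connection conn, Runnable update) throws Exception {
        try {
            conn.setAutoCommit(false);
            update.run();        // the batched UPDATE would execute here
            conn.commit();
        } catch (Exception e) {
            conn.rollback();     // make the intent explicit; don't trust close()
            throw new RuntimeException("batch failed", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // a tracking proxy stands in for a pooled connection
        boolean[] rolledBack = {false};
        Connection conn = (Connection) java.lang.reflect.Proxy.newProxyInstance(
            Connection.class.getClassLoader(), new Class<?>[]{Connection.class},
            (p, m, a) -> { if (m.getName().equals("rollback")) rolledBack[0] = true; return null; });
        try {
            runBatch(conn, () -> { throw new IllegalStateException("boom"); });
        } catch (RuntimeException expected) { /* rethrown after rollback */ }
        System.out.println(rolledBack[0]); // true
    }
}
```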



Minor observations (not blocking):

  • BATCH_SIZE is a static final initialized at class load from Config.getIntProperty(). If the config system isn't fully bootstrapped when the class is first loaded, you'll silently get the default 500. Consider lazy initialization (Lazy.of(...)) consistent with how HOURS_INTERVAL is handled in PopulateIdentifierBaseTypeJob.
  • The test_populate_correctlyMapsMultipleBaseTypes test calls DbConnectionFactory.getConnection().setAutoCommit(true) directly. This can interfere with other tests' transaction management if they share the same connection. Check whether this is actually needed or can be removed.
  • hasPendingRows() is called twice in a single run() execution (lines 55 and 67). Each is a DB round-trip. On busy clusters this is fine, but it's worth noting the result between the two calls is not guaranteed consistent (a concurrent save could fill the last rows between them). The existing comment on line 66 implicitly acknowledges this — just confirming the logic is still correct under that race.
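On the first observation: dotCMS's Lazy.of comes from vavr, but the memoization it provides can be sketched without the dependency. The Config lookup is stood in by the supplier body; everything here is illustrative:

```java
import java.util.function.Supplier;

/** Sketch of lazy config resolution: the property is read on first use,
 *  after the config system has bootstrapped, not at class load. */
public class LazyConfigSketch {
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            T value; boolean done;
            public synchronized T get() {
                if (!done) { value = delegate.get(); done = true; }
                return value;
            }
        };
    }

    // stands in for Config.getIntProperty("...", 500), resolved lazily
    static final Supplier<Integer> BATCH_SIZE = memoize(() -> 500);

    public static void main(String[] args) {
        System.out.println(BATCH_SIZE.get()); // 500
    }
}
```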

Overall the design is sound — the DDL-only startup task, batched backfill job with self-unscheduling, and stale-cache guard are all solid patterns for this kind of denormalization.

wezell and others added 3 commits March 31, 2026 12:34
…-migration

Previously forceRun() returned false once idx_identifier_base_type existed,
permanently abandoning millions of NULL base_type rows if the server restarted
mid-backfill.

Fix: also return true when hasPendingRows() detects identifier rows with
null base_type and non-null asset_subtype. DDL uses IF NOT EXISTS so
re-running executeUpgrade() is always safe.

Tests: replace incorrect false-when-index-exists assertion with two cases:
- returns false only when index exists AND no pending rows remain
- returns true when index exists but backfill is incomplete (restart scenario)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n is confirmed complete

Previously removeJob() was called unconditionally after populate(), which caused
the job to unschedule itself even when interrupted mid-run with rows still pending.
Now hasPendingRows() is checked after populate() and removeJob() is only called
when the backfill is confirmed finished.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…startup errors

1. hasPendingRows() now joins structure so orphaned identifiers (asset_subtype with
   no matching structure entry) cannot keep the backfill job running indefinitely.
   The populate loop already skips them via inner join; the sentinel must match.

2. forceRun() now returns true on DotDataException instead of false so a transient
   DB error during startup does not permanently skip the migration. executeUpgrade()
   is idempotent (IF NOT EXISTS) so re-running is always safe.

3. Added integration test covering the orphaned-row case for hasPendingRows().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
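The forceRun() decision this commit describes reduces to a small predicate: run unless the index exists and the backfill is confirmed complete, and bias toward running on any DB error since executeUpgrade() is idempotent. A sketch with illustrative names:

```java
import java.util.concurrent.Callable;

/** Sketch of the restart-safe forceRun() logic described in the commit. */
public class ForceRunSketch {
    static boolean forceRun(Callable<Boolean> indexExists, Callable<Boolean> pendingRows) {
        try {
            // skip only when the index exists AND no pending rows remain
            return !indexExists.call() || pendingRows.call();
        } catch (Exception e) {
            return true; // transient DB error: safe to re-run, never skip forever
        }
    }

    public static void main(String[] args) {
        System.out.println(forceRun(() -> true, () -> false)); // false: migration done
        System.out.println(forceRun(() -> true, () -> true));  // true: restarted mid-backfill
        System.out.println(forceRun(() -> { throw new RuntimeException(); }, () -> false)); // true
    }
}
```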
dotCMS deleted a comment from github-actions bot Mar 31, 2026
wezell and others added 7 commits March 31, 2026 14:34
- Make hasPendingRows() public — called from Task260331 in a different package
- Fix DotConnect.executeUpdate() arg order: connection comes first, not second
- Remove unused @WrapInTransaction import (replaced by manual connection management)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…entifier()

The ContentTypeAPI lookup that resolves a missing base_type was running on every
saveIdentifier() call regardless of whether it was an INSERT or UPDATE. INSERT
callers always provide base_type (set in createNewIdentifier), so the guard was
only ever needed for UPDATE — where a stale cached Identifier loaded before the
backfill ran could write null back over a freshly populated value.

Moving the guard inside the isIdentifier==true branch eliminates the API+DB call
on all INSERT paths, making bulk operations (imports, publishing pipelines) that
create many new identifiers unaffected during the migration window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
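The INSERT-vs-UPDATE split this commit describes can be sketched as a small decision function; types and names are stand-ins, not the IdentifierFactoryImpl code:

```java
import java.util.function.Supplier;

/** Sketch: the content-type lookup only runs on the UPDATE path, since
 *  INSERTs always arrive with base_type set by createNewIdentifier. */
public class UpdateGuardSketch {
    static Integer baseTypeToWrite(boolean isUpdate, Integer beanValue, Supplier<Integer> lookup) {
        if (isUpdate && beanValue == null) {
            return lookup.get(); // stale cached bean: re-resolve instead of writing null
        }
        return beanValue;        // INSERT path never pays the API+DB lookup cost
    }

    public static void main(String[] args) {
        // the lookup throwing proves the INSERT path never touches it
        System.out.println(baseTypeToWrite(false, 1,
            () -> { throw new AssertionError("no lookup on insert"); })); // 1
        System.out.println(baseTypeToWrite(true, null, () -> 4));         // 4
    }
}
```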
…up tasks

Task260403SetPermissionReferenceUnlogged:
- Converts permission_reference to UNLOGGED via ALTER TABLE SET UNLOGGED
- Table is a regenerable permission cache; crash-safety is not required
- Eliminates WAL write cost on every permission rebuild INSERT/DELETE
- forceRun() checks pg_class.relpersistence to skip if already unlogged

Task260403SetLz4CompressionOnTextColumns:
- Sets LZ4 compression on all text/bytea/jsonb/json columns in public schema
- Queries pg_attribute dynamically — no hardcoded column list, covers plugins
- Skips columns already using LZ4 (attcompression='l') for idempotency
- Continues on per-column failure so one bad column does not block all others
- Gracefully skips on PostgreSQL < 14 (attcompression column absent)
- LZ4 decompresses 3-5x faster than pglz; affects future writes only

postgres.sql:
- permission_reference CREATE TABLE now uses UNLOGGED
- LZ4 SET COMPRESSION block added for all eligible columns (fresh installs)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
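The per-column statement the LZ4 task would emit uses the PostgreSQL 14+ SET COMPRESSION syntax; the builder and the table/column names below are illustrative, not the task's actual code:

```java
/** Sketch of the statement the LZ4 startup task issues per eligible column. */
public class Lz4Sketch {
    static String alterCompression(String table, String column) {
        // PostgreSQL 14+ only; affects future writes, existing rows keep pglz
        return "ALTER TABLE " + table + " ALTER COLUMN " + column + " SET COMPRESSION lz4";
    }

    public static void main(String[] args) {
        System.out.println(alterCompression("identifier", "asset_name"));
    }
}
```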

Labels

AI: Safe To Rollback · Area: Backend (PR changes Java/Maven backend code)
