Add gene ID utils and refactor eFP expression endpoint #305

Open
rmobmina wants to merge 37 commits into BioAnalyticResource:dev from rmobmina:fix/refactor-efp-data-and-schemas
@rmobmina

  • Adds api/utils/gene_id_utils.py

    • Introduces DATABASE_SPECIES lookup table mapping all eFP databases to their species
    • Adds helper functions for gene ID validation, normalization, and AGI to probeset conversion
  • Refactors api/resources/gene_expression.py

    • Uses shared utilities for gene ID handling
    • Validates gene ID format using species-specific regex rules
    • Strips maize transcript suffixes when needed
    • Converts Arabidopsis AGI IDs to ATH1 probeset IDs before querying
  • Adds constants for special database behavior

    • PROBESET_DATABASES - databases that store probeset IDs
    • CROSS_SPECIES_DATABASES - databases that accept Arabidopsis AGI input even if the database species differs (e.g. phelipanche, striga)
  • Updates .flake8

    • Suppresses E241 to allow aligned formatting in DATABASE_SPECIES
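
The validation and normalization flow described above could be sketched roughly as follows. This is not the code from `api/utils/gene_id_utils.py`: the database names are placeholders, the regex patterns are illustrative guesses at common Arabidopsis and maize ID formats, and `normalize_gene_id` is a hypothetical helper name.

```python
import re

# Hypothetical excerpt: placeholder database names, not the real lookup table.
DATABASE_SPECIES = {
    "example_arabidopsis_db": "arabidopsis",
    "example_maize_db":       "maize",
}

# Illustrative patterns only; the PR's species-specific rules may differ.
GENE_ID_PATTERNS = {
    "arabidopsis": re.compile(r"AT[1-5CM]G\d{5}", re.IGNORECASE),
    "maize": re.compile(r"Zm\d{5}[a-z]{1,2}\d{6}|GRMZM\dG\d{6}", re.IGNORECASE),
}

# Assumed form of a maize transcript suffix, e.g. "_T001".
MAIZE_TRANSCRIPT_SUFFIX = re.compile(r"_T\d+$", re.IGNORECASE)


def normalize_gene_id(gene_id, database):
    """Validate a gene ID against its database's species and return it normalized."""
    species = DATABASE_SPECIES.get(database)
    if species is None:
        raise ValueError(f"Unknown database: {database}")

    gene_id = gene_id.strip()
    if species == "maize":
        # Drop the transcript suffix so the gene-level ID is what gets queried.
        gene_id = MAIZE_TRANSCRIPT_SUFFIX.sub("", gene_id)

    if not GENE_ID_PATTERNS[species].fullmatch(gene_id):
        raise ValueError(f"Invalid {species} gene ID: {gene_id}")

    # AGI IDs are conventionally upper case; other IDs are left as given.
    return gene_id.upper() if species == "arabidopsis" else gene_id
```

Keeping the species lookup and the patterns in plain dictionaries means a new database only needs one table entry, which seems to be the motivation for the aligned `DATABASE_SPECIES` formatting.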

asherpasha and others added 30 commits March 23, 2026 13:16
- Replace manual REGEX with BARUtils validators for all species
- Add descriptive error messages for each species (e.g., "Invalid Cannabis gene ID")
- Add comprehensive Sphinx/reST docstrings to all ORM files
- Improve validation logic to handle both AGI and probeset IDs correctly
- Fix SQL injection vulnerability by validating schema identifiers
- Add identifier validation (alphanumeric + underscore only) before SQL construction

Security: Addresses CodeQL high-severity SQL injection alert by validating that
all database/table/column identifiers match a safe pattern before use in queries
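
A minimal sketch of the identifier check described in this commit, assuming the rule is exactly "alphanumeric plus underscore". The function names and the query shape are hypothetical and are not taken from the PR.

```python
import re

# Only letters, digits, and underscores are accepted as schema identifiers.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def validate_identifier(name):
    """Return the identifier unchanged if it is safe, otherwise raise ValueError."""
    if not isinstance(name, str) or not _SAFE_IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name


def build_select(database, table, column, key_column):
    """Build a query string from validated identifiers (hypothetical helper).

    Identifiers cannot be bound as parameters, so they are validated and
    interpolated; the looked-up value stays a bound parameter (:value).
    """
    db, tbl, col, key = (
        validate_identifier(x) for x in (database, table, column, key_column)
    )
    return f"SELECT {col} FROM {db}.{tbl} WHERE {key} = :value"
```

`fullmatch` is used rather than `match` with a `$` anchor because `$` would also accept a trailing newline.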
- Fix trailing whitespace and blank line issues in efp_proxy.py
- Fix missing blank lines in microarray_gene_expression.py
- Add noqa comments for E402 in test_efp_data.py (imports after sys.path modification)
- Remove unused variable retry_url
- Fix spacing after commas in long list
- Fix indentation issues
- Fix syntax error in efp_data.py (incomplete regex pattern)
- Refactor efp_data.py to use class-based static methods (EFPDataService)
  following the BARUtils pattern for consistency
- Add backward compatibility wrappers for existing function imports
- Simplify efp_schemas.py by grouping similar schemas:
  * Created _simple_schema() helper for 5-column schemas
  * Created _schema_with_qa_columns() helper for QA column schemas
  * Reduced code duplication while maintaining same functionality
- All changes tested with Docker build and Python syntax checks
- Updated _schema_with_qa_columns to properly handle length for string types
- Added file_name_len and call_len parameters with defaults
- Fixed shoot_apex schema to specify lengths for string columns (16, 2)
- Fixes KeyError: 'length' in efp_dynamic.py model generation
- Fixes bootstrap script MySQL connection in Docker
- Bootstrap script now connects to BAR_mysqldb instead of localhost socket
- API starts successfully and serves on port 5000
- Check HTTP status code before parsing JSON response
- Return 502 Bad Gateway when external API fails or returns non-200
- Catch JSONDecodeError when external API returns invalid response
- Fixes test failure when atted.jp API is down or returns 403
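
The error handling in this commit might look something like the sketch below. It takes the status code and raw body as arguments so it stays independent of any HTTP client; the function name and the response dictionary keys are assumptions, not the API's actual format.

```python
import json


def parse_upstream_response(status_code, body):
    """Turn an external API reply into a (payload, http_status) pair."""
    # Check the status code before attempting to parse the body.
    if status_code != 200:
        return {"error": f"External API returned status {status_code}"}, 502

    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # A 200 response can still carry a non-JSON body (e.g. an HTML error page).
        return {"error": "External API returned an invalid response"}, 502

    return {"data": data}, 200
```

Mapping both failure modes to 502 tells the caller that the fault lies with the upstream service rather than with their request.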
MySQL doesn't allow TEXT/BLOB columns in primary keys without a key length.
When bot_id_type='text' or probeset_type='text' is specified, the column
type changes to TEXT but was still marked as primary_key=True, causing:
  OperationalError: (1170, "BLOB/TEXT column 'data_bot_id' used in key
  specification without a key length")

Changes:
- In _simple_schema(), when probeset_type='text', set primary_key=False
- In _simple_schema(), when bot_id_type='text', set primary_key=False
- Ensures data_signal remains a primary key for all databases

Affected databases (11 total):
- affydb, canola, canola_original, canola_seed
- hnahal, human, humandb
- meristemdb, rohan, rpatel, tomato_atlas

All 58 tests pass with 5,643 subtests.
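
The primary-key rule from this commit can be illustrated with a small sketch. The dictionary layout and the `column_spec` helper are hypothetical; the real `_simple_schema()` in `efp_schemas.py` may be structured differently.

```python
def column_spec(name, col_type, length=None, primary_key=True):
    """Describe one column as a plain dict (hypothetical layout)."""
    if col_type == "text":
        # MySQL rejects TEXT/BLOB columns in a key without a key length
        # (error 1170), so TEXT columns are never part of the primary key.
        return {"name": name, "type": "text", "primary_key": False}

    if col_type == "string":
        if length is None:
            # String columns always need a length.
            raise ValueError(f"String column {name!r} needs a length")
        return {"name": name, "type": "string", "length": length,
                "primary_key": primary_key}

    # Other types (e.g. numeric columns) carry no length.
    return {"name": name, "type": col_type, "primary_key": primary_key}
```

With this rule, a call such as `column_spec("data_bot_id", "text")` yields a non-key column, while `data_signal` keeps its default `primary_key=True`.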
OsAffx.1.1.S1_at (Rice chip)

Conversion pipeline used by the gene expression endpoint:
1. is_probeset_id(gene_id) → skip conversion if already a probeset

Documentation or logic is wrong - you need to convert a probeset ID to a gene identifier.

3 participants