Add gene ID utils and refactor eFP expression endpoint by rmobmina · Pull Request #305 · BioAnalyticResource/BAR_API

rmobmina · 2026-03-23T17:21:53Z

Adds api/utils/gene_id_utils.py
- Introduces DATABASE_SPECIES lookup table mapping all eFP databases to their species
- Adds helper functions for gene ID validation, normalization, and AGI to probeset conversion
Refactors api/resources/gene_expression.py
- Uses shared utilities for gene ID handling
- Validates gene ID format using species-specific regex rules
- Strips maize transcript suffixes when needed
- Converts Arabidopsis AGI IDs to ATH1 probeset IDs before querying
Adds constants for special database behavior
- PROBESET_DATABASES - databases that store probeset IDs
- CROSS_SPECIES_DATABASES - databases that accept Arabidopsis AGI input even if the database species differs (e.g. phelipanche, striga)
Updates .flake8
- Suppresses E241 to allow aligned formatting in DATABASE_SPECIES
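
The helpers described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the database names, the regex patterns, the `_T<digits>` maize suffix convention, and the function names `validation_species`, `strip_maize_transcript_suffix`, and `is_valid_gene_id` are all assumptions made for illustration.

```python
import re

# Hypothetical database names; the real DATABASE_SPECIES table covers all eFP databases.
DATABASE_SPECIES = {
    "example_arabidopsis_db": "arabidopsis",
    "example_maize_db": "maize",
    "example_striga_db": "striga",
}

# Databases that accept Arabidopsis AGI input although their species differs.
CROSS_SPECIES_DATABASES = {"example_striga_db"}

# Assumed gene ID formats, for illustration only.
SPECIES_PATTERNS = {
    "arabidopsis": re.compile(r"AT[1-5CM]G\d{5}", re.IGNORECASE),
    "maize": re.compile(r"Zm\d{5}[a-z]\d{6}", re.IGNORECASE),
}

# Assumed maize transcript suffix convention, e.g. "_T001".
MAIZE_TRANSCRIPT_SUFFIX = re.compile(r"_T\d+$", re.IGNORECASE)


def strip_maize_transcript_suffix(gene_id):
    """Drop a trailing transcript suffix so the gene-level ID remains."""
    return MAIZE_TRANSCRIPT_SUFFIX.sub("", gene_id.strip())


def validation_species(database):
    """Pick the species whose ID rules apply to input for this database."""
    if database in CROSS_SPECIES_DATABASES:
        return "arabidopsis"
    return DATABASE_SPECIES.get(database)


def is_valid_gene_id(gene_id, database):
    """Return True when gene_id matches the rules for the database's species."""
    pattern = SPECIES_PATTERNS.get(validation_species(database))
    if pattern is None:
        return False
    return pattern.fullmatch(gene_id.strip()) is not None
```

Keeping the species lookup separate from the patterns means a cross-species database only needs an entry in one set rather than its own regex.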

- Replace manual REGEX with BARUtils validators for all species
- Add descriptive error messages for each species (e.g., "Invalid Cannabis gene ID")
- Add comprehensive Sphinx/reST docstrings to all ORM files
- Improve validation logic to handle both AGI and probeset IDs correctly
- Fix SQL injection vulnerability by validating schema identifiers
- Add identifier validation (alphanumeric + underscore only) before SQL construction

Security: Addresses CodeQL high-severity SQL injection alert by validating that all database/table/column identifiers match a safe pattern before use in queries
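
The identifier check described in this commit can be sketched as follows. The names `validate_identifier` and `build_signal_query`, and the column `data_probeset_id`, are hypothetical; only the "alphanumeric + underscore" rule comes from the commit message.

```python
import re

# Only ASCII letters, digits, and underscores are accepted.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def validate_identifier(name):
    """Return name unchanged if it is a safe SQL identifier, else raise."""
    if not isinstance(name, str) or _SAFE_IDENTIFIER.fullmatch(name) is None:
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name


def build_signal_query(database, table):
    """Build a query string; identifiers are validated, values stay parameterized."""
    db = validate_identifier(database)
    tbl = validate_identifier(table)
    # data_probeset_id is an assumed column name used for illustration.
    return (
        f"SELECT data_bot_id, data_signal FROM {db}.{tbl} "
        "WHERE data_probeset_id = :probeset"
    )
```

Identifiers cannot be passed as bound parameters, which is why they are checked against an allow-list pattern, while the gene or probeset value still goes through a placeholder.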

- Fix trailing whitespace and blank line issues in efp_proxy.py
- Fix missing blank lines in microarray_gene_expression.py
- Add noqa comments for E402 in test_efp_data.py (imports after sys.path modification)
- Remove unused variable retry_url
- Fix spacing after commas in long list
- Fix indentation issues

- Fix syntax error in efp_data.py (incomplete regex pattern)
- Refactor efp_data.py to use class-based static methods (EFPDataService) following the BARUtils pattern for consistency
- Add backward compatibility wrappers for existing function imports
- Simplify efp_schemas.py by grouping similar schemas:
  * Created _simple_schema() helper for 5-column schemas
  * Created _schema_with_qa_columns() helper for QA column schemas
  * Reduced code duplication while maintaining same functionality
- All changes tested with Docker build and Python syntax checks

- Updated _schema_with_qa_columns to properly handle length for string types
- Added file_name_len and call_len parameters with defaults
- Fixed shoot_apex schema to specify lengths for string columns (16, 2)
- Fixes KeyError: 'length' in efp_dynamic.py model generation
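
A sketch of how a schema helper can guarantee a `length` entry for string columns. Everything here is illustrative: the dict layout, the `column` helper, the column names `file_name` and `data_call`, the default of 255, and the assignment of 16 and 2 to `file_name_len` and `call_len` are assumptions, not the contents of efp_schemas.py.

```python
def column(col_type, length=None, primary_key=False):
    """Describe one column; string columns must carry an explicit length."""
    spec = {"type": col_type, "primary_key": primary_key}
    if col_type == "string":
        if length is None:
            raise ValueError("string columns need an explicit length")
        spec["length"] = length
    return spec


def schema_with_qa_columns(file_name_len=255, call_len=255):
    """Return a column mapping that includes two string QA columns."""
    return {
        "data_signal": column("float", primary_key=True),
        "file_name": column("string", length=file_name_len),  # hypothetical name
        "data_call": column("string", length=call_len),       # hypothetical name
    }


# Assumed mapping of the (16, 2) lengths mentioned in the commit message.
shoot_apex = schema_with_qa_columns(file_name_len=16, call_len=2)
```

With this shape, model-generation code that reads `spec["length"]` for a string column always finds the key, because a missing length fails when the schema is defined rather than when the model is built.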

- Fixes bootstrap script MySQL connection in Docker
- Bootstrap script now connects to BAR_mysqldb instead of localhost socket
- API starts successfully and serves on port 5000
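
One way to make the MySQL host configurable is to read it from the environment with a local fallback. This is a sketch under assumptions: the helper name `mysql_host` and the `localhost` default are hypothetical, and only the `DB_HOST` variable and the `BAR_mysqldb` host name appear in the commit history.

```python
import os


def mysql_host(environ=None):
    """Return the MySQL host from DB_HOST, falling back to localhost."""
    env = os.environ if environ is None else environ
    return env.get("DB_HOST", "localhost")
```

In a Docker setup the environment would set `DB_HOST=BAR_mysqldb`, so the bootstrap script reaches the database container by name instead of a local socket.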

- Check HTTP status code before parsing JSON response
- Return 502 Bad Gateway when external API fails or returns non-200
- Catch JSONDecodeError when external API returns invalid response
- Fixes test failure when atted.jp API is down or returns 403
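
The order of checks described here (status first, then JSON parsing) can be sketched independently of any HTTP client. The function name `parse_upstream_response` and the `wasSuccessful`/`error`/`data` payload keys are assumptions for illustration.

```python
import json


def parse_upstream_response(status_code, body):
    """Return (payload, http_status) for a reply from an external API."""
    # Refuse to parse anything that is not a 200 from the upstream service.
    if status_code != 200:
        error = f"Upstream API returned HTTP {status_code}"
        return {"wasSuccessful": False, "error": error}, 502
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # A 200 with a non-JSON body (e.g. an HTML error page) is still a gateway failure.
        error = "Upstream API returned invalid JSON"
        return {"wasSuccessful": False, "error": error}, 502
    return {"wasSuccessful": True, "data": data}, 200
```

Checking the status before calling `json.loads` keeps a 403 or an outage from surfacing as an unhandled decode error in the endpoint.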

MySQL doesn't allow TEXT/BLOB columns in primary keys without a key length. When bot_id_type='text' or probeset_type='text' is specified, the column type changes to TEXT but was still marked as primary_key=True, causing:

OperationalError: (1170, "BLOB/TEXT column 'data_bot_id' used in key specification without a key length")

Changes:
- In _simple_schema(), when probeset_type='text', set primary_key=False
- In _simple_schema(), when bot_id_type='text', set primary_key=False
- Ensures data_signal remains a primary key for all databases

Affected databases (11 total):
- affydb, canola, canola_original, canola_seed
- hnahal, human, humandb
- meristemdb, rohan, rpatel, tomato_atlas

All 58 tests pass with 5,643 subtests.
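
The primary-key rule described above can be sketched with plain dictionaries. The function name `simple_schema_columns`, the `"string"` default type, and the column name `data_probeset_id` are assumptions; `data_signal` and `data_bot_id` are taken from the commit message.

```python
def simple_schema_columns(probeset_type="string", bot_id_type="string"):
    """Return column specs where TEXT columns are never part of the primary key."""
    return {
        # data_probeset_id is an assumed column name used for illustration.
        "data_probeset_id": {
            "type": probeset_type,
            "primary_key": probeset_type != "text",
        },
        # data_signal stays in the primary key for every database.
        "data_signal": {"type": "float", "primary_key": True},
        "data_bot_id": {
            "type": bot_id_type,
            "primary_key": bot_id_type != "text",
        },
    }
```

Deriving `primary_key` from the column type in one place avoids the case where a schema switches a column to TEXT but forgets to drop it from the key.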

These scripts are called by config/init.sh to create simple eFP MySQL databases from schema definitions in CI, where SQLite mirrors are not available.

VinLau · 2026-03-23T18:44:29Z

api/utils/gene_id_utils.py

+  OsAffx.1.1.S1_at   (Rice chip)
+
+Conversion pipeline used by the gene expression endpoint:
+  1. is_probeset_id(gene_id)         → skip conversion if already a probeset


Documentation or logic is wrong - you need to convert a probeset ID to a gene identifier.

asherpasha and others added 30 commits March 23, 2026 13:16

Revert "Update master branch"

6b5f2be

edited microarray and added 2 new endpoints and created efp_proxy endpoints

3a00428

dynamic ORMs

47e0ded

tried to fix to pass

199dc91

Add fail-fast: false to see all Python test results

d4b97be

Add Sphinx/reST docstrings to dynamic ORM functions

26e4f0f

comment style

4173f8e

Add Q&A reference guide for supervisor meetings

6da9573

Add guide explaining schema definition vs database creation

800c9ec

got rid of inaccurate markdown files

510e7b0

flake8 ran

358ede2

Fix dynamic eFP query fallback for CI mirrors

ff50982

fixed syntax error

52a2def

Add DB_HOST environment variable for Docker MySQL connection

e185e62

Revert proxy changes - external API issue unrelated to EFP work

eca92c7

fixed?

c4741ad

fixed comments

e8411ee

added new schemas

45d2a6d

should pass all checks?

a5d07a8

checkcheck

b4d54c1

cleaned flake

e09790f

passed?

77bbf7d

updated schema

22f1dae

fixed metadata

b81db62

rmobmina added 7 commits March 23, 2026 13:18

set VARCHAR to 255

2271479

removed POST from efp proxy and simplified endpoints to a single one

e5ace17

add SQL Alchemy binds

9145384

Deleted some unused files, renamed efp proxy to gene_expression

270a40b

fixed naming error

2c4ef4a

restore bootstrap_simple_efp_dbs.py and efp_bootstrap.py needed for CI

fda1648

add gene ID detection and probeset conversion utilities for eFP expression endpoint

a2aa164

VinLau reviewed Mar 23, 2026

rmobmina wants to merge 37 commits into BioAnalyticResource:dev from rmobmina:fix/refactor-efp-data-and-schemas

3 participants