Skip to content

Add blobstorage for CRs#896

Open
badcount wants to merge 52 commits intomainfrom
badcount/etcd-blobstorage
Open

Add blobstorage for CRs#896
badcount wants to merge 52 commits intomainfrom
badcount/etcd-blobstorage

Conversation

@badcount
Copy link
Copy Markdown
Contributor

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

What changed?
Support blob storage for ETCD

Why?

How did you test it?

Potential risks

Release notes

Documentation Changes

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 11, 2026

📊 Combined Coverage Report

Component Coverage Baseline New Code Status
🐍 Python 82.43% ≥70% ≥90% ✅
🐹 Go 62.5% ≥60% ≥90% ✅
📜 JavaScript N/A% (no coverage data) ≥78.74% ≥90% ✅ ⏭️

Coverage Policy

Baseline (Existing Code):

  • Python: ≥70% (current coverage)
  • Go: ≥60% (current coverage)
  • JavaScript: ≥78.74% (current coverage)

New/Changed Code: ≥90% STRICTLY ENFORCED

Long-term Goal: Maintain and improve coverage baselines


📝 Coverage Guidelines
  • All new code must include tests
  • Coverage should not decrease from the base branch
  • Use # pragma: no cover sparingly for truly untestable code
  • Aim for meaningful tests, not just coverage numbers
🔗 Quick Links

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 16, 2026

Go Coverage Report (Bazel)

Total Coverage: 62.5%
Tests Passed: Executed

Coverage Policy:

  • Baseline (existing code): ≥60% (current coverage)
  • New/changed code: ≥90% ✅ STRICTLY ENFORCED
  • Long-term goal: Improve baseline to 90%
Coverage Summary
See artifacts for details
📊 How to view detailed coverage

Download the coverage artifacts from this run and open coverage-html/index.html

Or run locally:

bazel coverage //go/...
genhtml bazel-out/_coverage/_coverage_report.dat -o coverage-html
open coverage-html/index.html

Note: This project uses Bazel. Run bazel test //go/... locally to test.

View detailed HTML report in artifacts

@badcount badcount changed the base branch from badcount/feature-ingester2 to main March 16, 2026 16:26
himanshk96 and others added 27 commits March 16, 2026 09:33
…onvention

- Fix go/storage/mysql/BUILD.bazel: Rename target from 'mysql' to 'go_default_library'
- Fix go/storage/minio/BUILD.bazel: Rename target from 'minio' to 'go_default_library'
- Fix go/components/ingester/BUILD.bazel: Rename target from 'ingester' to 'go_default_library'
- Add missing MySQL driver dependency (github.com/go-sql-driver/mysql v1.9.3)
- Update MODULE.bazel to include com_github_go_sql_driver_mysql in use_repo
- Update all dependency references to use :go_default_library suffix

This fixes the build error:
'no such target //go/storage/mysql:go_default_library'

The codebase uses go_default_library naming convention for all Go targets
as specified by the gazelle configuration.
- Add ingester_config.go to go/base/config/BUILD.bazel srcs
- Add minio/mysql dependencies to go/base/config/BUILD.bazel
- Fix MinIO tags API: import tags package and rename parameter to avoid conflict
- Add tags dependency to go/storage/minio/BUILD.bazel
- Rename CrdObjects to AllCRDObjects in proto/api/v2/crd_objects.go to avoid conflict with generated code
- Fix TypeMeta usage in ingester controller: use metav1.TypeMeta instead of runtime.TypeMeta
- Add metav1 dependency to go/components/ingester/BUILD.bazel
- Fix TargetKind type: change from runtime.Object to client.Object
- Add client package import to ingester module.go
- Add type assertion for runtime.Object to client.Object in module registration

All ingester components now build successfully.
The mysql-ingester.yaml file was redundant as mysql.yaml already contains
all the necessary resources (ConfigMap and Job) for MySQL ingester setup.

Changes:
- Delete python/michelangelo/cli/sandbox/resources/mysql-ingester.yaml
- Remove reference from sandbox.py resources list
- Update documentation references in INGESTER_SANDBOX_INTEGRATION.md

This eliminates the confusing 'AlreadyExists' errors when running sandbox setup.
- Replace google.golang.org/protobuf with github.com/gogo/protobuf
- CRDs are generated using gogo/protobuf via kubeproto compiler
- This fixes 'object does not implement proto.Message' error in ingester
- Update BUILD.bazel deps to use @com_github_gogo_protobuf//proto
…upport

Schema Management:
- Create single canonical SQL file: scripts/ingester_schema.sql
  - Complete schema for ALL 13 CRDs (39 tables total)
  - Includes all indexed fields from protobuf GetIndexedKeyValuePairs()
  - Idempotent with CREATE TABLE IF NOT EXISTS
  - Works for sandbox and production

- Add Kubernetes Job: scripts/ingester_schema_job.yaml
  - Loads ingester_schema.sql via ConfigMap
  - Includes init container to wait for MySQL
  - Auto-cleanup with TTL

- Add shell script: scripts/init_ingester_db.sh
  - Unified script for any environment
  - Uses ingester_schema.sql directly
  - Environment variable configuration

- Add simple README.md with usage instructions

Cleanup:
- Delete deprecated mysql_schema.sql (incomplete, out of sync)
- Delete deprecated setup_mysql_sandbox.sh
- Delete redundant documentation files
- Update all references in documentation to new files

Sandbox Fixes:
- Fix sandbox.py to use GITHUB_USERNAME environment variable
  - Resolves hardcoded 'USERNAME' causing image pull failures
- Update image tags in controllermgr and apiserver YAMLs

Verified:
- Single SQL file tested and working
- Creates 39 tables (13 CRDs × 3 tables each)
- All indexed fields present
- Labels and annotations tables functional
- Enhanced init container to check DNS resolution, connectivity, and database access
- Increased wait time to 2 minutes (60 attempts)
- Added comprehensive documentation about health checks and deployment ordering
- Documented best practices for production deployment

The Job now ensures MySQL is fully healthy before executing schema SQL.
- Document health check behavior and retry logic
- Add deployment order best practices
- Include Helm/Kustomize integration notes
- Clarify that Job waits for MySQL health before executing
The ingester now stores everything in MySQL (including large fields).
This simplifies the implementation and removes unnecessary complexity.

Removed:
- go/storage/minio/ directory (new MinIO implementation)
- BlobStorage field from ingester Reconciler
- Blob storage handling in controller sync/delete logic
- MinIO config from ingester_config.go
- MinIO dependency from BUILD.bazel

Note: The existing MinIO blob storage (go/base/blobstore/minio/) is
UNCHANGED and continues to work for workers, jobs, and deployments.

Added:
- INGESTER_PRODUCTION_GAPS.md - Production readiness checklist

Result:
- Simpler codebase (removed ~250 lines)
- Easier to maintain
- All data in MySQL (easier to query)
- Can add blob storage later if needed as a separate feature
- Remove all .md documentation files (9 files)
- Remove MinIO configuration from config/sandbox_ingester.yaml
- Ingester is now MySQL-only, no blob storage
- Keep production code only
… ConfigMap

- Delete config/sandbox_ingester.yaml (not used, just an example)
- Add ingester config section to michelangelo-controllermgr ConfigMap
- All config is now hardcoded in the sandbox YAML files
- No separate config files needed
- Fix proto import paths: proto/api -> proto-go/api
- Add proto/api/v2 dependency to mysql BUILD.bazel
- All builds now pass successfully
This commit updates the ingester MySQL schema from a partial schema (5 CRDs)
to a complete schema with all 13 CRD types watched by the ingester.

Changes:
- Updated python/michelangelo/cli/sandbox/resources/mysql-ingester.yaml
  * Replaced partial schema with complete schema (all 13 CRDs)
  * Changed ConfigMap name from 'mysql-ingester-schema' to 'ingester-schema-init'
  * Updated to use lowercase table names matching ingester code (e.g., 'pipelinerun')
  * Added proper metadata labels for ingester component

- Added scripts/ingester-schema-init-job-complete.yaml
  * Standalone Kubernetes Job for schema initialization
  * Includes init container to wait for MySQL readiness
  * Verifies all 13 tables are created successfully
  * Useful for manual deployments outside sandbox

Schema now includes tables for all 13 CRD types:
1. Model
2. ModelFamily
3. Pipeline
4. PipelineRun
5. Deployment
6. InferenceServer
7. Project
8. Revision
9. Cluster
10. RayCluster
11. RayJob
12. SparkJob
13. TriggerRun

Testing:
- Verified all 9 available CRD types sync successfully to MySQL
- Confirmed table naming matches ingester code expectations
- Validated schema job completes successfully

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add docs/ingester-sandbox-validation.md with end-to-end validation results
  covering sandbox setup, MySQL schema verification, CR creation, and update
  propagation for all 13 CRD types
- Add scripts/ingester-test-crs/ with sample CR YAML files for each CRD
  (Project, ModelFamily, Model, Pipeline, PipelineRun, InferenceServer,
  Revision, Cluster, RayCluster, RayJob, SparkJob, TriggerRun, Deployment)
- Documents known issues found: SparkJob controller double-panic (pre-existing),
  TriggerRun business controller deletion, Cluster namespace restriction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- go/components/ingester: proto-go/api -> proto/api in test deps
- go/storage/mysql: remove unused proto/api/v2 dep, fix dep ordering
- proto/api/v2: move srcs before embed per gazelle convention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous gazelle run was executed while a git worktree existed at
.claude/worktrees/bash-logging, causing gazelle to see ambiguous package
paths and strip all internal deps from BUILD.bazel files across the repo.

Removed the stale worktree and re-ran gazelle cleanly, which correctly
restored internal dependencies (//go/storage, //go/api, //proto/api, etc.)
in 113 BUILD.bazel files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
go-sql-driver/mysql is directly imported by go/storage/mysql/mysql.go,
so go mod tidy moves it from indirect to the require block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
himanshk96 and others added 25 commits March 16, 2026 10:13
…onvention

- Fix go/storage/mysql/BUILD.bazel: Rename target from 'mysql' to 'go_default_library'
- Fix go/storage/minio/BUILD.bazel: Rename target from 'minio' to 'go_default_library'
- Fix go/components/ingester/BUILD.bazel: Rename target from 'ingester' to 'go_default_library'
- Add missing MySQL driver dependency (github.com/go-sql-driver/mysql v1.9.3)
- Update MODULE.bazel to include com_github_go_sql_driver_mysql in use_repo
- Update all dependency references to use :go_default_library suffix

This fixes the build error:
'no such target //go/storage/mysql:go_default_library'

The codebase uses go_default_library naming convention for all Go targets
as specified by the gazelle configuration.
- Add ingester_config.go to go/base/config/BUILD.bazel srcs
- Add minio/mysql dependencies to go/base/config/BUILD.bazel
- Fix MinIO tags API: import tags package and rename parameter to avoid conflict
- Add tags dependency to go/storage/minio/BUILD.bazel
- Rename CrdObjects to AllCRDObjects in proto/api/v2/crd_objects.go to avoid conflict with generated code
- Fix TypeMeta usage in ingester controller: use metav1.TypeMeta instead of runtime.TypeMeta
- Add metav1 dependency to go/components/ingester/BUILD.bazel
- Fix TargetKind type: change from runtime.Object to client.Object
- Add client package import to ingester module.go
- Add type assertion for runtime.Object to client.Object in module registration

All ingester components now build successfully.
The mysql-ingester.yaml file was redundant as mysql.yaml already contains
all the necessary resources (ConfigMap and Job) for MySQL ingester setup.

Changes:
- Delete python/michelangelo/cli/sandbox/resources/mysql-ingester.yaml
- Remove reference from sandbox.py resources list
- Update documentation references in INGESTER_SANDBOX_INTEGRATION.md

This eliminates the confusing 'AlreadyExists' errors when running sandbox setup.
- Replace google.golang.org/protobuf with github.com/gogo/protobuf
- CRDs are generated using gogo/protobuf via kubeproto compiler
- This fixes 'object does not implement proto.Message' error in ingester
- Update BUILD.bazel deps to use @com_github_gogo_protobuf//proto
…upport

Schema Management:
- Create single canonical SQL file: scripts/ingester_schema.sql
  - Complete schema for ALL 13 CRDs (39 tables total)
  - Includes all indexed fields from protobuf GetIndexedKeyValuePairs()
  - Idempotent with CREATE TABLE IF NOT EXISTS
  - Works for sandbox and production

- Add Kubernetes Job: scripts/ingester_schema_job.yaml
  - Loads ingester_schema.sql via ConfigMap
  - Includes init container to wait for MySQL
  - Auto-cleanup with TTL

- Add shell script: scripts/init_ingester_db.sh
  - Unified script for any environment
  - Uses ingester_schema.sql directly
  - Environment variable configuration

- Add simple README.md with usage instructions

Cleanup:
- Delete deprecated mysql_schema.sql (incomplete, out of sync)
- Delete deprecated setup_mysql_sandbox.sh
- Delete redundant documentation files
- Update all references in documentation to new files

Sandbox Fixes:
- Fix sandbox.py to use GITHUB_USERNAME environment variable
  - Resolves hardcoded 'USERNAME' causing image pull failures
- Update image tags in controllermgr and apiserver YAMLs

Verified:
- Single SQL file tested and working
- Creates 39 tables (13 CRDs × 3 tables each)
- All indexed fields present
- Labels and annotations tables functional
The ingester now stores everything in MySQL (including large fields).
This simplifies the implementation and removes unnecessary complexity.

Removed:
- go/storage/minio/ directory (new MinIO implementation)
- BlobStorage field from ingester Reconciler
- Blob storage handling in controller sync/delete logic
- MinIO config from ingester_config.go
- MinIO dependency from BUILD.bazel

Note: The existing MinIO blob storage (go/base/blobstore/minio/) is
UNCHANGED and continues to work for workers, jobs, and deployments.

Added:
- INGESTER_PRODUCTION_GAPS.md - Production readiness checklist

Result:
- Simpler codebase (removed ~250 lines)
- Easier to maintain
- All data in MySQL (easier to query)
- Can add blob storage later if needed as a separate feature
- Fix proto import paths: proto/api -> proto-go/api
- Add proto/api/v2 dependency to mysql BUILD.bazel
- All builds now pass successfully
This commit updates the ingester MySQL schema from a partial schema (5 CRDs)
to a complete schema with all 13 CRD types watched by the ingester.

Changes:
- Updated python/michelangelo/cli/sandbox/resources/mysql-ingester.yaml
  * Replaced partial schema with complete schema (all 13 CRDs)
  * Changed ConfigMap name from 'mysql-ingester-schema' to 'ingester-schema-init'
  * Updated to use lowercase table names matching ingester code (e.g., 'pipelinerun')
  * Added proper metadata labels for ingester component

- Added scripts/ingester-schema-init-job-complete.yaml
  * Standalone Kubernetes Job for schema initialization
  * Includes init container to wait for MySQL readiness
  * Verifies all 13 tables are created successfully
  * Useful for manual deployments outside sandbox

Schema now includes tables for all 13 CRD types:
1. Model
2. ModelFamily
3. Pipeline
4. PipelineRun
5. Deployment
6. InferenceServer
7. Project
8. Revision
9. Cluster
10. RayCluster
11. RayJob
12. SparkJob
13. TriggerRun

Testing:
- Verified all 9 available CRD types sync successfully to MySQL
- Confirmed table naming matches ingester code expectations
- Validated schema job completes successfully

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add docs/ingester-sandbox-validation.md with end-to-end validation results
  covering sandbox setup, MySQL schema verification, CR creation, and update
  propagation for all 13 CRD types
- Add scripts/ingester-test-crs/ with sample CR YAML files for each CRD
  (Project, ModelFamily, Model, Pipeline, PipelineRun, InferenceServer,
  Revision, Cluster, RayCluster, RayJob, SparkJob, TriggerRun, Deployment)
- Documents known issues found: SparkJob controller double-panic (pre-existing),
  TriggerRun business controller deletion, Cluster namespace restriction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- go/components/ingester: proto-go/api -> proto/api in test deps
- go/storage/mysql: remove unused proto/api/v2 dep, fix dep ordering
- proto/api/v2: move srcs before embed per gazelle convention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous gazelle run was executed while a git worktree existed at
.claude/worktrees/bash-logging, causing gazelle to see ambiguous package
paths and strip all internal deps from BUILD.bazel files across the repo.

Removed the stale worktree and re-ran gazelle cleanly, which correctly
restored internal dependencies (//go/storage, //go/api, //proto/api, etc.)
in 113 BUILD.bazel files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
go-sql-driver/mysql is directly imported by go/storage/mysql/mysql.go,
so go mod tidy moves it from indirect to the require block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GetDeletionGracePeriodSeconds() returns the static value set at deletion
time, not a countdown. Checking it caused a livelock: if the grace period
was configured > 10s, the reconciler requeued forever and never completed
the deletion.

The API Server already enforces the grace period before the delete event
reaches controllers, so by the time we see DeletionTimestamp != nil the
grace period has been respected. Just proceed directly to MySQL delete
and finalizer removal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lear before MySQL write

- protoc-gen-kubeproto: extend findBlobFields to mark repeated field segments with
  "[]" suffix so genClearCrdFields emits range loops instead of direct nil
  assignments, which would produce invalid Go for slice fields.
  Add blobFieldSegment, parsePathSegments, hasRepeatedSegment, and
  genClearRepeatedBlobField (recursive, handles nested repeated segments).
  Add TestGenClearRepeatedBlobField covering single/double/mixed paths.

- pipeline_run.proto: annotate PipelineRunStepInfo.output and .input with
  external_storage = true so large Struct payloads are stored in blob (S3)
  rather than etcd/MySQL.

- pipeline_run.pb.go: hand-update ClearBlobFields with the range loops that
  the fixed generator would now produce for Status.Steps[].Output and
  Status.Steps[].Input.

- blobstorage/utils.go: implement Phase 2 of the write path — deep-copy obj,
  call ClearBlobFields on the copy, and upsert the stripped copy to MySQL.
  The original obj is left intact. This keeps large payloads out of etcd
  while the blob (S3) retains the full object for later restoration.

- blobstorage/handler_test.go: update TestHandler_ClearBlobFields_Steps to
  reflect the surgical design (steps survive, only Input/Output are cleared).
  Add end-to-end test TestHandleUpdate_PipelineRun_StepInputOutputClearedInMySQL_RestoredOnGet
  that verifies: S3 has full data, MySQL has cleared Input/Output, and API GET
  restores the payloads via MergeWithExternalBlob while preserving ETCD metadata.

- Add README for protoc-gen-kubeproto documenting external_storage, repeated
  field handling, and FillBlobFields semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@badcount badcount force-pushed the badcount/etcd-blobstorage branch from 0276a55 to a4807a6 Compare March 16, 2026 21:27
@github-actions
Copy link
Copy Markdown

🔍 Go Lint & TODO Tracking Results

⚠️ golangci-lint: Issues detected

level=warning msg="[config_reader] The configuration option `run.skip-dirs` is deprecated, please use `issues.exclude-dirs`."
level=warning msg="[config_reader] The configuration option `output.format` is deprecated, please use `output.formats`"
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)

✅ TODO Tracking: All TODOs properly linked to issues

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 24, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ badcount
❌ himanshk96
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants