refactor: configurations, and introduce CUDA features (#53)
Merged
- Increase MAX_SESSIONS to 1000 and set SESSION_TTL_HOURS to 24. - Update CPP_SERVER_URL to point to localhost:8080. - Fix a typo in TORCH_CHECKPOINT_PATH (removed trailing space before extension). - Set REQUEST_TIMEOUT_SECONDS to 60.
- Sync default field values with new environment baseline. - Remove duplicate `torch_checkpoint_path` definition and fix the trailing space typo in the file extension. - Update `request_timeout_seconds` to a float default of 60.0.
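The defaults named in the two commits above could be mirrored in a settings object along the following lines. This is a minimal sketch only: the `Settings` class name, the frozen dataclass, the `from_env` helper, and the `http://` scheme are assumptions for illustration, not the contents of `backend/config.py`. The checkpoint path is left out because the PR text does not state its value.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder; field names follow the PR text."""

    max_sessions: int = 1000
    session_ttl_hours: int = 24
    cpp_server_url: str = "http://localhost:8080"  # scheme is an assumption
    request_timeout_seconds: float = 60.0

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping, falling back to the defaults."""
        env = os.environ if env is None else env
        d = cls()
        return cls(
            max_sessions=int(env.get("MAX_SESSIONS", d.max_sessions)),
            session_ttl_hours=int(env.get("SESSION_TTL_HOURS", d.session_ttl_hours)),
            cpp_server_url=str(env.get("CPP_SERVER_URL", d.cpp_server_url)),
            request_timeout_seconds=float(
                env.get("REQUEST_TIMEOUT_SECONDS", d.request_timeout_seconds)
            ),
        )
```

Passing an explicit mapping, for example `Settings.from_env({"REQUEST_TIMEOUT_SECONDS": "30"})`, keeps the sketch testable without touching the process environment.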
Implement dataset evaluation logic (`estimate_loss`) and standard training loops. - Add an interactive `--chat` interface mode utilizing token stream generation. - Configure automatic hardware routing between CUDA and CPU execution environments.
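The automatic CUDA/CPU routing mentioned above might look roughly like this. The helper name `pick_device` is hypothetical, and the fallback when PyTorch is not installed is an assumption added so the sketch runs anywhere; the PR's actual routing code is not shown in this page.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"running on: {pick_device()}")
```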
…urations - Update environment variables configuration and fix the trailing space typo in TORCH_CHECKPOINT_PATH. - Remove duplicate definition of torch_checkpoint_path in backend/config.py. - Decrease evaluation intervals (EVAL_INTERVAL) in the C++ engine for quicker validation tracking. - Add LibTorch C++ execution and interactive chat stream handler in torch_main.cpp. - Implement state migration (v1) in frontend settings store to default clients to the PyTorch backend.
- Remove the legacy typo (trailing space before extension) from the PyTorch checkpoint string in the header's logic.
- Update the tokenizer configuration in `llm.py` to use the `o200k_base` encoding. - Expand tokenization capabilities and adjust the `vocab_size` dynamically to support the updated vocabulary baseline.
… architecture - Replace static `generate_response` with a generator-based `stream_response` utilizing token yielding. - Update `GPTLanguageModel.generate` to act as an iterator yielding sequential token IDs instead of returning a complete array. - Implement token-by-token decoding (`decode([token_id])`) to support real-time user-interface updates. - Keep `generate_response` as a backward-compatible utility that aggregates the token stream.
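The relationship between the iterator-style `generate`, `stream_response`, and the backward-compatible `generate_response` can be sketched as below. The function signatures, the `next_token` callback, and the toy alphabet model are assumptions for illustration; the real `GPTLanguageModel` samples from logits and uses the project's tokenizer.

```python
from typing import Callable, Iterator, List, Sequence

NextToken = Callable[[List[int]], int]
Decode = Callable[[List[int]], str]


def generate(next_token: NextToken, prompt_ids: Sequence[int],
             max_new_tokens: int) -> Iterator[int]:
    """Yield token IDs one at a time instead of returning a full array."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        token_id = next_token(context)
        context.append(token_id)
        yield token_id


def stream_response(next_token: NextToken, decode: Decode,
                    prompt_ids: Sequence[int],
                    max_new_tokens: int) -> Iterator[str]:
    """Decode each token as it arrives so a UI can update in real time."""
    for token_id in generate(next_token, prompt_ids, max_new_tokens):
        yield decode([token_id])


def generate_response(next_token: NextToken, decode: Decode,
                      prompt_ids: Sequence[int], max_new_tokens: int) -> str:
    """Backward-compatible helper: aggregate the stream into one string."""
    return "".join(stream_response(next_token, decode, prompt_ids, max_new_tokens))


# Toy stand-ins (hypothetical): the "model" emits the next letter index.
def toy_next_token(context: List[int]) -> int:
    return (context[-1] + 1) % 26


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(ord("a") + i) for i in ids)


if __name__ == "__main__":
    for piece in stream_response(toy_next_token, toy_decode, [0], 3):
        print(piece, end="", flush=True)
    print()
```

One caveat with per-token decoding: with byte-level BPE vocabularies a single token can hold only part of a multi-byte UTF-8 character, so a production streamer may need to buffer until a complete character is available.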
- Multi-stage Dockerfile (CPU default, CUDA-ready via BASE_IMAGE arg) - Single container running FastAPI + React frontend via supervisord - Model weights mounted as volume at runtime (/app/models) - docker-compose.yml for local development - GitHub Actions workflow publishing to ghcr.io on master push and version tags - .dockerignore to keep build context clean
Ignore specific paths for push and pull request events.
- Add `adamw_kernel` for parallelized parameter updates on CUDA. - Implement host-side `adamw_update` with shape and parameter validation. - Guard device transitions using `DeviceGuard`.
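As a reference for the per-element arithmetic, an AdamW step with decoupled weight decay can be written in plain Python as follows. This is a sketch under assumptions: the signature, the default hyperparameters, and the specific validation checks are illustrative and are not taken from the PR's `adamw_update`. Each loop iteration corresponds to the work one CUDA thread would typically do for one parameter.

```python
import math
from typing import List


def adamw_update(params: List[float], grads: List[float],
                 m: List[float], v: List[float], step: int,
                 lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8, weight_decay: float = 0.01) -> None:
    """Apply one in-place AdamW step to flat parameter and state buffers."""
    n = len(params)
    # Host-side validation (assumed checks, mirroring "shape and parameter
    # validation" in spirit only).
    if not (len(grads) == len(m) == len(v) == n):
        raise ValueError("params, grads, m and v must have the same length")
    if step < 1:
        raise ValueError("step must start at 1 for bias correction")

    bias_c1 = 1.0 - beta1 ** step
    bias_c2 = 1.0 - beta2 ** step

    for i in range(n):  # one iteration ~ one GPU thread
        g = grads[i]
        params[i] *= 1.0 - lr * weight_decay          # decoupled weight decay
        m[i] = beta1 * m[i] + (1.0 - beta1) * g       # first moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g   # second moment
        m_hat = m[i] / bias_c1
        v_hat = v[i] / bias_c2
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

Because every element is updated independently, the loop body maps directly onto a one-thread-per-parameter launch with no inter-thread communication.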
- Add `attention_forward_kernel` supporting causal masking and online softmax. - Implement parallelized block reduction helpers `block_sum` and `block_max`. - Use shared memory for intermediate warp-level reduction. - Add safety checks for contiguous F32 CUDA tensors and integer bounds.
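The combination of causal masking and online softmax can be illustrated with a sequential, single-head reference in plain Python. This is a sketch, not the kernel: the function name, the list-of-lists layout, and the `1/sqrt(D)` scaling are assumptions, and the running maximum and running sum here are serial stand-ins for the quantities a kernel would presumably obtain through reductions such as `block_max` and `block_sum`.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_attention_online(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head causal attention using a one-pass (online) softmax.

    q, k: T rows of head dimension D; v: T rows of value dimension Dv.
    """
    seq_len = len(q)
    scale = 1.0 / math.sqrt(len(q[0]))
    out: Matrix = []
    for t in range(seq_len):
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(v[0])
        for s in range(t + 1):  # causal mask: position t sees only s <= t
            score = scale * sum(a * b for a, b in zip(q[t], k[s]))
            new_max = max(running_max, score)
            # Rescale previously accumulated terms to the new maximum;
            # exp(-inf) is 0.0, which handles the first iteration.
            correction = math.exp(running_max - new_max)
            weight = math.exp(score - new_max)
            running_sum = running_sum * correction + weight
            acc = [a * correction + weight * x for a, x in zip(acc, v[s])]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

The rescaling step is what lets the softmax be computed in one pass without first scanning all scores for their maximum, while still subtracting a maximum for numerical stability.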
…at and unused dependencies
codeaddict-119
approved these changes
May 25, 2026
Collaborator

Summary
Causal Multi-Head Attention Forward Pass (CUDA)
This PR implements the CUDA forward pass for causal multi-head attention (attention_forward). It includes the core GPU kernel, custom block-level reduction primitives, and tensor validation helpers.
Core Attention Kernel (attention_forward_kernel):
#52
#11
#12
#14
#29