127 changes: 127 additions & 0 deletions unstructured2graph/spec/sso-design.md
@@ -0,0 +1,127 @@
# SSO (Single Sign-On) Design for unstructured2graph

**Status:** Discovery / Requirements Gathering
**Date:** 2026-03-31

## 1. System Landscape

The SSO solution must unify authentication across three services (Memgraph, Memgraph Lab, and LightRAG) and the unstructured2graph library that connects to them:

| Component | Protocol/Interface | Current Auth |
|---|---|---|
| **Memgraph** (database) | Bolt (port 7687) | Username/password via env vars (`MEMGRAPH_URL`, `MEMGRAPH_USER`, `MEMGRAPH_PASSWORD`). Enterprise supports OIDC (Entra ID, Okta), SAML, and LDAP via external auth modules. |
| **Memgraph Lab** (UI) | HTTP (browser) | Enterprise Docker deployment supports OIDC and SAML SSO with Entra ID and Okta. Desktop version has no SSO. |
| **LightRAG** (AI/RAG API) | REST/FastAPI (port 9621) | API key (`X-API-Key` header) and/or JWT tokens (OAuth2 password flow at `/login`). Supports bcrypt-hashed passwords, token expiry, auto-renewal, and path whitelisting. |
| **unstructured2graph** (this project) | Python library | No auth layer — inherits Memgraph connection credentials from env vars and LightRAG wrapper config. |

## 2. Approach Options

### Option A: Centralized Identity Provider (IdP) with OIDC

A single IdP (e.g., Entra ID, Okta, Keycloak, Auth0) issues tokens consumed by all components.

- Memgraph Enterprise: native OIDC support via external auth module
- Memgraph Lab Enterprise (Docker): native OIDC support
- LightRAG: would need a custom auth dependency or reverse-proxy layer that validates OIDC tokens
- unstructured2graph: token-passing middleware or service account credentials

### Option B: Token Broker / API Gateway

An API gateway (e.g., Kong, APISIX, Nginx + OAuth2-proxy) sits in front of LightRAG and unstructured2graph, handling authentication centrally. Memgraph uses its own OIDC/LDAP integration.

### Option C: Shared JWT with a Lightweight Auth Service

A custom auth service issues JWTs after authenticating against the IdP. All services validate the same JWT. Memgraph's external auth module is extended to validate these JWTs on Bolt connections.

### Option D: Service Accounts + User SSO (Hybrid)

User-facing SSO via OIDC for Memgraph Lab and a web frontend, while backend services (unstructured2graph, LightRAG) use service account credentials or short-lived tokens minted from the IdP's client-credentials flow.
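
To make the client-credentials part of Option D concrete, here is a minimal sketch of the token request a backend service would send to the IdP. It uses only the standard library and builds the request without sending it. The token URL, client ID, secret, and scope are hypothetical placeholders, and real IdPs may expect a different client authentication method.

```python
import base64
import urllib.parse
import urllib.request


def build_client_credentials_request(
    token_url: str, client_id: str, client_secret: str, scope: str
) -> urllib.request.Request:
    """Build (but do not send) an OAuth2 client-credentials token request."""
    # Form-encoded body for the client-credentials grant (RFC 6749, section 4.4).
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    # The service authenticates itself with HTTP Basic auth.
    basic = base64.b64encode(
        f"{client_id}:{client_secret}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
req = build_client_credentials_request(
    "https://idp.example/oauth2/token", "svc", "secret", "lightrag.query"
)
```

Sending `req` with `urllib.request.urlopen` would return a JSON document whose `access_token` field the service then presents to LightRAG or Memgraph.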

---

## 3. Token Type Considerations

| Token Type | Pros | Cons |
|---|---|---|
| **OIDC ID Token (JWT)** | Standard claims (`sub`, `email`, `roles`), widely supported, Memgraph native support | Short-lived, needs refresh flow, not designed for API authorization |
| **OIDC Access Token (opaque or JWT)** | Designed for API auth, scoped, Memgraph accepts it | May require introspection endpoint if opaque |
| **Custom JWT** | Full control over claims, single validation across services | Must build issuance and rotation; not natively recognized by Memgraph |
| **API Key** | Simple, LightRAG supports it natively | No identity, no expiry by default, hard to revoke at scale |
| **SAML Assertion** | Enterprise-friendly, Memgraph supports it | XML-based, not practical for API-to-API calls |
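
When comparing JWT-based token types, it can help to look at the claims a token actually carries. The sketch below decodes the payload of a compact JWT with the standard library. It does not verify the signature, so it is suitable for inspection and debugging only, never for authorization. The sample token is hand-built and unsigned.

```python
import base64
import json


def peek_jwt_claims(token: str) -> dict:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT (expected three dot-separated parts)")
    payload_b64 = parts[1]
    # base64url encoding omits padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(data: dict) -> str:
    """Helper for building the sample token below."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hand-built, unsigned sample token for illustration only.
sample = ".".join(
    [_b64url({"alg": "none"}), _b64url({"sub": "alice", "exp": 1775000000}), "sig"]
)
claims = peek_jwt_claims(sample)
```

An opaque access token has no such structure, so `peek_jwt_claims` raises `ValueError` for it. That is the case where the table's introspection-endpoint caveat applies.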

---

## 4. Client Connection Patterns

| Client | Connects To | Mechanism |
|---|---|---|
| Browser user | Memgraph Lab | OIDC redirect flow (authorization code + PKCE) |
| Browser user | LightRAG WebUI | OIDC redirect or JWT from `/login` |
| Python SDK (unstructured2graph) | Memgraph (Bolt) | Token passed via Neo4j driver auth (`access_token=...;id_token=...`) |
| Python SDK (unstructured2graph) | LightRAG API | Bearer token or API key in HTTP header |
| CI/CD / batch jobs | Memgraph + LightRAG | Client-credentials (machine-to-machine) tokens |
| Memgraph Lab | Memgraph (Bolt) | Forwards user's SSO token to database |
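
The two Python SDK rows above could look roughly like the following sketch. The helper names are hypothetical and not part of unstructured2graph. The `scheme` value passed to the Neo4j driver is an assumption and has to match whatever name the Memgraph auth module is configured under.

```python
from typing import Dict, Optional


def bolt_sso_credentials(access_token: str, id_token: str) -> str:
    """Build the `access_token=...;id_token=...` string shown in the table."""
    return f"access_token={access_token};id_token={id_token}"


def lightrag_headers(
    bearer_token: Optional[str] = None, api_key: Optional[str] = None
) -> Dict[str, str]:
    """Build HTTP headers for LightRAG: a Bearer token, an API key, or both."""
    headers: Dict[str, str] = {}
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    if api_key:
        headers["X-API-Key"] = api_key
    return headers


def connect_memgraph(uri: str, credentials: str, scheme: str = "oidc-entra-id"):
    """Open a Bolt driver using SSO credentials (requires the `neo4j` package)."""
    # Imported lazily so the helpers above work without the driver installed.
    from neo4j import Auth, GraphDatabase

    # ASSUMPTION: "oidc-entra-id" is only a placeholder scheme name here.
    return GraphDatabase.driver(
        uri, auth=Auth(scheme=scheme, principal="", credentials=credentials)
    )
```

Keeping the string and header construction separate from the driver call makes the token plumbing easy to unit-test without a running Memgraph or LightRAG instance.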

---

## 5. LightRAG AI Authentication

LightRAG's API provides RAG queries, document management, and an Ollama-compatible chat interface. Authentication options:

1. **Extend LightRAG's FastAPI auth dependency** — replace `get_combined_auth_dependency()` in `lightrag/api/utils_api.py` with OIDC token validation (e.g., using `python-jose` or `authlib`).
2. **Reverse proxy with token validation** — Nginx/Envoy validates OIDC tokens before forwarding to LightRAG; LightRAG runs unauthenticated behind the proxy.
3. **Keep LightRAG's JWT but federate login** — LightRAG's `/login` endpoint delegates to the IdP instead of local username/password. The `AuthHandler` class in `lightrag/api/auth.py` is the extension point.
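
Whichever of the three options is chosen, the component that accepts OIDC tokens has to check a few claims after the signature is verified. The sketch below covers only those claim checks and assumes a library such as `authlib` or `python-jose` has already verified the signature and returned the claims as a mapping. The function and exception names are hypothetical.

```python
import time
from typing import Any, Mapping, Optional


class ClaimError(Exception):
    """Raised when a verified token carries unacceptable claims."""


def check_claims(
    claims: Mapping[str, Any],
    issuer: str,
    audience: str,
    now: Optional[float] = None,
    leeway: float = 30.0,
) -> None:
    """Check iss, aud and exp on claims whose signature is ALREADY verified."""
    now = time.time() if now is None else now

    if claims.get("iss") != issuer:
        raise ClaimError("unexpected issuer")

    # `aud` may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        raise ClaimError("token not intended for this service")

    # Allow a small clock-skew leeway when comparing against `exp`.
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now > exp + leeway:
        raise ClaimError("token expired or missing exp")
```

In option 1 this logic would sit inside the FastAPI dependency, in option 2 the proxy performs the equivalent checks, and in option 3 it would run once at login before LightRAG issues its own JWT.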

---

## 6. Follow-Up Questions

Before designing the solution, I need clarity on the following:

### Deployment & Infrastructure

1. **Are you running Memgraph Enterprise or Community?** OIDC/SAML SSO and RBAC are Enterprise-only features. If Community, we're limited to basic username/password on Bolt, and the SSO scope shrinks to LightRAG + any web layer.

2. **How is the stack deployed?** Docker Compose? Kubernetes? Bare metal? This affects whether an API gateway or sidecar proxy approach is practical.

3. **Is Memgraph Lab deployed via Docker or used as a desktop app?** SSO in Lab requires the Docker deployment.

4. **Is there an existing Identity Provider (IdP)?** If so, which one — Entra ID, Okta, Keycloak, Google Workspace, or something else? Or do we need to stand one up (e.g., Keycloak)?

### Users & Access Patterns

5. **Who are the users?** Internal team only, or external/customer-facing? This affects the complexity of the auth flow and whether self-registration is needed.

6. **Do you need role-based access control (RBAC)?** For example: some users can only query (read), others can ingest documents (write), admins manage the system. If so, where should roles be enforced — at the IdP level, Memgraph level, LightRAG level, or all?

7. **Are there machine-to-machine (M2M) clients?** For example, CI/CD pipelines, cron jobs, or other services calling unstructured2graph programmatically. These typically need client-credentials flow rather than user login.

8. **Is multi-tenancy a concern?** Should different teams/orgs see different subsets of the knowledge graph?

### LightRAG Specifics

9. **Are you running LightRAG as a standalone server (`lightrag-server`) or only as an embedded Python library?** The SSO approach differs significantly — a server has HTTP endpoints to protect, while a library delegates auth to the calling application.

10. **Do you need to secure the LightRAG WebUI separately from the API?** The WebUI currently falls back to guest access if no auth accounts are configured.

11. **Which LightRAG endpoints need protection?** All of them, or only write operations (document upload, graph mutation) while queries remain open?

### Token & Session Management

12. **What token lifetime is acceptable?** Short-lived (minutes) with refresh, or longer-lived (hours)? This affects UX for interactive users vs. batch processes.

13. **Do you need token revocation?** For example, if someone leaves the team, can you wait for tokens to expire, or must you revoke immediately?

14. **Should the Python SDK (unstructured2graph) handle token refresh transparently**, or is it acceptable to require callers to manage their own tokens?
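
If the answer to question 14 is that the SDK should refresh transparently, one possible shape is a small cache that re-fetches the token shortly before it expires. This is a sketch only: `TokenCache` and the `fetch` callable are hypothetical, and `fetch` is assumed to return a token together with its absolute expiry time in epoch seconds.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Cache a token and re-fetch it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        skew: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch  # returns (token, expires_at_epoch_seconds)
        self._skew = skew  # refresh this many seconds before expiry
        self._clock = clock  # injectable for testing
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a token that is valid for at least `skew` more seconds."""
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Callers would invoke `cache.get()` before each request. Batch jobs and interactive users can then share the same code path, differing only in how `fetch` obtains the token.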

### Scope & Priority

15. **Is this for a production deployment or a development/POC setup?** This determines how much infrastructure (IdP, gateway, TLS) to invest in upfront.

16. **What's the priority order?** If we can't do everything at once, which is most important:
    - Securing Memgraph Bolt connections?
    - SSO for Memgraph Lab?
    - Protecting LightRAG API/WebUI?
    - Unified identity across all three?

17. **Are there compliance requirements** (SOC2, HIPAA, GDPR) that constrain token storage, session length, or audit logging?
111 changes: 111 additions & 0 deletions unstructured2graph/spec/sso-shared-token-patch.md
@@ -0,0 +1,111 @@
# Patch: Hot-Reloadable API Key for LightRAG

**Status:** Proposal
**Date:** 2026-04-01
**Targets:** [LightRAG](https://github.com/HKUDS/LightRAG) upstream

## Problem

`LIGHTRAG_API_KEY` is read once at startup and captured in a closure. Rotating the key requires restarting the server, which drops in-flight requests and interrupts users.

## Root Cause

In `lightrag/api/utils_api.py`, `get_combined_auth_dependency(api_key)` receives the key as a parameter and closes over it:

```python
# Current code (simplified from utils_api.py)
def get_combined_auth_dependency(api_key: Optional[str] = None):
    api_key_configured = api_key is not None

    async def combined_dependency(
        request: Request,
        api_key_header_value: Optional[str] = Security(api_key_header),
        token: Optional[str] = Depends(oauth2_scheme_optional),
    ):
        # ...
        if api_key_configured and api_key_header_value and api_key_header_value == api_key:
            return  # API key validation successful
        # ...

    return combined_dependency
```

The `api_key` variable is bound at dependency-creation time, not at request time.

## Fix

Re-read `os.environ` on each request instead of comparing against the closure-captured value.
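
The difference between the two behaviors can be reproduced outside LightRAG. The following self-contained sketch contrasts a checker that closes over the key with one that looks it up on every call. The function names and the `DEMO_API_KEY` variable are illustrative only and do not exist in LightRAG.

```python
import os


def make_checker_captured(api_key):
    # Mirrors the current behavior: the key is frozen when the checker is built.
    def check(candidate):
        return api_key is not None and candidate == api_key

    return check


def make_checker_live(initial_key):
    # Mirrors the proposed fix: look the key up again on every call.
    def check(candidate):
        current = os.environ.get("DEMO_API_KEY") or initial_key
        return current is not None and candidate == current

    return check


os.environ["DEMO_API_KEY"] = "old"
captured = make_checker_captured(os.environ["DEMO_API_KEY"])
live = make_checker_live(os.environ["DEMO_API_KEY"])

os.environ["DEMO_API_KEY"] = "new"  # rotate the key inside this process

results = {
    "captured_accepts_new": captured("new"),
    "captured_accepts_old": captured("old"),
    "live_accepts_new": live("new"),
    "live_accepts_old": live("old"),
}
```

After the in-process rotation, the captured checker still accepts only the old key, while the live checker accepts only the new one.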

### Patch for `lightrag/api/utils_api.py`

```diff
 def get_combined_auth_dependency(api_key: Optional[str] = None):
-    api_key_configured = api_key is not None
+    # Store the initial key as fallback, but prefer the live env var
+    initial_api_key = api_key
 
     async def combined_dependency(
         request: Request,
         api_key_header_value: Optional[str] = Security(api_key_header),
         token: Optional[str] = Depends(oauth2_scheme_optional),
     ):
+        # Re-read env var on each request to support hot-reload
+        current_api_key = os.environ.get("LIGHTRAG_API_KEY") or initial_api_key
+        api_key_configured = current_api_key is not None
+
         # ... (whitelist check unchanged) ...
 
-        if api_key_configured and api_key_header_value and api_key_header_value == api_key:
+        if api_key_configured and api_key_header_value and api_key_header_value == current_api_key:
             return  # API key validation successful
 
         # ... (rest of auth logic unchanged) ...
 
     return combined_dependency
```

### Patch for `lightrag/api/lightrag_server.py`

No change needed. `create_app()` still passes `api_key` to `get_combined_auth_dependency()` — it just becomes the fallback if the env var is unset.

## Rotation flow after patching

```bash
# No restart needed, but the new value must be written inside the running server process.
# Note: `docker exec <container> sh -c 'export LIGHTRAG_API_KEY=...'` only sets the variable
# in a short-lived shell; it does not change the environment of the running server.

# Use the optional admin endpoint (see below) to update os.environ in-process:
curl -X POST http://localhost:9621/admin/rotate-key \
  -H "X-API-Key: $OLD_KEY" \
  -H "Content-Type: application/json" \
  -d '{"new_key": "<new-token>"}'
```

### Optional: Admin endpoint for key rotation

A running process's environment can only be changed from inside that process, so the patched server needs an in-process way to set the new key. A small admin endpoint is one option:

```python
# Add to lightrag/api/lightrag_server.py in create_app()
# (`combined_auth` is assumed to be the dependency returned by get_combined_auth_dependency(api_key))

@app.post("/admin/rotate-key")
async def rotate_key(
    request: Request,
    body: dict,
    _=Depends(combined_auth),  # must authenticate with current key
):
    new_key = body.get("new_key")
    # Reject missing, non-string, or short keys
    if not isinstance(new_key, str) or len(new_key) < 32:
        raise HTTPException(400, "new_key must be at least 32 characters")
    os.environ["LIGHTRAG_API_KEY"] = new_key
    return {"status": "rotated"}
```

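This is self-securing: you need the current key to set a new one. The new value lives only in the memory of the process that handled the request, so it is lost on restart and is not shared across multiple worker processes.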

## Performance Impact

`os.environ.get()` is an in-memory mapping lookup in CPython (plus a small encode/decode step for the key and value), so its per-request cost is negligible next to request handling. No caching or locking is needed.
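
To check the cost on a given machine rather than take it on faith, a quick micro-benchmark with `timeit` is enough. This is a rough sketch: the variable name `DEMO_API_KEY` is arbitrary, and the result will vary with hardware and Python version.

```python
import os
import timeit

# Use a throwaway variable so the measurement does not depend on real config.
os.environ.setdefault("DEMO_API_KEY", "x" * 64)

runs = 200_000
total_seconds = timeit.timeit(
    lambda: os.environ.get("DEMO_API_KEY"), number=runs
)
per_call_ns = total_seconds / runs * 1e9
print(f"os.environ.get: ~{per_call_ns:.0f} ns per call")
```

The figure includes the overhead of calling the lambda, so it slightly overstates the cost of the lookup itself.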

## Contribution Path

This is a small, backwards-compatible change suitable for an upstream PR to [HKUDS/LightRAG](https://github.com/HKUDS/LightRAG). If upstream doesn't accept it, it can be monkey-patched in `lightrag-memgraph` or applied as a fork patch.
120 changes: 120 additions & 0 deletions unstructured2graph/spec/sso-shared-token.md
@@ -0,0 +1,120 @@
# Shared Token Auth (Quick & Dirty)

**Status:** Proposal
**Date:** 2026-03-31

## Goal

Minimal auth that protects LightRAG WebUI and API with a single shared token. No IdP, no user accounts, no infrastructure changes. Something you can set up in 10 minutes.

## How It Works

LightRAG already supports API key auth natively. The flow:

```
User (browser/SDK)
    |
    |  X-API-Key: <shared-token>
    v
LightRAG Server (port 9621)
    |
    |  bolt:// (MEMGRAPH_USER / MEMGRAPH_PASSWORD)
    v
Memgraph (port 7687)
```

1. Generate a random token (e.g., `openssl rand -hex 32`)
2. Set `LIGHTRAG_API_KEY=<token>` on the LightRAG server
3. Share the token with authorized users
4. Users pass it via `X-API-Key` header (API) or configure it in the WebUI

Memgraph stays on username/password over Bolt — no changes needed there.
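
For step 1, the token can also be generated from Python instead of `openssl`. A minimal sketch using the standard library's `secrets` module follows. The function name is illustrative.

```python
import secrets


def generate_shared_token(num_bytes: int = 32) -> str:
    """Return a random hex token from a cryptographically secure source."""
    # 32 random bytes become 64 hex characters, like `openssl rand -hex 32`.
    return secrets.token_hex(num_bytes)


token = generate_shared_token()
```

The resulting value can be used directly as `LIGHTRAG_API_KEY`.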

## Configuration

### LightRAG Server

```bash
# .env or environment
LIGHTRAG_API_KEY=a3f8c1...your-random-token
```

Or via CLI:

```bash
lightrag-server --key a3f8c1...your-random-token
```

### Python SDK (unstructured2graph callers)

```python
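import os

import httpx

# Read the shared token from the environment rather than hard-coding it
headers = {"X-API-Key": os.environ["LIGHTRAG_API_KEY"]}
response = httpx.post("http://localhost:9621/query", headers=headers, json={"query": "..."})
response.raise_for_status()  # raises on 4xx/5xx, e.g. 401 for a bad key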
```

### WebUI

LightRAG's built-in WebUI checks for the API key. When `LIGHTRAG_API_KEY` is set, unauthenticated requests get a 401. The WebUI should prompt for the key — but note the caveat below.

### curl

```bash
curl -H "X-API-Key: $LIGHTRAG_API_KEY" -H "Content-Type: application/json" http://localhost:9621/query -d '{"query": "..."}'
```

## Caveats

1. **WebUI guest fallback** — LightRAG's WebUI falls back to guest access if no `AUTH_ACCOUNTS` are configured, even when `LIGHTRAG_API_KEY` is set. To fully lock down the WebUI, you may also need to set `AUTH_ACCOUNTS` or put a reverse proxy in front.

2. **No identity** — everyone shares the same token. You can't distinguish who did what in logs or enforce per-user permissions.

3. **No expiry** — the token lives until you rotate it manually. If it leaks, you must regenerate and redistribute.

4. **No revocation per user** — rotating the token locks out everyone; you can't revoke one person's access.

5. **HTTP in cleartext**: without TLS, the token travels in plaintext. In production, either:
   - Enable LightRAG's built-in SSL (`--ssl --ssl-certfile cert.pem --ssl-keyfile key.pem`)
   - Put it behind a TLS-terminating reverse proxy (Nginx, Caddy)

6. **Memgraph and Memgraph Lab are separate** — this only covers LightRAG. Memgraph Bolt stays on its own username/password. Memgraph Lab has no shared-token mechanism (it uses SSO or its own login).

## Token Rotation

### Current limitation: restart required

The API key **cannot be hot-reloaded** in upstream LightRAG. It is read once at startup and handed down through three layers, the last of which captures it in a closure:

1. `config.py` — `parse_args()` reads `LIGHTRAG_API_KEY` env var once into `global_args.key`
2. `lightrag_server.py` — `create_app()` captures it into a local `api_key` via `os.getenv()` once
3. `utils_api.py` — `get_combined_auth_dependency(api_key)` stores it in a closure compared on every request

Even mutating `os.environ` at runtime has no effect — the closure still holds the old value.

### Rotation without hot-reload

```bash
# 1. Generate new token
NEW_TOKEN=$(openssl rand -hex 32)

# 2. Update LightRAG server env and restart
export LIGHTRAG_API_KEY=$NEW_TOKEN
# restart lightrag-server

# 3. Distribute new token to users
```

### Proposed patch: hot-reloadable API key

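A small change to `lightrag/api/utils_api.py` makes the server re-read `LIGHTRAG_API_KEY` from its own process environment on each request, so the key can be rotated in-process (for example through a small admin endpoint) without restarting the server. See `sso-shared-token-patch.md` for the full patch.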

## When to Outgrow This

Move to the full SSO design (`sso-design.md`) when any of these become true:

- You need to know *who* is making requests (audit trail)
- Different users need different permissions (RBAC)
- You have more than ~10 users sharing the token
- Compliance requires individual credentials
- You need to revoke a single user's access without disrupting others