Data Model: Archive Meeting RAG

Created: 2025-11-02
Feature: Archive Meeting Retrieval & Grounded Interpretation RAG

Entities

MeetingRecord

Description: Represents an archived meeting log entry in JSON format.

Fields:

id (string, required): Unique meeting identifier
date (datetime, required): Meeting date and time
participants (list[string], required): List of participant names or IDs
transcript (string, required): Full meeting transcript text
decisions (list[string], optional): List of decisions made in meeting
tags (list[string], optional): Categorization tags for meeting

Validation Rules:

id must be unique across all meeting records
date must be valid ISO 8601 datetime
transcript must not be empty
JSON structure must be valid and parseable
SHA-256 hash computed at ingestion time (FR-011)
PII detection and redaction applied before indexing (FR-012)

Relationships:

One-to-many: One MeetingRecord can have multiple chunks in EmbeddingIndex
One-to-many: One MeetingRecord can be cited in multiple RAGQuery results

State Transitions:

Ingested: JSON file read and validated
Hashed: SHA-256 hash computed and stored
PII-scanned: Personal information detected and redacted
Indexed: Content chunked and embedded in FAISS index
Available: Ready for retrieval in RAG queries

RAGQuery

Description: Represents a user query and its processed response with citations.

Fields:

query_id (string, required): Unique query identifier (UUID)
user_input (string, required): Original user question
timestamp (datetime, required): Query execution timestamp
retrieved_chunks (list[dict], required): Retrieved document chunks with metadata
- Each chunk: {meeting_id, text, score, chunk_index}
output (string, required): Generated answer text
citations (list[dict], required): Verifiable citations in format [meeting_id | date | speaker]
- Each citation: {meeting_id, date, speaker, excerpt}
model_version (string, required): Version of LLM used for generation
embedding_version (string, required): Version of embedding model used
user_id (string, optional): SSO user identifier (from FR-013)
evidence_found (boolean, required): Whether credible evidence was found (FR-008)
audit_log_path (string, required): Path to immutable audit log entry (FR-005)

Validation Rules:

query_id must be unique UUID
citations must match format [meeting_id | date | speaker] (constitution principle II)
citations must have traceable source in retrieved_chunks (constitution principle I)
If evidence_found is false, output must be "No evidence found" (FR-008)
All citations must reference meeting_id present in retrieved_chunks

Relationships:

Many-to-many: One RAGQuery cites multiple MeetingRecord chunks
One-to-one: One RAGQuery has one audit log entry

State Transitions:

Submitted: User query received
Embedded: Query converted to embedding vector
Retrieved: Similar chunks retrieved from FAISS index
Generated: Answer generated from retrieved context
Cited: Citations extracted and formatted
Logged: Audit record created and persisted
Complete: Response returned to user

EmbeddingIndex

Description: Represents the FAISS vector index containing embedded meeting document chunks.

Fields:

index_id (string, required): Unique index identifier
version_hash (string, required): SHA-256 hash of index configuration and model versions
embedding_model (string, required): Name and version of embedding model used
embedding_dimension (integer, required): Dimension of embedding vectors
index_type (string, required): FAISS index type (e.g., IndexFlatIP, IndexIVFFlat)
document_vectors (FAISS index, required): FAISS index containing document embeddings
metadata (dict, required): Mapping from vector index to document metadata
- Format: {vector_index: {meeting_id, chunk_index, text, date, participants}}
total_documents (integer, required): Total number of document chunks indexed
created_at (datetime, required): Index creation timestamp
index_path (string, required): File system path to FAISS index file

Validation Rules:

version_hash must include embedding model version, FAISS index type, and configuration
metadata must align with document_vectors (one entry per vector)
Index file must be deterministic for same input and seed (constitution principle III)
Index must be reproducible given same meeting JSON and model versions

Relationships:

One-to-many: One EmbeddingIndex contains multiple MeetingRecord chunks
One-to-many: One EmbeddingIndex serves multiple RAGQuery retrievals

State Transitions:

Initialized: Index structure created
Embedded: Meeting documents chunked and embedded
Built: FAISS index constructed with vectors
Versioned: Version hash computed and stored
Persisted: Index saved to disk
Loaded: Index loaded from disk for retrieval

EvaluationCase

Description: Represents a benchmark test case for evaluating RAG system performance.

Fields:

case_id (string, required): Unique evaluation case identifier
prompt (string, required): Test query prompt
ground_truth (string, required): Expected answer content
expected_citations (list[dict], required): Expected citations in format [meeting_id | date | speaker]
- Each citation: {meeting_id, date, speaker, excerpt}
evaluation_metrics (dict, required): Scoring results
- Keys: citation_accuracy, factuality, hallucination_count, retrieval_precision
run_timestamp (datetime, required): Evaluation run timestamp
model_version (string, required): LLM version used for evaluation
embedding_version (string, required): Embedding model version used

Validation Rules:

expected_citations must match format [meeting_id | date | speaker] (constitution principle II)
ground_truth must be non-empty
evaluation_metrics must include citation_accuracy (≥90% per SC-001)
evaluation_metrics.hallucination_count must be 0 (SC-002)

Relationships:

Many-to-many: One EvaluationCase validates multiple MeetingRecord citations
One-to-one: One EvaluationCase has one evaluation result set

State Transitions:

Defined: Test case created with prompt and ground truth
Executed: Query run against RAG system
Scored: Metrics computed (citation accuracy, factuality, etc.)
Validated: Results compared against ground truth
Reported: Results included in evaluation report

Data Flow

Ingestion Flow

Meeting JSON Read → MeetingRecord created
Validation → JSON structure and required fields validated
Hashing → SHA-256 hash computed (FR-011)
PII Detection → spaCy NER detects personal information (FR-012)
PII Redaction → Personal information redacted before indexing
Chunking → Transcript split into chunks (overlapping windows)
Embedding → Chunks converted to embeddings using sentence-transformers
Indexing → Embeddings added to FAISS index in EmbeddingIndex
Metadata Storage → Chunk metadata stored in index metadata mapping

Query Flow

Query Submission → RAGQuery created with user_input
Query Embedding → User query converted to embedding vector
FAISS Retrieval → Similar chunks retrieved from EmbeddingIndex
Context Assembly → Retrieved chunks assembled with metadata
LLM Generation → Answer generated from retrieved context only
Citation Extraction → Citations formatted as [meeting_id | date | speaker]
Evidence Check → If no credible evidence, output set to "No evidence found"
Audit Logging → Immutable audit record created with full query details
Response Return → RAGQuery completed with output and citations

Audit Flow

Query Execution → RAGQuery state transitions to Submitted
Processing → Query processed through embedding, retrieval, generation
Log Creation → Audit log entry created with:
- Query ID, user ID (from SSO), timestamp
- User input, retrieved sources, model versions
- Output, citations, evidence_found flag
Persistence → Structured JSON log written to audit_logs/ directory
Retention → Log retained for 3 years per FR-014

Validation Rules Summary

MeetingRecord

✅ Unique ID validation
✅ Date format validation (ISO 8601)
✅ Transcript non-empty validation
✅ JSON structure validation
✅ SHA-256 hash computation
✅ PII detection and redaction

RAGQuery

✅ Citation format validation: [meeting_id | date | speaker]
✅ Citation traceability: citations must reference retrieved_chunks
✅ Evidence check: "No evidence found" when evidence_found is false
✅ Citation-source alignment: citations must match retrieved chunks

EmbeddingIndex

✅ Version hash includes model version and configuration
✅ Metadata alignment with document vectors
✅ Deterministic index construction with fixed seed
✅ Reproducibility validation

EvaluationCase

✅ Citation format validation
✅ Ground truth non-empty validation
✅ Citation accuracy ≥90% (SC-001)
✅ Hallucination count = 0 (SC-002)

Data Integrity

Tamper Detection

SHA-256 hash computed for each MeetingRecord at ingestion (FR-011)
Hash stored in EmbeddingIndex metadata
Hash verified on index access
Hash mismatches logged as security events

Reproducibility

Fixed seeds for all random operations (embeddings, LLM inference, FAISS)
Model versions pinned and tracked
Index construction deterministic
Same input + data state → identical output (constitution principle III)

Auditability

Immutable audit logs for every RAGQuery (constitution principle V)
Full provenance: query, sources, model versions, output
User ID from SSO included (FR-013)
Logs retained for 3 years (FR-014)
Structured JSON format enables parsing and analysis

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Model: Archive Meeting RAG

Entities

MeetingRecord

RAGQuery

EmbeddingIndex

EvaluationCase

Data Flow

Ingestion Flow

Query Flow

Audit Flow

Validation Rules Summary

MeetingRecord

RAGQuery

EmbeddingIndex

EvaluationCase

Data Integrity

Tamper Detection

Reproducibility

Auditability

FilesExpand file tree

data-model.md

Latest commit

History

data-model.md

File metadata and controls

Data Model: Archive Meeting RAG

Entities

MeetingRecord

RAGQuery

EmbeddingIndex

EvaluationCase

Data Flow

Ingestion Flow

Query Flow

Audit Flow

Validation Rules Summary

MeetingRecord

RAGQuery

EmbeddingIndex

EvaluationCase

Data Integrity

Tamper Detection

Reproducibility

Auditability