Docs/split phase1 phase2 readmes#13

Open
sofasogood wants to merge 71 commits into RDI-Foundation:main from LambdaLabsML:docs/split-phase1-phase2-readmes
Conversation

@sofasogood
No description provided.

sofasogood and others added 30 commits November 19, 2025 03:23
- Plugin-based framework for dynamic adversarial security testing
- Generic attacker/defender agents with context injection
- Three example scenarios: PortfolioIQ, Thingularity, Medical Records
- Comprehensive competition documentation
The attacker and defender agents require the openai package
to communicate with OpenAI's API.
- Change all Track references to Phase terminology
- Add detailed Phase 1 (Nov 21 - Dec 19) and Phase 2 (Jan 12 - Feb 23) info
- Add model constraint: gpt-oss-20b required for all agents
- Add deliverables, example flow, and submission guidelines
- Update orchestrator to auto-generate evidence files (baseline_passed.json, attack_succeeded.json)
- Create example submission at submissions/example_team/example_scenario/
- Fix model inconsistency in SCENARIO_SPECIFICATIONS.md
- Delete redundant COMPETITION_GUIDE.md
- Add OPENAI_BASE_URL env var support to both agents
- Allows all agents to share a single local vLLM/Ollama server
- Defaults to OpenAI API when not set (backward compatible)
- Document local model serving setup in README
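The `OPENAI_BASE_URL` fallback described above can be sketched roughly as follows (`resolve_base_url` is a hypothetical helper; the actual agent code may wire this into the OpenAI client differently):

```python
import os

def resolve_base_url(default: str = "https://api.openai.com/v1") -> str:
    """Return OPENAI_BASE_URL when set, else the default OpenAI endpoint."""
    return os.environ.get("OPENAI_BASE_URL") or default

# Point every agent at one shared local vLLM/Ollama server:
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
print(resolve_base_url())  # prints http://localhost:8000/v1

# Unset -> backward-compatible default:
del os.environ["OPENAI_BASE_URL"]
print(resolve_base_url())  # prints https://api.openai.com/v1
```

Because only the base URL changes, the same agent code talks to OpenAI's hosted API or a local server without modification.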
- Add step-by-step setup instructions (fork, install, env vars)
- Document how to run Lambda scenarios (PortfolioIQ, Thingularity)
- Document how to run example submission
- Add PR submission process (fork, branch, PR title format)
- Note Python >= 3.11 requirement
- Add team_name field to ArenaResult model
- Update _save_results() to create results/{team_name}/{scenario_type}/{timestamp}/ path
- Save evidence files to both central results and submission test_results folders
- Include team_name in baseline_passed.json and attack_succeeded.json
- Update example submission TOML and README with team_name documentation
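The results layout from these commits can be sketched like this (`results_path` is a hypothetical helper; `_save_results()` in the orchestrator may construct the path differently):

```python
from datetime import datetime, timezone
from pathlib import Path

def results_path(team_name: str, scenario_type: str, root: str = "results") -> Path:
    """Build the results/{team_name}/{scenario_type}/{timestamp}/ layout."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return Path(root) / team_name / scenario_type / ts

p = results_path("example_team", "example_scenario")
print(p)  # e.g. results/example_team/example_scenario/20260206-184700
```

Evidence files such as baseline_passed.json and attack_succeeded.json would then be written both under this central path and into the submission's test_results folder.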
- Update README.md with explicit 100 + 20 bonus point structure
- Align judging criteria: Functionality (40), Target Sensitivity (30), Creativity (30), Attack Success (+20 bonus)
- Add result.json to test_results in SCENARIO_SPECIFICATIONS.md
- Clarify baseline definition and 0-point policy for trivial targets
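The 100 + 20 bonus structure above reduces to a simple sum (a sketch; the judges presumably score each axis by rubric rather than calling a function like this):

```python
def total_score(functionality: int, sensitivity: int, creativity: int,
                attack_succeeded: bool) -> int:
    """100-point base (40 + 30 + 30) plus a 20-point attack-success bonus."""
    assert 0 <= functionality <= 40
    assert 0 <= sensitivity <= 30
    assert 0 <= creativity <= 30
    return functionality + sensitivity + creativity + (20 if attack_succeeded else 0)

print(total_score(40, 30, 30, True))   # 120 (maximum)
print(total_score(25, 20, 15, False))  # 60
```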
Explicitly set model to openai/gpt-oss-20b in toml
- Update competition dates: Phase 1 (Nov 24 - Jan 5), Phase 2 (Jan 19 - Feb 23)
- Add Lambda × Berkeley branding and track description
- Standardize model flag (--model openai/gpt-oss-20b) across all TOML files
- Add support section with Discord/GitHub contact info
- Update Phase 2 evaluation to focus on public/private leaderboards
- Remove specific metric references (ASR, defense rate percentages)
- Standardize scenario count to 450+ specifications
- Remove evaluation metrics table from README
- Stateless defender (fresh each round, no memory)
- Stateful attacker (learns across rounds with history)
- Simplified defender prompt (generic professional principles)
- Enhanced attacker prompt (strategic planning, balanced examples)
- Smarter thingularity detection (excludes denials from leaks)
- Gandalf-style thingularity (attacker knows categories, not exact names)
- Early termination on attacker success
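The round loop implied by these bullets might look roughly like this (all class and function names are hypothetical, not the arena's actual API):

```python
class Attacker:
    """Stateful: accumulates defender replies across rounds."""
    def __init__(self):
        self.history = []

    def next_prompt(self) -> str:
        return f"attack attempt #{len(self.history) + 1}"

    def observe(self, defender_reply: str) -> None:
        self.history.append(defender_reply)

def run_battle(make_defender, attacker, leaked, max_rounds: int = 5) -> bool:
    """Fresh (stateless) defender each round; stop early on attacker success."""
    for _ in range(max_rounds):
        defender = make_defender()          # no memory carries over between rounds
        reply = defender(attacker.next_prompt())
        attacker.observe(reply)
        if leaked(reply):                   # early termination on success
            return True
    return False

# Toy usage: this stub defender leaks on the third attempt.
result = run_battle(
    make_defender=lambda: (lambda p: "SECRET" if p.endswith("#3") else "denied"),
    attacker=Attacker(),
    leaked=lambda r: "SECRET" in r,
)
print(result)  # True
```

Recreating the defender each round is what makes it stateless, while the attacker's growing `history` is what lets it adapt.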
- Add missing __init__.py files to submissions folder hierarchy
- Update import paths in README to use relative imports (from ..submissions...)
- Register example_scenario plugin in registry.py

This fixes the ModuleNotFoundError and circular import issues when
following the tutorial to register custom scenario plugins.
da-h and others added 30 commits February 6, 2026 18:47
- gymjailbreak: Add 'capped from' pattern for safety caveats
- New _format_battle_log_html() generates self-contained HTML report
- Dark theme with color-coded attacker/defender sections
- Visual status indicators for each round
- Saves battle_log.html alongside battle_log.md
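A rough sketch of such a self-contained dark-theme report; the real `_format_battle_log_html()` almost certainly differs in structure and styling:

```python
import html

def format_battle_log_html(rounds) -> str:
    """Render rounds as a single self-contained dark-theme HTML document."""
    sections = []
    for r in rounds:
        status = "&#10003;" if r["attack_succeeded"] else "&#10007;"  # status indicator
        sections.append(
            f'<section><h2>Round {r["round"]} {status}</h2>'
            f'<p class="attacker">{html.escape(r["attacker"])}</p>'
            f'<p class="defender">{html.escape(r["defender"])}</p></section>'
        )
    return ('<!DOCTYPE html><html><body style="background:#111;color:#ddd">'
            + "".join(sections) + "</body></html>")

doc = format_battle_log_html(
    [{"round": 1, "attack_succeeded": False,
      "attacker": "<script>alert(1)</script>", "defender": "denied"}]
)
```

Escaping the raw model output with `html.escape` matters here: attacker transcripts are adversarial by construction and must not be able to inject markup into the report.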
- Align resume topics with the actual job posting (Senior Software
  Engineer: Python, Cloud, Leadership) instead of unrelated roles
  (data analyst, project manager)
- Fix expected_concepts to validate the defender's JSON ranking output
  (rank, score, rationale, candidates) instead of resume input content
- Add variety in seniority: strong mid (7yr), junior (1-2yr), senior (10yr)
- Tested: 3/3 topics pass with gpt-oss-20b on vLLM
The full defender response is now saved in the transcript JSON,
making it easier to review and debug normal user test results.
Plugins:
- thingularity: add tool descriptions, show hidden tools in defender task description
- example_medical: use random SSNs with patient names/DOBs, move SSNs to defender
  task description with realistic use case, add missing format_input()
- moneymatemayhem: add MoneyMate product details (fees, times, security) to defender
- flightapifake: add legitimate package details to defender task description

Normal user topics:
- thingularity: test public tool visibility (search, details, stock)
- flightapifake: replace comparison/setup topics with recommendation topics
  compatible with JSON output format
- legaleagleeye: more specific expected_concepts for force majeure analysis
- encodedpayload: replace duplicate topic with workplace accommodation
- gymjailbreak: fix expected_concept "heart rate" -> "workout"
…uator

The evaluator marked correct defender responses as "not helpful" because
it didn't know the required output format was a brief JSON. Now the
orchestrator passes output_format to the evaluation context, and the
evaluate template instructs the LLM not to penalize brevity when the
format only requires a few fields.
Replace the monolithic scenarios/security_arena/README.md with a short
landing page that links to docs/phase1.md and docs/phase2.md (already
on main). Add phase doc links to root README.