FEAT: Introduce IdentifierFilters to allow generic DB queries on identifier…#1557
FEAT: Introduce IdentifierFilters to allow generic DB queries on identifier…#1557behnam-o wants to merge 13 commits intomicrosoft:mainfrom
Conversation
pyrit/memory/identifier_filters.py
Outdated
| class AttackIdentifierFilter(IdentifierFilter[AttackIdentifierProperty]): | ||
| """ | ||
| Immutable filter definition for matching JSON-backed attack identifier properties. | ||
|
|
||
| Args: | ||
| property_path: The JSON path of the property to filter on. | ||
| value_to_match: The value to match against the property. | ||
| partial_match: Whether to allow partial matches (default: False). | ||
| """ | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class TargetIdentifierFilter(IdentifierFilter[TargetIdentifierProperty]): | ||
| """Immutable filter definition for matching JSON-backed target identifier properties.""" | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ConverterIdentifierFilter(IdentifierFilter[ConverterIdentifierProperty]): | ||
| """Immutable filter definition for matching JSON-backed converter identifier properties.""" | ||
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ScorerIdentifierFilter(IdentifierFilter[ScorerIdentifierProperty]): | ||
| """Immutable filter definition for matching JSON-backed scorer identifier properties.""" |
There was a problem hiding this comment.
Here you have a _StrEnum base, a Generic[T] ABC IdentifierFilter, then 4 *Property enums, and 4 *Filter subclasses, but every filter subclass is an empty body, they add zero behavior. I think the IdentifierFilter type hierarchy is unnecessary.
The Generic[T] bound gives you type-time safety on which *Property enum you pass, but at runtime __post_init__ immediately calls str(self.property_path), erasing the enum type entirely. So a caller can pass an any string and it works fine.
For 4 empty subclasses are only different in the type parameter, I think this is a little bit of over-engineering. You're essentially encoding which JSON column to query in the type of the filter, but the actual column is still chosen by the caller at the call site (e.g. json_column=ScoreEntry.scorer_class_identifier). The type hierarchy doesn't prevent users misusing it, nothing stops you from passing a ScorerIdentifierFilter with json_column=AttackResultEntry.atomic_attack_identifier.
I think a single IdentifierFilter dataclass with a flat property_path: str would be simpler, equally extensible, and more transparent about what the runtime actually does.
Here is what I propose:
@dataclass(frozen=True)
class IdentifierFilter:
property_path: str
value_to_match: str
partial_match: bool = FalseThat's it. The *Property enums are fine to keep as constants (or even a flat module-level class IdentifierPaths namespace), but there's no need for them to constrain the filter type generically. The column binding already happens at the call site in memory_interface.py, so the filter is purely about what path, what value, plus the exact or partial condition.
This removes most of the classes you proposed here and is equally type safe because the real safety comes from which json_column you pass, not the filter type, It is also extensible since new properties are just new enum values and no new filter class is needed.
And I think for _get_condition_json_array_match, you should add a sub_path: str | None = None parameter (that mimics _get_unique_json_array_values), and use it instead of hardcoding '$.class_name'.
Here is an example of how this would work:
results = memory.get_attack_results(
identifier_filter=IdentifierFilter(
property_path=AttackIdentifierProperty.ATTACK_CLASS_NAME,
value_to_match="Crescendo",
partial_match=True,
),
)
results = memory.get_scores(
identifier_filter=IdentifierFilter(
property_path=ScorerIdentifierProperty.CLASS_NAME,
value_to_match="SelfAskLikertScorer",
),
)Then inside memory_interface.py, the method signature pins which column the filter applies to:
def get_scores(
self,
*,
scorer_identifier_filter: IdentifierFilter | None = None,
...
) -> Sequence[Score]:
if scorer_identifier_filter:
conditions.append(
self._get_condition_json_property_match(
# this is where the column is bound ---
json_column=ScoreEntry.scorer_class_identifier,
# ---
property_path=scorer_identifier_filter.property_path,
value_to_match=scorer_identifier_filter.value_to_match,
partial_match=scorer_identifier_filter.partial_match,
)
)There was a problem hiding this comment.
The main reason I introduced all those ***IdentifierProperty.XYZ was to limit what path on an identifier can be queried ... I agree it's not a bad idea to keep it free form, especially since our identifiers are constructed in somewhat of a free-form manner where keys are arbitrary strings.
maybe down the road, we want to have our identifiers statically typed, and then it might make sense to also have filters enforce that.
for now, made the property_path to allow a free-form string
There was a problem hiding this comment.
@bashirpartovi @hannahwestra25 @ValbuenaVC Thanks for your comments, I think you all touched on this free-form vs. restricted property_path pattern, and I agree it is a bit of an over-engineering with no "real" benefit. Please let me know if we should have any follow-ups on this.
ValbuenaVC
left a comment
There was a problem hiding this comment.
This PR (#1451) might be a useful reference since it also handled filtering concerns, although for datasets.
| not_data_type: Optional[str] = None, | ||
| converted_value_sha256: Optional[Sequence[str]] = None, | ||
| attack_identifier_filter: Optional[IdentifierFilter] = None, | ||
| prompt_target_identifier_filter: Optional[IdentifierFilter] = None, |
There was a problem hiding this comment.
i think we would also want to filter by converter identifier
There was a problem hiding this comment.
Right now every identifier-bearing field needs its own dedicated parameter on the MemoryInterface query methods, which means adding a new stored identifier later requires changing public method signatures again. so like this method and get_attack_results, and get_scenario_results could potentially have an infinite amount of filter parameters (not really but you might see my point)
could we switch from one-parameter-per-identifier-field to a generic identifier_filters collection per query method, and map logical filter targets to concrete columns internally. That would make adding new identifier-bearing fields mostly a matter of extending a field map instead of changing every public API again.
| """ | ||
| return self._get_metadata_conditions(prompt_metadata=metadata)[0] | ||
|
|
||
| def _get_condition_json_property_match( |
There was a problem hiding this comment.
could you add doc strings to explain the funciton / parameters
|
|
||
| Args: | ||
| endpoint (str): The endpoint URL substring to filter by (case-insensitive). | ||
| Insert a list of message pieces into the memory storage. |
| assert len(results) == 0 | ||
|
|
||
|
|
||
| def test_get_attack_results_by_attack_class_case_sensitive(sqlite_instance: MemoryInterface): |
There was a problem hiding this comment.
does this fail ? / why are we removing it ? want to make sure we have back compat so this shouldn't fail and it looks like it's not replaced
Uh oh!
There was an error while loading. Please reload this page.