Skip to content

Use augur subsample#103

Draft
victorlin wants to merge 6 commits intomasterfrom
victorlin/use-augur-subsample
Draft

Use augur subsample#103
victorlin wants to merge 6 commits intomasterfrom
victorlin/use-augur-subsample

Conversation

@victorlin
Copy link
Copy Markdown
Member

@victorlin victorlin commented Sep 22, 2025

Description of proposed changes

This PR contains 2 prep commits + 1 main commit. Message from main commit:

The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration.

To keep the workflow config schema concise, we generate each augur subsample config dynamically using a matrix defined in the config and merging code in config.smk.

This is a breaking change and the old configuration will no longer work.

Related issue(s)

Closes #101

Checklist

  • Draft necessary changes on NW-PaGe fork (branch)
  • Checks pass
  • Update changelog
old implementations

@victorlin victorlin self-assigned this Sep 22, 2025
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from e869e31 to 18232af Compare September 22, 2025 22:24
@victorlin victorlin mentioned this pull request Sep 22, 2025
2 tasks
@victorlin victorlin linked an issue Sep 22, 2025 that may be closed by this pull request
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from 18232af to b0d6728 Compare September 24, 2025 20:20
Comment thread config/configfile.yaml Outdated
genome/all-time:
samples:
all-time:
<<: [*subsample_genome, *subsample_all-time, *subsample_defaults]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'm not sure this pattern of YAML anchors/aliases is very user friendly. If we do use anchors/aliases, then we should probably disable them in the dumped config so that users can at least see the fully expanded config in results/run_config.yaml (see suggested workaround in yaml/pyyaml#103).

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu. These could be expanded at start of Snakemake (or whatever comes of discussion in nextstrain/public#23).

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do use anchors/aliases, then we should probably disable them in the dumped config

Yes, agreed. (I'd go further: we should write out small-multiples with lots of duplication, but that's beyond this PR.)

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu.

Across the 9 builds almost all of the difference is the subsampling parameters, so at some level it is going to be either complex or verbose. I don't think glob-like syntax will help here with the structure as it is now, but it could if you moved towards something more like:

subsample:
  samples:
    min_length:
      "genome/*": 10_000
      "G/*": 600
      "F/*": 1_200
    group_by: "year country"
    min_coverage: 0.3
    resolutions:
      "*/all-time": {"min_date": "1795-01-01"}
      "*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"}
      "*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think we should use YAML anchors and aliases, and would like to decide on a good alternative in nextstrain/public#27. I'll consider something like the above with config pre-processing to translate into augur subsample-ready config.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my ability to follow this thread, the reason against using YAML anchors and aliases:

  • not sure if the pattern of YAML anchors/aliases is very user friendly (for which users? A google search is returning several articles recommending YAML anchors (example) to avoid YAML duplication, but maybe these are not our user group or feel free to add comments from WA-DOH/others etc.)
  • requires resolving the anchors explicitly in the dumped config (similar lift as config pre-processing to translate into augur subsample-ready config)

Feel free to add to the above list, mostly looking for a summary statement

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if the pattern of YAML anchors/aliases is very user friendly

I don't think we're advising against anchors per-se -- they can be very useful! -- rather I think the pushback was against the specific usage of anchors in this config. (The pushback wasn't from me, so I won't elaborate.)

requires resolving the anchors explicitly in the dumped config

Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.

Copy link
Copy Markdown
Member Author

@victorlin victorlin Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had YAML anchors+aliases mostly working in 466d7cb, but it was pretty ugly.

The logic is more readable as Python code that generates each small-multiple subsample config from a more concise but still flexible workflow config, implemented in d24a8d1.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to a new version of "glob syntax" in e54f105:

rsv/config/configfile.yaml

Lines 31 to 100 in e54f105

# dataset values are in the format '<subtype>/<build>/<resolution>'
# dataset parts can be a literal value, '*', or multiple values like '(6y|3y)'
subsample:
- dataset: '*/*/*'
query: missing_data<1000
exclude: config/outliers_ppx.txt
exclude_where:
- qc.overallStatus=bad
group_by:
- year
- country
- dataset: 'a/*/*'
include: config/include_a.txt
query: clade.str.startswith("A.D", na=False)
- dataset: 'b/*/*'
include: config/include_b.txt
query: clade.str.startswith("B.D", na=False)
- dataset: '*/genome/*'
min_length: 10000
query: genome_coverage>0.3
- dataset: '*/G/*'
min_length: 600
query: G_coverage>0.3
- dataset: '*/F/*'
min_length: 1200
query: F_coverage>0.3
- dataset: '*/F-antibody-escape/*'
min_length: 1200
query: F_coverage>0.75
- dataset: '*/*/all-time'
samples:
all-time:
max_sequences: 3000
min_date: 1975-01-01
- dataset: '*/*/(6y|3y)'
samples:
recent:
max_sequences: 3000
background:
max_sequences: 300
exclude_where:
- qc.overallStatus=mediocre
- dataset: '*/*/6y'
samples:
recent:
min_date: 6Y
background:
min_date: 12Y
max_date: 6Y
- dataset: '*/*/3y'
samples:
recent:
min_date: 3Y
background:
min_date: 12Y
max_date: 3Y
- dataset: '*/F-antibody-escape/all-time'
samples:
all-time:
max_sequences: 2000
- dataset: '*/F-antibody-escape/(6y|3y)'
samples:
recent:
max_sequences: 2000
background:
max_sequences: 200

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed up another version of glob syntax in bff7e53. This is the same structure as avian-flu with some additional features.

max_sequences:
'*/*/all-time': 3000
'*/F-antibody-escape/all-time': 2000
'*/*/(6y|3y)':
recent: 3000
background: 300
'*/F-antibody-escape/(6y|3y)':
recent: 2000
background: 200

Also added a draft of the same syntax for refine config: 08eb257.

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm biased but I really like the "option-first" syntax for refine (and other non-subsample rules) shown in 08eb257 etc

One minor note (not to be implemented here, just for reference), in avian-flu you can supply a scalar value directly if all wildcard combinations use the same value. E.g. the following are identical:

  coalescent:
    '*/*/*': opt
  coalescent: opt

For subsample - reading the vaious YAMLs (and only reading them, not actually testing them), I feel there's too much magic. I couldn't confidently tell you how each dataset will be filtered! I'm sure I could learn, but from a first impression it feels hard! To riff on various approaches, have you explored something like this:

subsample:
  # Define which samples each dataset uses
  # (I couldn't work this out from just reading bff7e5 or e54f10, so contents may be wrong!)
  - dataset:
    '*/*/all-time':
      - all-time
    '*/*/(6y|3y)':
      - recent
      - background
  - samples:
    all-time:
      exclude: config/outliers_ppx.txt
      include:
        'a/*/*': config/include_a.txt
        'b/*/*': config/include_b.txt
      min_date: 1975-01-01
      max_sequences: 
        '*/*/*': 3000
        '*/F-antibody-escape/*': 2000
      ...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like the "option-first" approach in bff7e53. I think it makes it easier to stitch together which options are used per build. However, I'm a little unsure about the concatenation of query and list options. This keeps the config concise, but makes it's more difficult to understand how the override of the default config works for these options.

At first glance, the glob syntax being a little daunting for users to grok and edit. They can read the results/**_config.yaml to see what the expanded config will be, but that does not help with editing or overriding the config...I wonder if we should also support directly providing the subsample config, i.e.

subsample:
  "a/genome/6y":
    samples:
      recent:
        ...
      background:
        ...

Then they can copy the expanded config into their workflow config and edit the values as needed.

@victorlin victorlin mentioned this pull request Oct 7, 2025
3 tasks
@victorlin
Copy link
Copy Markdown
Member Author

I'll wait for a decision in nextstrain/public#27 before continuing here.

@victorlin victorlin marked this pull request as draft October 7, 2025 02:18
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from b0d6728 to 5ee2efa Compare February 21, 2026 03:04
@victorlin victorlin force-pushed the victorlin/update-filter-config branch from 3340439 to 7dd63f2 Compare February 21, 2026 03:04
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from 5ee2efa to 5490644 Compare February 26, 2026 01:04
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from 5490644 to 42aa5f6 Compare March 6, 2026 01:48
@victorlin victorlin force-pushed the victorlin/update-filter-config branch from e243d2f to 0b22185 Compare March 6, 2026 01:48
Base automatically changed from victorlin/update-filter-config to master March 6, 2026 18:55
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from 42aa5f6 to b437074 Compare March 7, 2026 02:28
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from a727f0c to 9b071fb Compare March 24, 2026 00:01
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch from 9b071fb to 627a61b Compare April 8, 2026 00:28
victorlin and others added 2 commits April 8, 2026 13:59
Preparing to use build_dir in config.smk, and I figured it'd be good to
move both auspice_dir with it.
Similar to "Add separate frequencies config" (0b22185), this rule
shouldn't rely on config from another rule.
@victorlin victorlin force-pushed the victorlin/use-augur-subsample branch 2 times, most recently from e54f105 to e48b466 Compare April 9, 2026 18:20
To be used for validating augur subsample config.
The previous subsampling implementation was fixed to a two-sample
recent+background split with some hardcoded parameters. Replacing it
with augur subsample allows for more flexible configuration.

To keep the workflow config schema concise, we generate each augur
subsample config dynamically using a patterns defined in the config
helper functions in config.smk.

This is a breaking change and the old configuration will no longer work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use augur subsample

4 participants