Adding Conformer encoder I/O-styled Transformer encoder #15703
Signed-off-by: taejinp <tango4j@gmail.com>
pre_block_norm: bool = True,
subsampling_factor: int = 4,
pos_emb_max_len: int = 5000,
xscaling: bool = True,
Shall we set default xscaling to False, since we already know that the layernorm will zero-out the effect of xscaling?
Thanks for pointing this out. Setting the default to False.
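To illustrate the reviewer's point, here is a minimal pure-Python sketch. It assumes that `xscaling` multiplies the features by `sqrt(d_model)` and that the scaled features go straight into a layer norm with nothing added in between. Neither assumption is confirmed in this thread. The `layer_norm` helper is hypothetical, has no learned affine parameters, and is not NeMo code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


d_model = 8
x = [0.3, -1.2, 0.7, 2.1, -0.4, 0.0, 1.5, -0.9]

# Assumed behaviour of xscaling=True: multiply the features by sqrt(d_model).
scaled = [v * math.sqrt(d_model) for v in x]

plain_out = layer_norm(x)
scaled_out = layer_norm(scaled)

# The two outputs differ only through the small eps term in the denominator.
max_diff = max(abs(a - b) for a, b in zip(plain_out, scaled_out))
print(f"max abs difference after layer norm: {max_diff:.2e}")
```

If something is added to the features between the scaling and the normalization (for example an absolute positional embedding), the scale changes the relative weight of the two terms and the cancellation no longer holds exactly.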
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
/ok to test 5398604
[🤖]: Hi @tango4j 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Thanks Taejin! Have you had a chance to run training with this? Does it converge similarly with positional embedding enabled, and how do the results compare to previous runs?
@KunalDhawan is using part of this PR to train his MoE Transformer experiments. After a recent survey, it appears to me that the convnet frontend and the positional encoding can affect performance a lot, so I think we need to test these two configurations separately (ablations).
Signed-off-by: Taejin Park <tango4j@gmail.com>
Signed-off-by: Taejin Park <tango4j@gmail.com>
@nithinraok I also found that the Filterbank Stacking feature is as good as dw_striding (the 3-level convnet frontend). It would be better to switch to Filterbank Stacking to make this model low-precision friendly. @ipmedenn @KunalDhawan @stevehuang52
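For readers unfamiliar with the idea, here is a rough sketch of frame stacking as a subsampling frontend: every `factor` consecutive filterbank frames are concatenated into one wider frame, so the sequence gets shorter without any convolution. This is not the `FeatureStacking` module from this PR. The `stack_features` helper and the zero-padding of the tail are assumptions made for illustration.

```python
def stack_features(frames, factor):
    """Concatenate every `factor` consecutive frames into one wider frame.

    frames: list of T feature vectors, each of length F.
    Returns ceil(T / factor) vectors of length F * factor.
    The tail is zero-padded (an assumption of this sketch).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    feat_dim = len(frames[0]) if frames else 0
    pad = (-len(frames)) % factor
    padded = frames + [[0.0] * feat_dim for _ in range(pad)]

    stacked = []
    for start in range(0, len(padded), factor):
        merged = []
        for frame in padded[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


# Five toy frames with two features each, stacked by a factor of 4.
frames = [[float(t), float(t) + 0.5] for t in range(5)]
stacked = stack_features(frames, factor=4)
print(len(stacked), len(stacked[0]))
```

In a real model a linear projection would typically follow to map the `F * factor` features to the model dimension. That step is left out here.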
/ok to test 6725930
Signed-off-by: Taejin Park <tango4j@gmail.com>
/ok to test 56423ba
/ok to test 839fd85
/ok to test 3b3ec63
[🤖]: Hi @tango4j 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
What does this PR do?
Follow-up work to the initial TF encoder PR (#15661). Many NeMo Speech AI maintainers have asked for the new Transformer encoder implementation to include the pre-encode and positional-encoding features of the Conformer encoder.
Aligns the ASR `TransformerEncoder` module with the offline `ConformerEncoder` module surface while preserving Transformer-specific attention parameters and behavior.
Streaming encoder and adapter implementations are not included in this PR. These features will be added later on.
Tested LibriSpeech training with Transformer + CTC (BPE). Added `transformer_ctc_bpe.yaml` with the default configurations.
Collection: ASR
Changelog
- `TransformerEncoder` to inherit NeMo module/export/access mixins and expose Conformer-style input/output type metadata.
- `input_example`, `forward_for_export`, `forward_internal`, `bypass_pre_encode`, `feat_out`, positional encoding, pad mask toggling, stochastic depth, and inter-CTC tensor capture.
- `FeatureStacking` path as `subsampling="feature_stacking"`.
- `FeatureStacking` into the shared ASR subsampling module so it can be imported from `nemo.collections.asr.parts.submodules.subsampling`.
- `self_attention_model` mirroring Conformer's positional-encoding switch: `"rel_pos"` (default), `"abs_pos"`, and `"no_pos"` (`None` is accepted as a YAML alias for `"no_pos"`).
- `score_mod` closure, (c) bias folded as `Q + pos_bias_u`; rel-shift is shared with `ConformerEncoder` via `RelPositionMultiHeadAttention.rel_shift`.
- `self_attention_model` tests, including a `T != n_heads` regression for `pos_bias_{u,v}` broadcasting.
- `torch.no_grad()` so FlexAttention's CPU path doesn't raise under `model.train()`.
Usage
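The rel-shift step named in the changelog can be sketched in plain Python as the usual pad-and-reshape trick for relative-position attention. This is an illustration only, not the code behind `RelPositionMultiHeadAttention.rel_shift`. It works on a single `T x (2T - 1)` score matrix instead of a batched tensor, and the ordering of relative positions (from `+(T-1)` down to `-(T-1)`) is an assumption of the sketch.

```python
def rel_shift(scores):
    """Shift relative-position scores so out[i][j] holds the score for distance i - j.

    scores: T rows, each with 2*T - 1 entries, ordered from relative
    position +(T-1) down to -(T-1) (assumed ordering).
    Returns a T x T matrix.
    """
    t = len(scores)
    pos_len = len(scores[0])

    # 1) Pad one zero column on the left: (T, P) -> (T, P + 1), stored flat.
    flat = []
    for row in scores:
        flat.append(0.0)
        flat.extend(row)

    # 2) View as (P + 1, T) and drop the first row, i.e. the first T items.
    flat = flat[t:]

    # 3) View as (T, P) and keep only the first T columns of each row.
    return [flat[i * pos_len: i * pos_len + t] for i in range(t)]


# Toy input: every row stores the relative position itself as the "score".
T = 3
scores = [[float(T - 1 - c) for c in range(2 * T - 1)] for _ in range(T)]
shifted = rel_shift(scores)
for row in shifted:
    print(row)
```

With this toy input, entry `[i][j]` of the result is `i - j`, which makes it easy to check that each query row has been aligned with the right relative offsets.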
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
Follow-up work to the initial TF encoder PR (#15661).