2 changes: 2 additions & 0 deletions i6_models/assemblies/transducer/__init__.py
@@ -0,0 +1,2 @@
from .prediction_network import *
from .joint_network import *
117 changes: 117 additions & 0 deletions i6_models/assemblies/transducer/joint_network.py
@@ -0,0 +1,117 @@
__all__ = ["TransducerJointNetworkV1Config", "TransducerJointNetworkV1"]

from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union

import torch
from torch import nn

from i6_models.config import ModelConfiguration
from i6_models.parts.ffnn import FeedForwardBlockV1Config, FeedForwardBlockV1


@dataclass
class TransducerJointNetworkV1Config(ModelConfiguration):
"""
Configuration for the Transducer Joint Network.

Attributes:
ffnn_cfg: Configuration for the internal feed-forward network.
"""

ffnn_cfg: FeedForwardBlockV1Config


class TransducerJointNetworkV1(nn.Module):
Member: Needs docs.

def __init__(
self,
cfg: TransducerJointNetworkV1Config,
enc_input_dim: int,
pred_input_dim: int,
) -> None:
super().__init__()
hidden_dim = cfg.ffnn_cfg.layer_sizes[0]
self.enc_proj = nn.Linear(enc_input_dim, hidden_dim, bias=True)
self.pred_proj = nn.Linear(pred_input_dim, hidden_dim, bias=False) # Bias handled by enc_proj

self.activation = cfg.ffnn_cfg.layer_activations[0]
self.dropout = nn.Dropout(cfg.ffnn_cfg.dropouts[0]) if cfg.ffnn_cfg.dropouts else nn.Identity()

# Build the rest of the network (if any)
if len(cfg.ffnn_cfg.layer_sizes) > 1:
remaining_cfg = FeedForwardBlockV1Config(
input_dim=hidden_dim,
layer_sizes=cfg.ffnn_cfg.layer_sizes[1:],
dropouts=cfg.ffnn_cfg.dropouts[1:] if cfg.ffnn_cfg.dropouts else None,
layer_activations=cfg.ffnn_cfg.layer_activations[1:],
use_layer_norm=cfg.ffnn_cfg.use_layer_norm,
)
self.ffnn = FeedForwardBlockV1(remaining_cfg)
else:
self.ffnn = nn.Identity()

self.output_dim = cfg.ffnn_cfg.layer_sizes[-1]

def _forward_joint(self, enc: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
# Project individually then broadcast-sum
enc_proj = self.enc_proj(enc).unsqueeze(2) # [B, T, 1, H]
pred_proj = self.pred_proj(pred).unsqueeze(1) # [B, 1, U, H]

combined = enc_proj + pred_proj

if self.activation is not None:
combined = self.activation(combined)
combined = self.dropout(combined)

return self.ffnn(combined)
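
A quick standalone shape sketch of the broadcast-sum above (made-up dimensions, not part of the PR):

    import torch
    B, T, U, H = 2, 5, 3, 8
    enc_proj = torch.randn(B, T, H).unsqueeze(2)   # [B, T, 1, H]
    pred_proj = torch.randn(B, U, H).unsqueeze(1)  # [B, 1, U, H]
    combined = enc_proj + pred_proj                # broadcasts to [B, T, U, H]
    assert combined.shape == (B, T, U, H)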

def forward(
self,
source_encodings: torch.Tensor, # [1, T, E]
target_encodings: torch.Tensor, # [B, S, P]
Contributor (on lines +70 to +71): are source_encodings the output of the acoustic encoder and the target_encodings the output of the prediction network? Maybe we could rename (+ document) this better.

) -> torch.Tensor: # [B, T, S, F]
"""
Forward pass for recognition.
"""
output = self._forward_joint(source_encodings, target_encodings)

if not self.training:
output = torch.log_softmax(output, dim=-1) # [B, T, S, F]
Member (on lines +78 to +79): I don't think I'm a big fan of switching between logits and log probs based on whether it's train time or not. I'd rather pass a parameter or leave the log softmax to the forward_step function.
Contributor: +1, I think we usually get logits in the train step and apply the appropriate softmax function there.

return output
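
Following up on the review thread above, a minimal sketch (hypothetical helper, not part of the PR) of keeping the joint network logits-only and applying log_softmax in the recognition / forward_step code instead:

    import torch

    def joint_log_probs(joint_net, source_encodings, target_encodings):
        # under the proposed change, joint_net(...) would always return raw logits
        logits = joint_net(source_encodings, target_encodings)  # [B, T, S, F]
        return torch.log_softmax(logits, dim=-1)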

def forward_viterbi(
self,
source_encodings: torch.Tensor, # [B, T, E]
source_lengths: torch.Tensor, # [B]
target_encodings: torch.Tensor, # [B, T, P]
target_lengths: torch.Tensor, # [B]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: # [B, T, F], [B], [B]
"""
Forward pass for Viterbi training.
"""
# For Viterbi, dimensions align (T=T), so we can sum directly without broadcasting
enc_proj = self.enc_proj(source_encodings)
pred_proj = self.pred_proj(target_encodings)
combined = enc_proj + pred_proj

if self.activation is not None:
combined = self.activation(combined)
combined = self.dropout(combined)

output = self.ffnn(combined) # [B, T, F]
if not self.training:
output = torch.log_softmax(output, dim=-1) # [B, T, F]
Member (on lines +102 to +103): See above.

return output, source_lengths, target_lengths

def forward_fullsum(
self,
source_encodings: torch.Tensor, # [B, T, E]
source_lengths: torch.Tensor, # [B]
target_encodings: torch.Tensor, # [B, S+1, P]
target_lengths: torch.Tensor, # [B]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: # [B, T, S+1, F], [B], [B]
"""
Forward pass for fullsum training. Returns output with shape [B, T, S+1, F].
"""
output = self._forward_joint(source_encodings, target_encodings)
return output, source_lengths, target_lengths
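
A hedged construction/usage sketch (not part of the PR; dimensions and config values are made up, and the FeedForwardBlockV1Config kwargs simply mirror the ones used in the constructor above):

    import torch
    from torch import nn

    joint_cfg = TransducerJointNetworkV1Config(
        ffnn_cfg=FeedForwardBlockV1Config(
            input_dim=640,                # effectively replaced by the enc/pred projections
            layer_sizes=[640, 1024, 79],  # last entry becomes the output dim (vocab + blank)
            dropouts=[0.1, 0.1, 0.0],
            layer_activations=[nn.Tanh(), nn.Tanh(), nn.Tanh()],  # trailing activation illustrative only
            use_layer_norm=False,
        )
    )
    joint = TransducerJointNetworkV1(joint_cfg, enc_input_dim=512, pred_input_dim=256)

    enc = torch.randn(4, 100, 512)   # [B, T, E]
    pred = torch.randn(4, 21, 256)   # [B, S+1, P]
    out, _, _ = joint.forward_fullsum(
        enc, torch.full((4,), 100), pred, torch.full((4,), 20)
    )
    # expected: out.shape == (4, 100, 21, 79)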
219 changes: 219 additions & 0 deletions i6_models/assemblies/transducer/prediction_network.py
@@ -0,0 +1,219 @@
__all__ = [
"EmbeddingTransducerPredictionNetworkV1Config",
"EmbeddingTransducerPredictionNetworkV1",
"FfnnTransducerPredictionNetworkV1Config",
"FfnnTransducerPredictionNetworkV1",
]

from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple, Union

import torch
from torch import nn

from i6_models.config import ModelConfiguration
from i6_models.parts.ffnn import FeedForwardBlockV1Config, FeedForwardBlockV1


@dataclass
class EmbeddingTransducerPredictionNetworkV1Config(ModelConfiguration):
"""
num_outputs: Number of output units (vocabulary size + blank).
blank_id: Index of the blank token.
context_history_size: Number of previous output tokens to consider as context
embedding_dim: Dimension of the embedding layer.
reduce_embedding: Whether to use a reduction mechanism for the context embedding.
num_reduction_heads: Number of reduction heads if reduce_embedding is True.
"""

num_outputs: int
blank_id: int
context_history_size: int
embedding_dim: int
reduce_embedding: bool
num_reduction_heads: Optional[int]

def __post_init__(self):
super().__post_init__()
assert (self.num_reduction_heads is not None) == self.reduce_embedding

@classmethod
def from_child(cls, child_instance):
return cls(
child_instance.num_outputs,
child_instance.blank_id,
child_instance.context_history_size,
child_instance.embedding_dim,
child_instance.reduce_embedding,
child_instance.num_reduction_heads,
)
Member (on lines +41 to +49): What does this do, why is it necessary and different from config2 = copy.deepcopy(config1)?
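
For context (not part of the PR): since the FFNN config subclasses the embedding config, the parent slice could also be rebuilt generically from the dataclass fields rather than listing them positionally, e.g.:

    import dataclasses

    def parent_config_from_child(child):
        # hypothetical helper: copy exactly the fields declared on the parent config
        parent_fields = {
            f.name: getattr(child, f.name)
            for f in dataclasses.fields(EmbeddingTransducerPredictionNetworkV1Config)
        }
        return EmbeddingTransducerPredictionNetworkV1Config(**parent_fields)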



class EmbeddingTransducerPredictionNetworkV1(nn.Module):
Member: Needs docs.

def __init__(self, cfg: EmbeddingTransducerPredictionNetworkV1Config) -> None:
Member: Suggested change:
-    def __init__(self, cfg: EmbeddingTransducerPredictionNetworkV1Config) -> None:
+    def __init__(self, cfg: EmbeddingTransducerPredictionNetworkV1Config):

super().__init__()
self.cfg = cfg
self.blank_id = self.cfg.blank_id
self.context_history_size = self.cfg.context_history_size
self.embedding = nn.Embedding(
num_embeddings=self.cfg.num_outputs,
embedding_dim=self.cfg.embedding_dim,
padding_idx=self.blank_id,
)
self.output_dim = (
self.cfg.embedding_dim * self.cfg.context_history_size
if not self.cfg.reduce_embedding
else self.cfg.embedding_dim
)

self.reduce_embedding = self.cfg.reduce_embedding
self.num_reduction_heads = self.cfg.num_reduction_heads
if self.reduce_embedding:
self.register_buffer(
"position_vectors",
torch.randn(
self.cfg.context_history_size,
self.cfg.num_reduction_heads,
self.cfg.embedding_dim,
),
)

def _reduce_embedding(self, emb: torch.Tensor) -> torch.Tensor:
"""
Reduces the context embedding using a weighted sum based on position vectors.
"""
emb_expanded = emb.unsqueeze(3) # [B, S, H, 1, E]
Member: Consider unsqueezing from the back.

pos_expanded = self.position_vectors.unsqueeze(0).unsqueeze(0) # [1, 1, H, K, E]
alpha = (emb_expanded * pos_expanded).sum(dim=-1, keepdim=True) # [B, S, H, K, 1]
weighted = alpha * emb_expanded # [B, S, H, K, E]
reduced = weighted.sum(dim=2).sum(dim=2) # [B, S, E]
Member: Consider indexing dims from the back.
Contributor: dim can be a tuple of ints, so we could do it in one step: https://docs.pytorch.org/docs/stable/generated/torch.sum.html
Suggested change:
-    reduced = weighted.sum(dim=2).sum(dim=2) # [B, S, E]
+    reduced = weighted.sum(dim=(-2, -1)) # [B, S, E]

reduced *= 1.0 / (self.cfg.num_reduction_heads * self.cfg.context_history_size)
return reduced
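
An equivalent one-step formulation (a sketch, not part of the PR; shapes as annotated above: emb is [B, S, H, E], position_vectors is [H, K, E]):

    alpha = torch.einsum("bshe,hke->bshk", emb, self.position_vectors)  # [B, S, H, K]
    reduced = torch.einsum("bshk,bshe->bse", alpha, emb)                # [B, S, E]
    reduced = reduced / (self.cfg.num_reduction_heads * self.cfg.context_history_size)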

def _forward_embedding(self, history: torch.Tensor) -> torch.Tensor:
"""
Processes the input history through the embedding layer and optional reduction.
"""
if len(history.shape) == 2: # reshape if input shape [B, H]
history = history.view(*history.shape[:-1], 1, history.shape[-1]) # [B, 1, H]
Contributor: *history.shape[:-1] reads odd; that should be the same as history.shape[0], since we have len(history.shape) == 2. But talk to @NeoLegends about making this work with more batch dims.

embed = self.embedding(history) # [B, S, H, E]
if self.reduce_embedding:
embed = self._reduce_embedding(embed) # [B, S, E]
else:
embed = embed.flatten(start_dim=-2) # [B, S, H*E]
return embed

def forward(
self,
history: torch.Tensor, # [B, H]
) -> torch.Tensor: # [B, 1, P]
"""
Forward pass for recognition mode.
"""
embed = self._forward_embedding(history)
return embed

def forward_fullsum(
self,
targets: torch.Tensor, # [B, S]
target_lengths: torch.Tensor, # [B]
) -> Tuple[torch.Tensor, torch.Tensor]: # [B, S + 1, P], [B]
"""
Forward pass for fullsum training.
"""
non_context_padding = torch.full(
(targets.size(0), self.cfg.context_history_size),
fill_value=self.blank_id,
dtype=targets.dtype,
device=targets.device,
) # [B, H]
extended_targets = torch.cat([non_context_padding, targets], dim=1) # [B, S+H]
history = torch.stack(
[
extended_targets[:, self.cfg.context_history_size - 1 - i : (-i if i != 0 else None)]
for i in reversed(range(self.cfg.context_history_size))
],
dim=-1,
) # [B, S+1, H]
Contributor (on lines +124 to +137): ChatGPT suggested this code:

        B, S = targets.shape
        H = self.cfg.context_history_size

        # Pad left with H blanks: [B, S+H]
        extended = F.pad(targets, (H, 0), value=self.blank_id)

        # Unfold over sequence dim to get [B, S+1, H]
        # (PyTorch: unfold(size=H, step=1) "slides" a length-H window)
        history = extended.unfold(dimension=1, size=H, step=1)  # [B, S+1, H]

embed = self._forward_embedding(history)

return embed, target_lengths
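
A tiny standalone check of the window construction above (not part of the PR; the stack-based code and the unfold suggestion produce the same histories for this example):

    import torch
    import torch.nn.functional as F

    targets = torch.tensor([[3, 5, 7]])                      # [B=1, S=3]
    H, blank_id = 2, 0
    extended = F.pad(targets, (H, 0), value=blank_id)        # [[0, 0, 3, 5, 7]]
    history = extended.unfold(dimension=1, size=H, step=1)   # [1, 4, 2] == [B, S+1, H]
    # history[0] == [[0, 0], [0, 3], [3, 5], [5, 7]]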

def forward_viterbi(
self,
targets: torch.Tensor, # [B, T]
target_lengths: torch.Tensor, # [B]
) -> Tuple[torch.Tensor, torch.Tensor]: # [B, T, P], [B]
"""
Forward pass for viterbi training.
"""
B, T = targets.shape
history = torch.zeros(
(B, T, self.cfg.context_history_size),
dtype=targets.dtype,
device=targets.device,
) # [B, T, H]
recent_labels = torch.full(
(B, self.cfg.context_history_size),
fill_value=self.blank_id,
dtype=targets.dtype,
device=targets.device,
) # [B, H]

for t in range(T):
history[:, t, :] = recent_labels
current_labels = targets[:, t]
non_blank_positions = current_labels != self.blank_id
recent_labels[non_blank_positions, :-1] = recent_labels[non_blank_positions, 1:]
recent_labels[non_blank_positions, -1] = current_labels[non_blank_positions]
embed = self._forward_embedding(history)

return embed, target_lengths
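
A small illustrative trace of the history update above (values made up, blank_id = 0, context_history_size = 2):

    # targets[0]       = [5, 0, 7]
    # history[0, 0, :] = [0, 0]   -> nothing emitted before t=0
    # history[0, 1, :] = [0, 5]   -> 5 at t=0 was non-blank, so it is shifted in
    # history[0, 2, :] = [0, 5]   -> blank at t=1 leaves the context unchanged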


@dataclass
class FfnnTransducerPredictionNetworkV1Config(EmbeddingTransducerPredictionNetworkV1Config):
"""
Attributes:
ffnn_cfg: Configuration for FFNN prediction network
"""

ffnn_cfg: FeedForwardBlockV1Config


class FfnnTransducerPredictionNetworkV1(EmbeddingTransducerPredictionNetworkV1):
Member: I think this class would benefit from using composition instead of inheritance. Make it contain/own an EmbeddingTransducerPredictionNetworkV1 instead of inheriting from one. That resolves all your issues wrt. config nesting/updating.

"""
FfnnTransducerPredictionNetworkV1 with feedforward layers.
"""

def __init__(self, cfg: FfnnTransducerPredictionNetworkV1Config):
super().__init__(EmbeddingTransducerPredictionNetworkV1Config.from_child(cfg))
Member: Since the first config inherits from the second one, you are able to just:
Suggested change:
-    super().__init__(EmbeddingTransducerPredictionNetworkV1Config.from_child(cfg))
+    super().__init__(cfg)
Member: EDIT: With composition instead of inheritance, this comment is no longer relevant.

cfg.ffnn_cfg.input_dim = self.output_dim
self.ffnn = FeedForwardBlockV1(cfg.ffnn_cfg)
Member (on lines +191 to +192): Leave the configs immutable. Always safer wrt. bugs.
Suggested change:
-    cfg.ffnn_cfg.input_dim = self.output_dim
-    self.ffnn = FeedForwardBlockV1(cfg.ffnn_cfg)
+    self.ffnn = FeedForwardBlockV1(
+        dataclasses.replace(
+            cfg,
+            ffnn_cfg=dataclasses.replace(cfg.ffnn_cfg, input_dim=self.output_dim),
+        )
+    )
This creates copies of the dataclasses as needed.
Contributor: Or we could not change anything and throw an error if a value is configured that is wrong.

self.output_dim = self.ffnn.output_dim

def forward(
self,
history: torch.Tensor, # [B, H]
) -> torch.Tensor: # [B, 1, P]
embed = super().forward(history)
output = self.ffnn(embed)
return output

def forward_fullsum(
self,
targets: torch.Tensor, # [B, S]
target_lengths: torch.Tensor, # [B]
Contributor: the target_lengths seems to be unused in any of the forward calls. Is it needed?

) -> Tuple[torch.Tensor, torch.Tensor]: # [B, S + 1, P], [B]
embed, _ = super().forward_fullsum(targets, target_lengths)
output = self.ffnn(embed)
return output, target_lengths

def forward_viterbi(
self,
targets: torch.Tensor, # [B, T]
target_lengths: torch.Tensor, # [B]
) -> Tuple[torch.Tensor, torch.Tensor]: # [B, T, P], [B]
embed, _ = super().forward_viterbi(targets, target_lengths)
output = self.ffnn(embed)
return output, target_lengths
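
Picking up the composition-over-inheritance suggestion from the review, a rough sketch (hypothetical class name, not part of the PR) of what the composed variant could look like, reusing the reviewer's dataclasses.replace idea:

    import dataclasses

    class FfnnTransducerPredictionNetworkComposed(nn.Module):
        def __init__(self, cfg: FfnnTransducerPredictionNetworkV1Config):
            super().__init__()
            # own an embedding prediction network instead of inheriting from it
            self.embedding_net = EmbeddingTransducerPredictionNetworkV1(cfg)
            self.ffnn = FeedForwardBlockV1(
                dataclasses.replace(cfg.ffnn_cfg, input_dim=self.embedding_net.output_dim)
            )
            self.output_dim = self.ffnn.output_dim

        def forward(self, history: torch.Tensor) -> torch.Tensor:  # [B, H] -> [B, 1, P]
            return self.ffnn(self.embedding_net(history))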