Questions Regarding Decoder Design and Integration in CrossViViT #15

@zezhishao

Description

Dear Authors,

Thank you for your excellent work, which has been incredibly valuable for our research.

However, I have a few questions about the decoder, specifically `self.temporal_transformer`. From my understanding, the time series representation `latent_ts` is fed directly into `self.temporal_transformer`. Here, `latent_ts` is a tensor of shape `[B, L, D]`, where `L` is the length of the input time series. Since `self.temporal_transformer` acts as a Transformer encoder, it also outputs a tensor of shape `[B, L, D]`.
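To make the shape concern concrete, here is a minimal single-head self-attention sketch in NumPy (my own illustration, not the authors' code): because every one of the `L` input positions attends to all `L` positions and produces one output vector, the output sequence length necessarily equals the input length.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention on x of shape [L, D]; returns [L, D]."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)              # [L, L] attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ x                          # [L, D]: one output per input step

L, D = 7, 4
out = self_attention(np.random.randn(L, D))
assert out.shape == (L, D)  # output length is tied to input length L
```

This is exactly why an encoder-only temporal module returns `[B, L, D]` for a `[B, L, D]` input, which motivates question 1 below.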

I have two main questions:

  1. Does this imply that the input length and output length must be identical? In my case, the input sequences are longer (e.g., one week) than the output sequences (e.g., one day). How can I integrate CrossViViT into this scenario?
  2. What is the rationale for using a Transformer encoder in this design? In an encoder setup, the prediction at time step T+1 appears to be directly influenced by the value at step T-L, even though multi-head self-attention does allow some information exchange across positions. Could you elaborate on the reasoning behind this design choice? For prediction tasks, a more common approach is an auto-regressive Transformer decoder. Have you experimented with that type of decoder architecture?
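Regarding question 1, one workaround I could imagine (an assumption on my part, not something taken from the CrossViViT code) is to keep the encoder as-is and map its length-`L_in` output to a shorter horizon `L_out` with a learned linear projection over the time axis, in the style of a linear forecasting head:

```python
import numpy as np

rng = np.random.default_rng(0)
B, L_in, L_out, D = 2, 168, 24, 8  # e.g. one week of hourly steps in, one day out

# Stand-in for the [B, L_in, D] output of the temporal transformer.
encoder_out = rng.standard_normal((B, L_in, D))

# Hypothetical projection over the time dimension; this matrix would be a
# learnable parameter in practice.
W_time = rng.standard_normal((L_in, L_out)) * 0.01

# Contract the time axis: [B, L_in, D] x [L_in, L_out] -> [B, L_out, D].
forecast = np.einsum("bld,lo->bod", encoder_out, W_time)
assert forecast.shape == (B, L_out, D)
```

Would such a head be compatible with your design, or do you see a better way to handle mismatched input and output lengths?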

Thank you again for your valuable insights.

Best regards!
