Official implementation of the paper "Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs"

FuCongResearchSquad/ReSID

ReSID

This repository provides a PyTorch reference implementation of the main models and training procedures described in our paper:

Yu Liang*, Zhongjin Zhang*, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, and Jiazhi Xia. Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs.

Overview

We propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty.
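To make the notion of prefix-conditional uncertainty more concrete, the sketch below computes an empirical conditional entropy of the SID token at a given level, given the tokens before it. This is only one possible reading of the term and is not the GAOQ objective from the paper; the function name `prefix_conditional_entropy` and the toy SID tuples are hypothetical and do not come from this repository.

```python
import math
from collections import Counter


def prefix_conditional_entropy(sids, level):
    """Empirical H(c_level | c_0..c_{level-1}) in bits over a list of SID tuples."""
    # Count how often each prefix and each (prefix + next token) occurs.
    prefix_counts = Counter(s[:level] for s in sids)
    joint_counts = Counter(s[:level + 1] for s in sids)
    n = len(sids)
    entropy = 0.0
    for joint, count in joint_counts.items():
        p_joint = count / n                              # P(prefix, token)
        p_cond = count / prefix_counts[joint[:level]]    # P(token | prefix)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


# Toy SIDs (hypothetical). In `predictable`, the second token is fully
# determined by the first; in `ambiguous`, it is a coin flip.
predictable = [(0, 1), (0, 1), (1, 2), (1, 2)]
ambiguous = [(0, 1), (0, 2), (1, 1), (1, 2)]

print(prefix_conditional_entropy(predictable, 1))  # 0.0
print(prefix_conditional_entropy(ambiguous, 1))    # 1.0
```

Under this reading, a lower value means that later SID tokens are easier to predict from their prefix, which is the property the overview attributes to GAOQ.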

[Figure: overview of the ReSID framework]

Project Structure

The structure of this repository is as follows:

.
├── config/                   # All *.yaml configuration files for the pipeline
├── dataset/                  # Amazon-2023 review dataset processing code
├── model/                    # Model implementations
├── logger.py                 # Logging utilities for printing runtime outputs
├── main.py                   # Main entry point for training and evaluation
├── metrics.py                # Evaluation-related code
├── requirements.txt          # List of required Python packages and dependencies
├── run_pipelines.py          # One-click script to run the full ReSID pipeline
├── trainer.py                # Training script
├── utils.py                  # Training utilities, mainly for data loading
└── README.md                 # This file

Experiments

Setup

We recommend installing dependencies using requirements.txt. This setup has been tested on Ubuntu 18.04, CUDA 12.4, and Python 3.12.

pip3 install -r requirements.txt

Data Processing

  1. Download the ten Amazon-2023 review subsets used in our experiments by running:

    bash dataset/download_amazon_2023.sh
    bash dataset/download_amazon_2023_statistics.sh

  2. Preprocess the downloaded data:

    python dataset/data_process.py

Training

To run ReSID, use the following command:

python run_pipelines.py --dataset Musical_Instruments --device cuda:0

Set --dataset to the name of the Amazon-2023 subset you want to run, and --device to the device to run on.
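If you want to run several subsets in sequence, a small driver script can build the command shown above once per dataset. The following is a minimal sketch, not part of this repository: the helpers `build_command` and `run_all` are hypothetical, and only the `--dataset` and `--device` flags from the command above are used. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def build_command(dataset, device="cuda:0"):
    """Build the run_pipelines.py command line for one dataset."""
    return ["python", "run_pipelines.py", "--dataset", dataset, "--device", device]


def run_all(datasets, device="cuda:0", dry_run=True):
    """Print each command and, unless dry_run is set, execute them one by one."""
    commands = [build_command(name, device) for name in datasets]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop if one pipeline run fails.
            subprocess.run(cmd, check=True)
    return commands


# Extend this list with the names of the other subsets you downloaded.
run_all(["Musical_Instruments"])
```

Pass `dry_run=False` from the repository root to launch the runs.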

Results

[Figure: experimental results]

Citation

If you find this repository helpful, please consider citing our paper:

@misc{ReSID,
      title={Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs}, 
      author={Yu Liang and Zhongjin Zhang and Yuxuan Zhu and Kerui Zhang and Zhiluohan Guo and Wenhang Zhou and Zonqi Yang and Kangle Wu and Yabo Ni and Anxiang Zeng and Cong Fu and Jianxin Wang and Jiazhi Xia},
      year={2026},
      eprint={2602.02338},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.02338}, 
}
