This repository provides a PyTorch reference implementation of the main models and training procedures described in our paper:
Yu Liang*, Zhongjin Zhang*, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, and Jiazhi Xia. Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs.
We propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty.
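For intuition, below is a minimal, illustrative PyTorch sketch of the general idea behind field-aware masked auto-encoding: per-field embeddings of an item's structured features are masked at the field level, and an encoder is trained to reconstruct the masked fields. All module names, dimensions, and the reconstruction loss here are placeholders for illustration only and do not correspond to the actual implementation in `model/`.

```python
import torch
import torch.nn as nn


class FieldMaskedAutoEncoderSketch(nn.Module):
    """Illustrative sketch only: field-level masking + reconstruction.

    Assumes each item is described by `num_fields` categorical features
    (e.g. category, brand, price bucket); the real feature handling differs.
    """

    def __init__(self, vocab_sizes, dim=64, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        # one embedding table per field (placeholder vocab sizes)
        self.field_embeds = nn.ModuleList(nn.Embedding(v, dim) for v in vocab_sizes)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # per-field heads that predict the original field token id
        self.decoders = nn.ModuleList(nn.Linear(dim, v) for v in vocab_sizes)

    def forward(self, field_ids):
        # field_ids: (batch, num_fields) integer ids, one per field
        tokens = torch.stack(
            [emb(field_ids[:, i]) for i, emb in enumerate(self.field_embeds)], dim=1
        )  # (batch, num_fields, dim)
        # randomly mask whole fields and replace them with a learned mask token
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        hidden = self.encoder(tokens)
        # reconstruct only the masked fields
        loss = tokens.new_zeros(())
        for i, dec in enumerate(self.decoders):
            if mask[:, i].any():
                logits = dec(hidden[mask[:, i], i])
                loss = loss + nn.functional.cross_entropy(logits, field_ids[mask[:, i], i])
        return loss
```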
The structure of this repository is as follows:
.
├── config/ # All *.yaml configuration files for the pipeline
├── dataset/ # Amazon-2023 review dataset processing code
├── model/ # Model implementations
├── logger.py # Logging utilities for printing runtime outputs
├── main.py # Main entry point for training and evaluation
├── metrics.py # Evaluation-related code
├── requirements.txt # List of required Python packages and dependencies
├── run_pipelines.py # One-click script to run the full ReSID pipeline
├── trainer.py # Training script
├── utils.py # Training utilities, mainly for data loading
└── README.md # This file

We recommend installing dependencies using requirements.txt. This setup has been tested on Ubuntu 18.04, CUDA 12.4, and Python 3.12.
pip3 install -r requirements.txt
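Optionally, you can sanity-check that PyTorch sees your GPU before running the pipeline (the exact PyTorch/CUDA versions pinned in requirements.txt may differ from this example):

```python
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```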
Download the ten Amazon-2023 review subsets used in our experiments by running:
bash dataset/download_amazon_2023.sh
bash dataset/download_amazon_2023_statistics.sh
Preprocess the downloaded data:
python dataset/data_process.py
To run ReSID, use the following command:
python run_pipelines.py --dataset Musical_Instruments --device cuda:0
Set --dataset to the name of the dataset you want to run.
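To run several of the Amazon-2023 subsets back to back, a simple wrapper such as the one below can be used; the subset names listed here (other than Musical_Instruments) are placeholders, so replace them with the subsets you actually downloaded and preprocessed.

```python
import subprocess

# Placeholder subset names; substitute the Amazon-2023 subsets you downloaded.
datasets = ["Musical_Instruments", "Video_Games", "Baby_Products"]

for name in datasets:
    subprocess.run(
        ["python", "run_pipelines.py", "--dataset", name, "--device", "cuda:0"],
        check=True,
    )
```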
If you find this repository helpful, please consider citing our paper:
@misc{ReSID,
title={Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs},
author={Yu Liang and Zhongjin Zhang and Yuxuan Zhu and Kerui Zhang and Zhiluohan Guo and Wenhang Zhou and Zonqi Yang and Kangle Wu and Yabo Ni and Anxiang Zeng and Cong Fu and Jianxin Wang and Jiazhi Xia},
year={2026},
eprint={2602.02338},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2602.02338},
}
