How To AI (Almost) Anything #71

chengjun (Maintainer) · 2025-09-15T10:39:50Z

MAS.S60 How2AI
Schedule
**Exact topics and schedule are subject to change, based on student interests and course discussions.**

Date Topics Readings
2/4 Week 1 Introduction [slides] [video]
Course syllabus and requirements
Introduction to AI and AI research
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal Machine Learning: A Survey and Taxonomy
Representation Learning: A Review and New Perspectives

2/6 Week 1 Introduction to AI Research [slides] [video]
Introduction to AI and AI research
Generating ideas, reading and writing papers, AI experimentation

2/11 Week 2 Foundation 1: Data, structure, information [slides] [video]
Common data modalities
Data collection strategies
Training objectives and generalization
Machine learning: Trends, Perspectives, and Prospects
Representation Learning: A Review and New Perspectives
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

2/14 Week 2 Foundation 2: Practical AI tools [slides] [video]
Getting started with PyTorch
Hugging Face packages
Debugging machine learning models
A Recipe for Training Neural Networks
Fine-tuning a Code LLM on Custom Code on a single GPU
MAS.S60 PyTorch Introduction

2/18 Week 3 No class (schedule shifted for Presidents' Day)
2/20 Week 3 Project proposal presentations

2/25 Week 4 Foundation 3: Model architectures [slides] [video]
Structure and invariances
Temporal sequence models
Spatial convolution models
Models for sets and graphs
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Attention Is All You Need
Neural Machine Translation by Jointly Learning to Align and Translate
Deep Sets
Graph Attention Networks

2/27 Week 4 Discussion 1: Learning and generalization
Learning the Bitter Lesson
Unifying Grokking and Double Descent
Generalization in Neural Networks
Textbooks Are All You Need
A Conceptual Pipeline for Machine Learning

3/4 Week 5 Multimodal 1: Connections and alignment [slides] [video]
Heterogeneity, connections, and interactions
Multimodal technical challenges
Alignment and transformers
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
What Makes for Good Views for Contrastive Learning?
Characterization and classification of semantic image-text relations
When and why vision-language models behave like bags-of-words, and what to do about it?

3/6 Week 5 Discussion 2: Modern AI architectures
Scaling Laws for Generative Mixed-Modal Models
Not All Tokens Are What You Need for Pretraining
PaLI: A Jointly-Scaled Multilingual Language-Image Model
The Evolution of Multimodal Model Architectures
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A ConvNet for the 2020s
Inductive Representation Learning on Large Graphs
Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs

3/11 Week 6 Multimodal 2: Interactions and fusion [slides] [video]
Cross-modal interactions
Multimodal fusion
Ten Myths of Multimodal Interaction
Multimodal interaction: A review
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!

3/13 Week 6 Discussion 3: Multimodal alignment
The Platonic Representation Hypothesis
What Makes for Good Views for Contrastive Learning?
Understanding the Emergence of Multimodal Representation Alignment
Does equivariance matter at scale?
Learning Transferable Visual Models From Natural Language Supervision
Emerging Properties in Self-Supervised Vision Transformers
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

3/18 Week 7 Multimodal 3: Cross-modal transfer [slides] [video]
Cross-modal learning via fusion
Cross-modal learning via alignment
Cross-modal learning via translation
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
DreamLLM: Synergistic Multimodal Comprehension and Creation
PaLM-E: An Embodied Multimodal Language Model

3/20 Week 7 Discussion 4: Multimodal interactions
Multimodal interaction: A review
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
Kosmos-2: Grounding Multimodal Large Language Models to the World
Chameleon: Mixed-modal early-fusion foundation models
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

3/25 Week 8 No class, spring break

4/1 Week 9 Large models 1: Large foundation models [slides] [video]
Pre-training data
Self-supervised learning
Fine-tuning, instructing, alignment
Training Compute-Optimal Large Language Models
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
LoRA: Low-Rank Adaptation of Large Language Models
A Visual Guide to Mixture of Experts (MoE)
A Visual Guide to Quantization
Improved Baselines with Visual Instruction Tuning

4/3 Week 9 Project midterm presentations

4/8 Week 10 No class, member's week

4/15 Week 11 Large models 2: Large multimodal models [slides] [video]
Multimodal pre-training
Adapting large language models to multimodal
Multimodal LLMs with generation
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Multimodal Transformer for Unaligned Multimodal Language Sequences
Masked Autoencoders Are Scalable Vision Learners
Scaling Laws for Native Multimodal Models
Transfer between Modalities with MetaQueries

4/17 Week 11 Discussion 5: Large language models
LoRA: Low-Rank Adaptation of Large Language Models
Gated Linear Attention Transformers with Hardware-Efficient Training
Unintended Impacts of LLM Alignment on Global Representation
A Visual Guide to Quantization
Scaling Instruction-Finetuned Language Models

4/22 Week 12 Large models 3: Modern generative models [slides] [video]
Diffusion models
Controllable generation
Flow Matching
Scalable Diffusion Models with Transformers
Flow Matching in Latent Space
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Movie Gen: A Cast of Media Foundation Models

4/24 Week 12 Discussion 6: Large multimodal models
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
ModaVerse: Efficiently Transforming Modalities with LLMs
Spider: Any-to-Many Multimodal LLM
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
NExT-GPT: Any-to-Any Multimodal LLM
Learning to rebalance multi-modal optimization by adaptively masking subnetworks

4/29 Week 13 No class, CHI week
5/1 Week 13 Discussion 7: Generative AI
Large Language Diffusion Models
Compositional Generative Modeling: A Single Model is Not All You Need
Flow Matching for Generative Modeling
Flow Matching Guide and Code
FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-to-Motion Generation
MusFlow: Multimodal Music Generation via Conditional Flow Matching
Unraveling the Connections Between Flow Matching and Diffusion Probabilistic Models
Exploring Diffusion and Flow Matching Under Generator Matching

5/6 Week 14 Interaction 1: Interactive agents and reasoning [slides] [video]
Reinforcement learning
Multi-step reasoning
Deep reinforcement learning from human preferences
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning
Faulty reward functions in the wild
Direct preference optimization: Your language model is secretly a reward model

5/8 Week 14 Project final presentations

5/13 Week 15 Interaction 2: Human AI interaction [slides] [video]
Interaction mediums
Human in the loop learning
Safety and reliability
Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving
VideoWebArena: Evaluating Multimodal Agents on Video Understanding Web Tasks
OpenVLA: An Open-Source Vision-Language-Action Model
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Guidelines for Human-AI Interaction
