- Clone the repo
- Create and activate a virtual environment with either conda or venv, and make sure the environment uses at least Python 3.10
- Run `pip install -r requirements.txt` in the `lloom` folder
- Run `npm install` and `npm run dev` in the `lloom` folder. This builds the workbench, which is needed for the experiment notebooks. You can cancel the `npm run dev` process afterwards
- Create a `.env` file in the `lloom` folder and populate it with `OPENAI_API_KEY=<your key>`
- Create a `concept_logs` folder in the `lloom` folder. This is where lloom outputs can be stored, should you want to keep them
- Get the `data` folder, which contains several xlsx files, from someone and add it at the top level of the `lloom` folder (see the sketch after this list)
- If you're using the `ipynb` files to test, make sure that the kernel in the Jupyter notebook is set properly
- Keep `id_col` at the default, because the data does not have unique comment ids for each comment
- Many of the tests in `old_tests.ipynb` are probably not returning what you think they should, because the default prompts in `prompts.py` have been modified. There is definitely another way to modify the prompts than editing the defaults in place
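As a quick check that the virtual environment, `.env` file, and `data` folder are wired up correctly, a minimal sketch like the following can be run in a notebook. The xlsx filename and the use of `python-dotenv` are assumptions, not part of the repo's documented setup:

```python
# Hypothetical smoke test for the setup above; "example.xlsx" is a placeholder
# for one of the xlsx files in the data folder.
import pandas as pd
from dotenv import load_dotenv  # python-dotenv; assumed to be installed in the environment

load_dotenv()  # reads OPENAI_API_KEY from the .env file in the lloom folder

df = pd.read_excel("data/example.xlsx")
print(df.shape)
print(df.columns.tolist())  # confirm which column holds the comment text
```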
LLooM is an interactive text analysis tool introduced as part of an ACM CHI 2024 paper:
Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM. Michelle S. Lam, Janice Teoh, James Landay, Jeffrey Heer, Michael S. Bernstein. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24).
LLooM is an interactive data analysis tool for unstructured text data, such as social media posts, paper abstracts, and articles. Manual text analysis is laborious and challenging to scale to large datasets, and automated approaches like topic modeling and clustering tend to focus on lower-level keywords that can be difficult for analysts to interpret.
By contrast, the LLooM algorithm turns unstructured text into meaningful high-level concepts that are defined by explicit inclusion criteria in natural language. For example, on a dataset of toxic online comments, while a BERTopic model outputs "women, power, female", LLooM produces concepts such as "Criticism of gender roles" and "Dismissal of women's concerns". We call this process concept induction: a computational process that produces high-level concepts from unstructured text.
The LLooM Workbench is an interactive text analysis tool that visualizes data in terms of the concepts that LLooM surfaces. With the LLooM Workbench, data analysts can inspect the automatically-generated concepts and author their own custom concepts to explore the data.
LLooM can assist with a range of data analysis goals—from preliminary exploratory analysis to theory-driven confirmatory analysis. Analysts can review LLooM concepts to interpret emergent trends in the data, but they can also author concepts to actively seek out certain phenomena in the data. Concepts can be compared with existing metadata or other concepts to perform statistical analyses, generate plots, or train a model.
Check out the Examples section to walk through case studies using LLooM, including:
- 🇺🇸📱 Political social media: Case Study | Colab NB
- 💬⚖️ Content moderation: Case Study | Colab NB
- 📄📈 HCI paper abstracts: Case Study | Colab NB
- 📝🤖 AI ethics statements: Case Study | Colab NB
After running concept induction, the Workbench can display an interactive visualization like the one above. LLooM Workbench features include:
- A: Concept Overview: Displays an overview of the dataset in terms of concepts and their prevalence.
- B: Concept Matrix: Provides an interactive summary of the concepts. Users can click on concept rows to inspect concept details and associated examples. Aids comparison between concepts and other metadata columns with user-defined slice columns.
- C: Detail View (for Concept or Slice):
- C1: Concept Details: Includes concept information like the Name, Inclusion criteria, Number of doc matches, and Representative examples.
- C2: Concept Matches and Non-Matches: Shows all input documents in table form. Includes the original text, bullet summaries, concept scores, highlighted text that exemplifies the concept, score rationale, and metadata columns.
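In a notebook, these views are opened from an existing LLooM session after scoring. A minimal sketch, assuming a scored `text_lloom` session `l`; the `slice_col` parameter and the column name are taken as assumptions from the LLooM docs rather than verified here:

```python
# Minimal sketch: open the Workbench views from a scored lloom session `l`.
# "source" is a placeholder for one of your metadata columns.
l.vis()                    # A: Concept Overview, B: Concept Matrix, C: Detail View
l.vis(slice_col="source")  # compare concepts against a user-defined slice column
```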
LLooM is a concept induction algorithm that extracts and applies concepts to make sense of unstructured text datasets. LLooM leverages large language models (specifically GPT-3.5 and GPT-4 in the current implementation) to synthesize sampled text spans, generate concepts defined by explicit criteria, apply concepts back to data, and iteratively generalize to higher-level concepts.
Follow the Get Started instructions in our documentation for a walkthrough of the main LLooM functions to run on your own dataset. We suggest starting with this template Colab Notebook.
This will involve downloading our Python package, available on PyPI as text_lloom. We recommend setting up a virtual environment with venv or conda.
`pip install text_lloom`

LLooM is a research prototype and still under active development! Feel free to reach out to Michelle Lam at mlam4@cs.stanford.edu if you have questions, run into issues, or want to contribute.
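As a rough picture of what that walkthrough covers, here is a minimal end-to-end sketch. It assumes the high-level `text_lloom` workbench API described in the LLooM documentation; the function names, default arguments, and column names below are assumptions, not checked against the installed version:

```python
# Minimal sketch of a concept induction run with text_lloom (names assumed from
# the LLooM docs; run inside a Jupyter notebook so the async calls can be awaited).
import pandas as pd
import text_lloom.workbench as wb

df = pd.read_csv("your_data.csv")          # placeholder dataset
l = wb.lloom(df=df, text_col="text")       # id_col can be left at its default

await l.gen(seed=None)                     # synthesize spans and generate concepts
l.select()                                 # review and select concepts to keep
score_df = await l.score()                 # apply selected concepts back to all documents
l.vis()                                    # open the LLooM Workbench visualization
```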
If you find this work useful to you, we'd appreciate you citing our paper!
@inproceedings{lam2024conceptInduction,
author = {Lam, Michelle S. and Teoh, Janice and Landay, James and Heer, Jeffrey and Bernstein, Michael S.},
title = {Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM},
year = {2024},
isbn = {9798400703300},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3613904.3642830},
doi = {10.1145/3613904.3642830},
booktitle = {Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems},
articleno = {933},
numpages = {28},
location = {Honolulu, HI, USA},
series = {CHI '24}
}
