SPINOS [1]: A Dataset of Subtle Polarity and INtensity Opinion Shifts
You can find here an example of the annotation template for abortion.
The dataset is introduced and analyzed in our paper: "Investigating User Radicalization: A Novel Dataset for Identifying Fine-Grained Temporal Shifts in Opinion" by Flora Sakketou, Allison Lahnala, Liane Vogel, and Lucie Flek.
Read here a blog post about our paper.
You will need Python>=3.8 and the following package to be installed in order to load the dataset:

    pip install pandas==1.4.1

You can read the data with:

    import pandas as pd
    spinos_official = pd.read_pickle('data/Spinos_official_dataset_v1.1.pkl')
    thread_posts = pd.read_pickle('data/Spinos_context_posts_v1.1.pkl')

The index of the rows corresponds to the IDs of the posts in the Reddit API.
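Because the index holds the post IDs, rows can be selected by ID with `.loc`. The following is a minimal sketch on a toy DataFrame, not the real data: the IDs and the `wanted` list are invented for illustration, and the same pattern should carry over to `spinos_official` or `thread_posts`.

```python
import pandas as pd

# Toy frame standing in for spinos_official; the IDs below are made up.
toy = pd.DataFrame(
    {"content": ["first", "second", "third"]},
    index=["abc123", "def456", "ghi789"],
)

# Hypothetical list of post IDs we would like to look up.
wanted = ["def456", "zzz000", "abc123"]

# Split the requested IDs into those present in the index and those absent,
# so that .loc is only called with IDs that exist.
present = [pid for pid in wanted if pid in toy.index]
missing = [pid for pid in wanted if pid not in toy.index]

subset = toy.loc[present]
print(subset)
print("missing:", missing)
```

Checking membership first avoids the `KeyError` that `.loc` raises when any requested label is absent.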
The dataframe contains the following columns:
- author_id (str): Which user has posted this particular post. Note: The usernames are anonymous.
- title (str): The title of the post, if it exists.
- content (str): The content of the post, if it exists.
- annotation (str): Majority vote of the non-expert annotators.
- topic (str): The topic this post is about. Possible values: 'abortion', 'feminism', 'brexit', 'veganism', 'guns', 'nuclear-energy', 'capitalism', 'climate-change'.
- subreddit (str): The subreddit this post was posted on.
- is_sarcastic (str): Whether the stance is sarcastic. Values = ['', 'No', 'Yes'].
- is_unsure (str): How many annotators stated that they were unsure of the annotation. Values = ['', 'No', '1/3', '2/3'].
- is_explicit (str): Whether the stance is explicitly stated. Values = ['No', 'Yes'].
- top_level_post: How many annotators requested to read the top-level post in order to do the annotation. Values = ['No', '1', '2', '3'].
- toplevel_id (str): ID of the top-level post in the thread (prev. top_level_post_id).
- parents: How many annotators requested to read the parent posts in order to do the annotation. Values = ['No', '1', '2', '3'].
- parent_ids (list): List of the parent posts' IDs that were used for the annotation.
- timestamp (pandas Timestamp): Date and time when the post was posted.
- parent_id (str): ID of the post's parent.
- n_parents (int): Number of posts between the root and the post.
- n_children (int): Number of children the post has (one level down).
- children_ids_limited (list): IDs of the post's children; we share up to two children.
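As an illustration of how these columns fit together, the sketch below groups posts on one topic by author and orders each author's annotations by timestamp. It runs on a toy DataFrame rather than the real data: the IDs, user names, and annotation labels (`label_x`, `label_y`) are invented placeholders, and the helper names `annotation_timelines` and `authors_with_change` are hypothetical, not part of the dataset's tooling.

```python
import pandas as pd

# Toy stand-in for spinos_official: the IDs, authors, and annotation labels
# below are invented placeholders, not values from the real dataset.
toy = pd.DataFrame(
    {
        "author_id": ["user_a", "user_a", "user_b", "user_b", "user_a"],
        "topic": ["abortion", "abortion", "abortion", "brexit", "abortion"],
        "annotation": ["label_x", "label_y", "label_x", "label_x", "label_y"],
        "timestamp": pd.to_datetime(
            ["2020-01-03", "2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05"]
        ),
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)


def annotation_timelines(df, topic):
    """Return {author_id: [annotations in chronological order]} for one topic."""
    subset = df[df["topic"] == topic].sort_values("timestamp")
    return subset.groupby("author_id")["annotation"].apply(list).to_dict()


def authors_with_change(timelines):
    """Return authors whose timeline contains more than one distinct label."""
    return sorted(a for a, labels in timelines.items() if len(set(labels)) > 1)


timelines = annotation_timelines(toy, "abortion")
print(timelines)
print(authors_with_change(timelines))
```

With the real data, `toy` would be replaced by `spinos_official`. Whether a change of label counts as a meaningful opinion shift depends on the annotation scheme described in the paper, so this is only a starting point for exploration.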
You can construct the context with:

    def get_thread_df(post_id):
        """
        Given a post_id in spinos_official, return the thread as a DataFrame
        in order (parents -> post -> children), with all columns.
        Looks in thread_posts first, then spinos_official for missing posts.
        """
        if post_id not in spinos_official.index:
            raise ValueError(f"{post_id} not in spinos_official")

        # Get the IDs for the thread
        parent_ids = spinos_official.loc[post_id, 'parent_ids']
        children_ids = spinos_official.loc[post_id, 'children_ids_limited']

        # Full thread order
        thread_ids = parent_ids + [post_id] + children_ids

        rows = []
        for tid in thread_ids:
            if tid in thread_posts.index:
                row = thread_posts.loc[tid]
            elif tid in spinos_official.index:
                # Take only the relevant columns
                row = spinos_official.loc[tid, ['parent_id', 'author_id', 'title', 'content', 'topic', 'subreddit', 'timestamp']]
            else:
                print(f"post {tid} not found")
                continue  # skip if missing everywhere
            rows.append(row)

        # Combine into a DataFrame, keeping the thread order in the index
        thread_df = pd.DataFrame(rows)
        thread_df.index = [tid for tid in thread_ids if tid in thread_df.index]
        return thread_df

    test_post_id = spinos_official.index[0]
    thread_df = get_thread_df(test_post_id)
    print(test_post_id)
    thread_df

Footnotes
[1] Fun fact: SPINOS in Greek (σπίνος) means chaffinch, which is our logo. Interestingly, this name brings together some of the authors of the paper, since the first author is Greek, the second author is an aspiring birdwatcher, and the third author's surname means "bird" in German.
