To use this, click the download as .zip button to start from a local computer, or skip down to the github example to fork this and get started that way.
The unzipped folder contains all of the files you need to compile a website in R Markdown. This should all work fine if you have the latest version of R and R-studio installed.
-
My Lab Journal
-
This is a template example for lab journaling. Students in Matt Crump’s Human Cognition and Performance Lab will use this template to learn R, and other things. Students can replace this text with more fun things. People not in my lab can use this too.
-
How to use
+
Hi! I am Utsab, an entrepreneur and advanced analytics professional who uses statistics and cutting-edge technology to turn raw data into actionable information. I have over 8 years of experience with inferential statistics, time series analysis, and statistical computing with R. I am also interested in development in Python.
+BACKGROUND
+
+
I have always been a numbers person, with exceptional mathematics and computer skills. I have spent the past 5 years analyzing complex business requirements, compiling market and trend data, and designing enterprise-level solutions to accelerate efficiency and revenue growth. I am fluent in several data management systems and software, including Excel, Microsoft SQL Server, Python, R, and Power BI. Statistical significance, A/B testing, and data-driven optimization are some of my areas of expertise.
+
As much as I’m into data manipulation, it’s the analysis of data that really gets me going. I enjoy exploring the relationships between data and translating those data into stories. In the age of big data, these stories become actionable solutions and strategies for businesses, and I take pride in my ability to make data accessible to both executive decision-makers and frontline sales staff.
+
In my last role, as a business intelligence analyst for an online travel agency, I worked with cross-functional teams to structure problems, identify appropriate data sources, extract data and develop integrated information delivery solutions. I was leading the analysis on pricing opportunities, customer behavior, and competitive positioning to build a long-term pricing strategy for the firm.
+
On a personal level, I am detail-oriented, organized, and precise in my work. I have strong communication skills with a knack for clear and illuminating presentation. I’m comfortable on my own facing the numbers, but I really enjoy being part of a motivated team of smart people.
+
I have completed my graduate program in Predictive Analytics, and I hold a bachelor's degree in computer information systems with a business concentration.
+
Besides spreadsheets and charts, I am passionate about nature and mountains, and live an active lifestyle.
+
+
+PROJECTS
+
-fork the repo for this website and follow instructions on read me to get set up. https://github.com/CrumpLab/LabJournalWebsite
-Blog/journal what you are doing in R, by editing the Journal.Rmd. See the Journal page for an example of what to do to get started learning R.
-See the links page for lots of helpful links on learning R.
-Change everything to make it your own.
-
+
Study of Interaction Patterns in Primary School Children Network
+
Dating Compatibility and Chemistry: Analysis of a Speed-Dating Experiment
+
Risk Assessment of Suicide using Machine Learning
+
-
-
+
diff --git a/docs/risk.html b/docs/risk.html
new file mode 100644
index 000000000..0585902e9
--- /dev/null
+++ b/docs/risk.html
@@ -0,0 +1,2684 @@
+
+
+
+
+
+
+
+
+
+
Risk Assessment of Suicide using Machine Learning
+
Biswajit Dhar
+
Dolores Ke Ding
+
David Marks
+
Roya Pashmineh
+
Utsab Shrestha
+
DePaul University, CSC 672
+
+
+
+
+
Table of Contents

1. Abstract
2. Introduction
3. Related Work
   3.1 Word2Vec Topic Modeling
   3.2 Human Annotation
   3.3 Social Media Data
4. Methodology
   4.1 Data Gathering and Pre-processing
   4.2 Data Sampling
   4.3 Data Labeling
   4.4 Data Classification
5. Results
   5.1 Data Acquisition and Cleansing
   5.2 Data Sampling
   5.3 Data Labeling - Human Annotation
   5.4 Effectiveness of Machine Classifiers
6. Discussion
7. Conclusion
8. Future Work
9. References
10. Appendix
+
+
+
+
+
1. Abstract
+
Suicide is a global public health problem that is responsible for an alarming number of
+
deaths every year worldwide. Suicide risk is difficult to detect because of the heterogeneous
+
characteristics of the individuals. Researchers in multiple disciplines seek to find the reasons for
+
the increase in suicides. Some of the studies show a direct link between the increase in suicide
+
rates and the popularity of smartphones and social media. The purpose of this study is
+
to determine if machine learning techniques have the capacity to assess the risk of suicide when
+
applied to informal online posts regarding suicide.
+
With the popularity of the world wide web and social media, people suffering from
+
suicide ideation have resorted to seeking help online rather than consulting professionals and
+
risking social stigmatization. This gives researchers an opportunity to analyze the vast amount of
+
social media data to see what drives and motivates people to suicidal thoughts. Therefore,
+
analyzing and assessing social media posts to classify suicidal ideation and possibly predict
+
suicidal attempts is seen as a vital area of research that can help alleviate the suicide epidemic. We
+
explored Reddit data to design, develop, and evaluate machine learning algorithms that can unlock
+
valuable information from Reddit posts to detect and predict suicidal thoughts and ideation.
+
Our informal topic discovery evaluation reveals that machine learning techniques can
+
detect the risk of suicide from informal online posts. This paper has demonstrated that at least
+
two models perform substantially better than random chance at identifying the posts that present
+
the most serious and immediate risk of suicide. Naive Bayes performed the best, and the Decision
+
Tree also performed better than random chance.
+
2. Introduction
+
Suicide, the act of intentionally taking one's own life, is among the leading causes of
+
death worldwide. The American Foundation for Suicide Prevention reports that there are 123 suicides
+
on average every day and that the suicide rate in the USA has been increasing year over year and has
+
recently surged to a 30-year high [1][2]. It is the 10th leading cause of mortality in the United
+
States and costs an estimated 44.6 billion dollars per year [3]. According to a study by the National
+
Center for Health Statistics, the age-adjusted suicide rate in the United States increased a
+
staggering 24 percent from 1999 to 2014. Increases were seen in every age group except for
+
those 75 and above and in every racial and gender category except for black men [4]. This
+
suicidal epidemic has a profound impact on individuals, families, and communities across the
+
United States and the world.
+
Suicide risk is difficult to detect because of the heterogeneous characteristics of
+
individuals at potential risk of harming themselves. Most individuals having these risk factors do
+
not attempt suicide, and others without these conditions sometimes do. Therefore, there is a
+
danger in considering only individuals with certain conditions or experiences as being at risk for
+
suicide. Researchers in multiple disciplines seek to find the reasons for the increase in suicides.
+
Many studies focused on various stages of suicide process, from suicidal thoughts and ideations
+
through planning to attempt. Other studies have focused on identifying the risk of suicide in
+
various stages of the suicide process and find underlying causes of suicide. While many of the
+
+
+
+
+
causes of suicide stem from loneliness or relationship issues, other reasons for suicidal thoughts
+
include lack of support from friends and family, financial issues or simply lack of self-esteem.
+
Not surprisingly, some of the studies show that suicide rates among teenagers have risen along
+
with their ownership of smartphones and use of social media, suggesting a disturbing link
+
between technology and suicide rate.
+
With wide acceptance and popularity of social media, it has been found that people
+
suffering from suicidal thoughts tend to share their feelings on social media, such as Facebook,
+
Reddit, Twitter etc. Beyond sharing these thoughts about suicide, a new phenomenon has begun
+
where people live stream their suicide on platforms like Facebook and YouTube. Studies have
+
shown that people are more likely to seek support from non-professional resources like through
+
anonymous social media posts, rather than seek help from professionals due to the fear of social
+
stigma. Studies have shown that there is a positive correlation between suicide rates and volume
+
of social media posts related to suicide ideation and intent. This gives researchers an
+
opportunity to analyze the vast amount of social media data to see what drives and motivates
+
people to suicidal thoughts. Therefore, analyzing and assessing social media posts to classify
+
suicidal ideation and possibly predict suicidal attempts is seen as a vital area of research that can
+
help alleviate this epidemic. The goal of this project is to assess suicidal risk based on Reddit
+
posts using machine learning techniques.
+
With the combination of informal social media language and complexity of suicidal
+
language that includes emotive content and possible signs of depression or other mental illnesses,
+
understanding the suicide communication and building a classifier to assess suicide risk can be a
+
challenging task. Our study aims to contribute to the understanding of the topic of suicide in social
+
media by i) creating a high quality human-annotated suicide related data set that will help in
+
suicide study and suicide prevention in social media, and ii) using the labeled data set to build an
+
automated classification model that can assess the suicide risk of a given social media post and
+
categorize it by which stage of the suicide process the post is at. Our second aim depends on the
+
quality of the data set we will be able to build. We are aiming for a data set that captures most of
+
the features and characteristics of suicide communication and, most importantly, the
+
data need to be accurately labeled. To minimize the error in labeling, we came up with
+
definitions of the labels modified for online informal language that will be used as a guideline for
+
the human annotation task. Finally, we will be using Natural-Language processing algorithms to
+
learn the features of this data set and experiment with text categorization algorithms to build a
+
classifier that performs well with the data.
+
The data set for this project consists of 94,671 Reddit posts. The whole work for the
+
project is broken down into four major phases, data sampling, data labeling, data classification
+
and evaluation. Given the resources of the team, it was decided that a small representative subset
+
of this data should be selected and manually labeled. The team decided to cluster the posts into
+
50 clusters using k-means clustering and then take the 20 posts closest to the cluster centroid and
+
20 posts furthest from the cluster centroid in order to obtain data that are representative of all
+
posts. To label the posts, two of the team members labeled the same data set individually and
+
another team member made a third judgment on the posts where there were disagreements. Features for
+
the classification model were created based on the word clusters created in the previous study on the
+
same data [5]. The Project team chose Neural Network (NN), Naive Bayes and Support Vector
+
+
+
+
+
Machine (SVM) to develop a classification model for the data. So far, in our experimental results
+
all three classification models showed more than 50 percent accuracy for risk level.
+
In subsequent sections, this report is organized to explain and describe related works,
+
methodology, results, and discussion. We conclude the report by summarizing our findings and
+
recommendations for future work.
+
3. Related Work
+
3.1 Word2Vec Topic Modeling
+
Word2Vec is a popular topic modeling algorithm used to produce word
+
embeddings. There have been previous attempts by researchers to automatically extract informal latent
+
recurring topics from online social media using the Word2Vec language model. In a previous study
+
on the same Reddit data being used for this project, Grant et al. attempted to do it by evaluating
+
the latent topics and then extensively comparing them to twelve risk factors that are proposed by
+
domain experts [5]. They computationally generated Word2vec language model using word
+
embeddings. The skip-gram model was used in their research which learns vector representations
+
of words by predicting neighboring words in a text. After that, the k-means clustering technique
+
using Euclidean distance was leveraged to group the items. This produced clusters of topics. As
+
part of word representation, the results of the model were subjectively evaluated. The results
+
showed that the word embeddings were able to capture much semantic information from the
+
corpus relevant to the suicidal ideation topic.
+
In a different research study for PTSD, Grant et al. tried to extract informal topics using
+
Word2Vec language model to explore patterns and association between the topics [6]. That
+
study demonstrated that Word2Vec can be successfully employed to identify latent topics related
+
to mental health issues, specifically PTSD. They checked clusters which were most and least
+
frequently related to PTSD. Both clusters were semantically coherent which proved that the
+
clusters were grouping words correctly. From the DSM manual, eight criteria for diagnosing
+
PTSD were evaluated against relevant clusters. Common words from the cluster were compared
+
with the description of the criteria. The informal topics captured by the clusters related well with
+
the PTSD diagnosis criteria. In some cases, it was challenging to capture the criteria where
+
symptoms occurred over time or symptoms that are difficult to express explicitly. One
+
limitation of that study was that some criteria could not be defined with informal topics. It was found
+
that the Word2Vec algorithm could not capture the semantics of the use of time in a sentence, which
+
is very important for this research.
+
3.2 Human Annotation
+
Human annotation of data can be key in categorizing online posts since humans are better
+
than computers in understanding the semantics, subject, underlying context or ambiguity.
+
Researchers in the past have successfully used human annotation of Twitter data to classify suicidal
+
ideation. O’Dea et al. did a study on Twitter data to establish the feasibility of consistently
+
detecting the level of concern for individuals' Twitter posts [7]. It aimed to design and implement
+
an automated computer classifier that could replicate the accuracy of the human coders. This
+
+
+
+
+
study aimed to examine whether the level of concern for a suicide-related post on Twitter could
+
be determined based solely on the content of the post. They manually coded each Twitter post
+
with one of three levels of severity: strongly concerning, possibly concerning, and safe to
+
ignore. Then they used various machine learning algorithms to develop a text classifier that could
+
automatically distinguish tweets into the three categories of concern. Results from this study were
+
promising, with an overall agreement rate of 76 percent. Although the agreement rates among human
+
coders and machine-learned classifier were satisfactory, concordance was by no means perfect.
+
Besides, human coding was difficult and time-consuming.
+
In another study, Burnap, et al. also used human annotation to identify features of
+
suicidal ideation that would be useful for machine learning [8]. In this study, researchers first
+
created a set of benchmark experimental results for machine learning approaches to classify
+
suicide ideation. Then they compared the popular baseline and ensemble classifier for suicide
+
ideation to see which classifier performs well. And finally, this study tried to develop a multi-
+
classifier that can distinguish between suicide related topics such as suicide ideation, awareness,
+
campaign, report, or low risk reference to suicide. In their study, the three baseline classifiers
+
performed the same for all three feature sets across all classes. The results of their study suggested that
+
an ensemble of multiple base classifiers can perform better than a single base classifier for the multi-class
+
classification of suicidal communication and ideation in informal text such as social media posts.
+
One of the challenges they faced during the experiment was differentiating between suicide
+
ideation and flippant references to suicide due to the sarcasm and irony used in this context.
+
3.3 Social Media Data
+
Researchers have attempted to use social media data in various domains to predict the intent of
+
the post to better analyze a situation. For example, in one such study Chen et al. from Purdue
+
University wanted to analyze students' informal Twitter conversations to understand issues and
+
problems students encounter in their learning experiences [9]. The research goals of this study
+
were to demonstrate a workflow of social media data sense-making for educational purposes,
+
integrating both qualitative analysis and large-scale data mining techniques. As we saw in some
+
other research papers mentioned before, in this study also categories were manually developed
+
by the researchers. Then, three researchers discussed and collapsed the initial categories into five
+
prominent themes, such as, heavy study load, lack of social engagement, negative emotion, sleep
+
problems, and diversity issues. Then a multi-label classifier was developed to classify tweets
+
based on the categories developed in the content analysis stage. One of the major limitations of
+
their study was that it was only able to analyze the students who actively expressed their
+
feelings on Twitter.
+
In another study, Anshary et al. studied twitter data for target marketing [10]. The
+
purpose of this study was to classify followers of a company's Twitter account as the target market
+
and then extract the features needed for target market classification based on their tweets. Each
+
tweet was labeled according to its category (i.e. computers, mobile phones or cameras). The
+
selection of these labels was based on the types of goods sold by the company. Next, a model
+
was built to classify the target market using ensemble methods, followed by testing of the
+
classification results. Lastly, an evaluation was conducted to find out whether implementation of
+
+
+
+
+
the ensemble methods improved the accuracy of the learning algorithms, especially in the case of
+
tweet-based target market classification. One of the limitations of this study was that Twitter posts
+
were classified based on keyword frequency, which left open the possibility of not understanding
+
a post in its proper context. Therefore, a post about a product could be incorrectly classified.
+
In this project, we tried some of the previous efforts attempted by other researchers. For
+
example, when we manually labeled the data to determine the concerning level of the post, we
+
came up with three levels such as, strongly concerning, concerning and safe to ignore based on a
+
previously done similar research [7]. We used criteria mentioned in official documentation from
+
NIH (National Institute of Mental Health) to establish ground truth for the labeling classes. We
+
also used cluster features from another study done on same Reddit data set [5] by using those
+
clusters as features for our posts.
+
4. Methodology
+
Overview
+
The research effort was broken down into four major work phases, data sampling, data
+
labeling, model building and tuning, and model evaluation. Figure 1 shows the high-level
+
diagram of the methodology of this research.
+
Fig. 1: Methodology Overview
+
4.1. Data Gathering and Pre-processing
+
Reddit describes itself as “a source for what's new and popular on the web. Users provide
+
the content and decide, through voting, what's good and what's junk.” [11] Items are categorized
+
into subreddits denoted with the
+
‘r/’ prefix. The data set used for research consists of
+
approximately 95,000 posts on the subReddit r/SuicideWatch between 2008 and 2016.
+
For this study, we focus on ‘title’ and the ‘selftext’ fields, since these fields together
+
capture the thoughts expressed by the author of the post. Often, the title represents the first
+
sentence in the overall post. Due to this, the team decided to combine the title and the selftext
+
fields into one single field.
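As a rough illustration of this pre-processing step (the file name and the use of pandas are assumptions; the field names follow the data description later in this report), the combination could look like:

```python
import pandas as pd

# Load the raw r/SuicideWatch posts (the CSV file name is illustrative).
posts = pd.read_csv("suicidewatch_posts.csv")

# Combine 'title' and 'selftext' into a single text field, treating missing
# values as empty strings so that no post is dropped.
posts["text"] = posts["title"].fillna("") + " " + posts["selftext"].fillna("")
```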
+
+
+
+
+
4.2 Data Sampling
+
This data has no objective risk assessment, so the research team would have to assess the
+
risk of suicide in each post to complete the study. Given the resources of the team, it was
+
decided that a small representative subset of this data should be selected and manually
+
labeled. Previous studies have demonstrated that approximately 2000 posts would be sufficient
+
for a data-driven model to be derived. In “Detecting suicidality on Twitter,” O’Dea, Bridianne, et
+
al. were able to achieve 80 percent accuracy when assessing the risk of suicide as ‘strongly
+
concerning’ based on Twitter posts. Based on these results, the team decided to sample
+
approximately 2,000 posts from the r/SuicideWatch data. The process is represented below.
+
Fig. 2: Data Sampling Process Flow
+
Instead of taking a random sample of 2000 posts from the overall data set, the research
+
team decided to first cluster the posts into 50 clusters and then take the 20 posts closest to the
+
cluster centroid and 20 posts furthest from the cluster centroid. By doing this the team hoped to
+
obtain posts that were more representative of all posts.
+
The sampling process consists of four main steps: word tokenization, clustering, outlier
+
removal, and sample extraction.
+
Word Tokenization
+
The first step in the sampling process was to convert the text of the posts into a vector space of
+
word frequencies. During this process, the words were stemmed using the Python package
+
+
+
+
+
nltk.stem.snowball.EnglishStemmer. This reduces the number of words in the vector space and
+
increases the chances that the same word with different suffixes is represented consistently
+
within the vector space. Finally, the word frequencies were computed for each post within the
+
vector space. This was performed using the simple word frequencies in each post resulting in the
+
final vector space model. The vectorization was applied using the Python package
+
sklearn.feature_extraction.text.CountVectorizer. The CountVectorizer package has a facility to
+
apply an inline word stopping function. Word stopping removes common words that do not
+
contribute significant information to the word vector space model; words like ‘a’, ‘an’, and ‘the’.
+
This was applied by setting the stop_words parameter to ‘english’.
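A minimal sketch of this step, assuming the combined post text is available in a list of strings named `texts` (a placeholder); the exact way the team wired the stemmer into the vectorizer is not stated, so this is one reasonable arrangement:

```python
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = EnglishStemmer()
# Base analyzer handles tokenization and English stop-word removal.
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(text):
    # Tokenize, drop stop words, then stem each remaining token.
    return [stemmer.stem(token) for token in base_analyzer(text)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
word_counts = vectorizer.fit_transform(texts)  # sparse matrix: posts x stemmed terms
```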
+
Clustering
+
K-means clustering was performed on the word vector space model using the
+
sklearn.cluster.KMeans package in Python. With the aim of identifying 2,000 samples, it was
+
decided to take 40 samples from 50 clusters. Clustering was performed with random starting
+
centroids. The k-means clustering assigns a numeric cluster label, 0 to 49, to each post.
+
Additionally, the clustering algorithm produced a vector representing the centroid of each
+
cluster.
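A sketch of the clustering step under those settings (50 clusters, random starting centroids), reusing the `word_counts` matrix from the tokenization sketch above; the random seed is an arbitrary choice for repeatability:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=50, init="random", random_state=42)
cluster_labels = kmeans.fit_predict(word_counts)  # numeric label 0-49 for each post
centroids = kmeans.cluster_centers_               # centroid vector for each cluster
```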
+
Outlier Removal
+
Clusters that contained fewer than 40 posts were evaluated to see if they were outliers. If a
+
cluster held fewer than 35 posts, it was determined to be an outlier and removed from
+
consideration in future iterations. Its posts were, however, kept in the sample file for the current
+
iteration.
+
Sample Extraction
+
The goal of identifying the samples is to select samples close to the centroid, which most
+
closely describe the cluster, and samples from the outer edges of the cluster. The samples from
+
the outer edges are still more like their own centroid than a neighboring cluster's centroid, but
+
they are least like the centroid. To identify the desired samples, each post was compared to the
+
cluster centroid. The comparison was completed using cosine similarity.
+
Subsequently, the 20 posts with the highest cosine similarity score, and the 20 posts with
+
the lowest cosine similarity were exported as a representative data set. Since some clusters had
+
less than 40 posts, duplicate records were removed.
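Roughly, the extraction of the 20 closest and 20 furthest posts per cluster could be done as below, reusing the `word_counts`, `cluster_labels`, `centroids`, and `kmeans` objects from the earlier sketches:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sample_indices = []
for cluster_id in range(kmeans.n_clusters):
    members = np.where(cluster_labels == cluster_id)[0]
    # Cosine similarity of every member post to its cluster centroid.
    sims = cosine_similarity(word_counts[members],
                             centroids[cluster_id].reshape(1, -1)).ravel()
    order = np.argsort(sims)
    # 20 posts furthest from the centroid and 20 closest to it.
    picked = np.concatenate([members[order[:20]], members[order[-20:]]])
    sample_indices.extend(picked.tolist())

# Clusters with fewer than 40 posts produce duplicates, so de-duplicate the sample.
sample_indices = sorted(set(sample_indices))
```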
+
4.3 Data Labeling
+
When approaching this topic, there is a significant hurdle to overcome in that the data
+
source is a set of posts on a public website where people anonymously express thoughts of
+
suicide. There is no way to determine whether the author of the post subsequently attempted or
+
succeeded in committing suicide. Additionally, there is significant disparity in what each
+
individual member is willing to share and in what everyone is hoping to accomplish by posting
+
on the site. Some are genuinely seeking help to prevent what they see as a possible or probable
+
suicide attempt, some are posting what is a suicide note as they are committed to ending their
+
life, and yet others are confused, lonely, venting, or struggling with some form of mental health
+
+
+
+
+
issue like depression. In addition to the variability associated with the posters, there is inherent
+
variability in the way the reader interprets the posts. Each annotator must make a judgment call
+
as to whether the poster is forthright in their desire to attempt or avoid suicide. Given these
+
issues and the objective of using supervised learning techniques to assess the risk of suicide, the
+
research team had to first devise a way to qualitatively and consistently assign a risk to each post.
+
The team used a rubric to try to assess the risk and stage consistently, but the labeling of each
+
post is still a subjective process. The process is illustrated below.
+
Fig. 3: Data Labeling Process Flow
+
Guidelines
+
The team developed definitions and guides for each label to help the human annotators label
+
the posts with more confidence. These guidelines were developed from three different sources
+
mentioned in Fig. 3. One is the list of possible signs and symptoms that someone is thinking about
+
suicide provided by NIH [12].
+
+
+
+
+
Talking about wanting to die or wanting to kill themselves
+
Talking about feeling empty, hopeless, or having no reason to live
+
Planning or looking for a way to kill themselves, such as searching online, stockpiling pills, or
+
buying a gun
+
Talking about great guilt or shame
+
Talking about feeling trapped or feeling that there are no solutions
+
Feeling unbearable pain (emotional pain or physical pain)
+
Talking about being a burden to others
+
Using alcohol or drugs more often
+
Acting anxious or agitated
+
Withdrawing from family and friends
+
Changing eating and/or sleeping habits
+
Showing rage or talking about seeking revenge
+
Taking great risks that could lead to death, such as driving extremely fast
+
Talking or thinking about death often
+
Displaying extreme mood swings, suddenly changing from very sad to very calm or happy
+
Giving away important possessions
+
Saying goodbye to friends and family
+
Putting affairs in order, making a will
+
Fig. 4: Signs and Symptoms of Suicide provided by NIH
+
The guidelines were used to set definitions for the two sets of labels we chose that assess
+
suicide risk and suicide stage. The labels were chosen from similar previous studies of suicide
+
communication in social media. The risk labels come from [7]. We made some minor changes to
+
the label definitions to better fit the Reddit data we have, since the original levels were used to
+
define Twitter data.
+
+
+
+
+
Risk Label
+
Table 1: Table showing various risk label
+
The other set of labels comes from different research done by Grant et al. on the same
+
Reddit data set [5]. That study does not define the categories; therefore, the team had to come
+
up with definitions based on personal interpretation and literature review. The original label
+
consists of three categories: suicide ideation, suicide plan, suicide attempt. After some initial
+
human annotation, the team found that these three categories failed to cover many posts in the
+
dataset, especially those posts which talk about users’ depressive thoughts, venting, or simply
+
asking for help with no sign of suicide, or when users clearly state that they don’t want to “die”
+
or “kill self”. Therefore, the team decided to add another category called “Depression” for this
+
type of posts. Noticeably, this category is one of the dominant categories among this set of
+
labels.
+
+
+
+
+
Stage label
+
Table 2: Table showing various stage label
+
After the aforementioned steps, the definition of labels was finalized.
+
The first set of labels has four categories:
+
· Strongly Concerning
+
· Concerning
+
· Safe to Ignore
+
· Irrelevant
+
The second set of labels has five categories:
+
· Depression
+
· Suicide Ideation
+
· Suicide Plan
+
· Suicide Attempt
+
· Irrelevant
+
+
+
+
+
Interface
+
A third-party web-based Qualtrics survey interface was used for the human annotation task.
+
With the Qualtrics survey system we were able to feed the posts and the choices for labels and
+
distribute it via email to the annotators to record their responses. Posts were randomly selected
+
from the sampled data set to feed the survey tool. This interface displayed the posts as survey
+
type questions for the human annotators to choose between the different labels. The tool was
+
easy and efficient to use as the annotation tasks were easily trackable, the responses were saved
+
in real time, and responses could be downloaded in various formats for the team to analyze.
+
Fig. 5: Web-based Qualtrics Survey Interface [21]
+
Lack of prior training and expertise on mental illness made the human annotation task a
+
huge challenge. Based on the literature research and NIH guidelines, we were able to use the
+
suicide signs and symptoms to build clear definitions for each label that were used as a guide for
+
the human annotators. Due to the complexity and heterogeneous nature of the informal posts,
+
there were several rounds of discussions during annotation process to improve the quality of
+
labeling. To validate and improve the quality of labeling, we had two annotators label each post.
+
If there was disagreement between the labels by the two annotators, the final judgment was made
+
+
+
+
+
by the third person. This allowed majority vote on the posts that had disagreements. The labeling
+
process is illustrated in the chart below to show how a post gets annotated by two human
+
annotators and moves to final judgement before the label gets finalized.
+
Fig. 6: Human Annotation Flow
+
4.4 Data Classification
+
Feature Creation
+
In order to create the features used in the classification model, the team used the word
+
clusters from the previous study on informal topics relating to suicide. The primary features used
+
for classification are contained in a vector space model where the words identified in the
+
informal topic clusters are identified and counted in the posts. For example, a cluster in the
+
informal topics study included the words; 'suffer', 'add', 'severe', ‘developed', 'mental_illness',
+
'related', 'condition', 'causes'. The corresponding feature used for classification is set as the
+
frequency of these words within each post. For example, assuming the words above are from
+
informal topics cluster j, and a post contains the text “My severe acne causes me to suffer severe
+
ridicule daily”. Since ‘severe’ occurs twice, and ‘causes’, and ‘suffer’ occur once, feature j
+
would have a value of 4 for the post.
+
The informal topics study created 100 clusters, so the primary vector space model has
+
100 features. There is one additional feature representing the number of words in the post. This
+
was calculated because the team wanted to know if the length of a post contributed to a
+
classifier’s ability to accurately assess the risk or stage of a post. With the addition of the word
+
count feature, the vector space model has 101 features.
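As a simplified sketch of this feature construction (the `topic_clusters` variable, a list of 100 word lists, is a placeholder for the clusters from the earlier study, and the tokenization here is deliberately naive):

```python
import numpy as np

def build_features(post_text, topic_clusters):
    """Return 101 features: one count per informal-topic cluster plus the word count."""
    tokens = post_text.lower().split()
    topic_counts = [sum(tokens.count(word) for word in cluster)
                    for cluster in topic_clusters]
    return np.array(topic_counts + [len(tokens)])

# Example from the text: if cluster j contains 'suffer', 'severe', 'causes', ...,
# the post "My severe acne causes me to suffer severe ridicule daily" scores 4 on feature j.
```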
+
Model Creation Overview
+
After labeling was completed on 1025 posts, all posts that had been labeled as Irrelevant
+
on either set of labels, risk or stage, were removed from use within the models. The remaining
+
+
+
+
+
971 posts were split 70/30 into a training data set and a test data set. Based on the objectives of
+
the research, six different supervised learning techniques were identified for evaluation: Naive
+
Bayes, K Nearest Neighbors, Decision Tree, Support Vector Machines, Neural Networks, and an
+
Ensemble classifier.
+
Fig. 7: Data Classification Overview
+
Machine Classification Method
+
As discussed earlier, each post was decomposed into a vector space model consisting of
+
100 topic features. Each topic consisted of a set of words that have previously been demonstrated
+
to be similar in meaning. The final feature value is the count of the relevant topic words in each
+
post. Additionally, the word count was added as a feature. This means the vector space model
+
used to train the models contained 101 features.
+
Each model was trained on the same training data set and evaluated on the same test data
+
set. Where reasonable a grid search was performed to identify optimal parameters, but due to the
+
unbalanced nature of the data set and the grid search seeking to optimize accuracy, many of the
+
resulting parameter values were ineffective when considering the minority class sensitivity.
+
Most models were refined using empirical methods within the resource constraints of the team.
+
As will be discussed later in this paper, the three best techniques were Naive Bayes, Decision
+
Trees, and an Ensemble Classifier.
+
Model - Naive Bayes
+
Naïve Bayes is a probabilistic classifier that is commonly used in applications of text
+
categorization. The data points are represented as vectors of feature values, and naïve Bayes
+
assumes the value of a feature is independent of other features. This is based on Bayes’ theorem
+
+
+
+
+
which describes the relationship between the probability of hypothesis before getting the
+
evidence P(h) and the probability of the hypothesis after getting the evidence P(h|e) by Bayes' rule: P(h|e) = P(e|h) P(h) / P(e), where P(e|h) is the likelihood of the evidence given the hypothesis and P(e) is the probability of the evidence.
+
The NB classifier uses the training data to estimate the parameters of the probability distribution, which is
+
the right-hand side of the above equation. During testing, the classifier uses these estimations to
+
compute the posterior probability for a given sample. The hypothesis with the largest posterior
+
probability is used to classify the test sample. [14]
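A minimal training sketch for this classifier on the 101-feature vectors is shown below; the multinomial variant is one reasonable choice for count features, and `labeled_texts` and `risk_labels` are placeholders for the annotated sample:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# X: one row of 101 count features per labeled post (see the feature sketch above),
# y: the corresponding human-assigned labels.
X = np.vstack([build_features(text, topic_clusters) for text in labeled_texts])
y = np.array(risk_labels)

# 70/30 split into training and test data, as described in this report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
```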
+
Model - Decision Trees
+
A decision tree is a flowchart-like tree structure, where each internal node denotes a test
+
on an attribute, each branch represents an outcome of the test, and each leaf node or terminal
+
node is a class label. The top-most node in a tree is the root node. Given a tuple, X, for which the
+
associated class label is unknown, the attribute values of the tuple are tested against the decision
+
tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
+
Decision trees can easily be converted to classification rules. Decision tree models can be used
+
for text classification from labeled training text data. The capability of a decision tree to learn
+
disjunctive expressions and its robustness to noise makes it a good model for text classification.
+
Model - Ensemble Classifier
+
Ensemble classifiers are a group of machine learning techniques that seek to exploit
+
different aspects of the learning process. Ensemble methods typically consist of several models
+
whose predictions are used voting style to classify a record based on majority rules. Each model
+
may have an equal vote or a weighted vote. Ensembles may leverage the same underlying
+
models that exploit different attributes of the data, others may leverage the same data attributes
+
while employing several different modeling methods. In some cases, an ensemble of weak
+
classifiers can outperform more sophisticated modeling techniques.
+
For this research, the best ensemble classifier consisted of an ensemble of three different
+
models predicting on the same data. The underlying models were the same models as the team
+
tested individually; Naive Bayes, Decision Tree, and K Nearest Neighbors. Each model
+
contributed an equal vote.
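Under those assumptions (three base models, equal votes, majority rule), the ensemble could be expressed with scikit-learn's VotingClassifier; this is a sketch reusing the training split from the Naive Bayes sketch, not necessarily the team's exact configuration:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hard voting: each of the three base models casts one equal vote per post.
ensemble = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("dt", DecisionTreeClassifier(random_state=42)),
                ("knn", KNeighborsClassifier())],
    voting="hard")
ensemble.fit(X_train, y_train)
print("Test accuracy:", ensemble.score(X_test, y_test))
```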
+
5. Results
+
5.1 Data Acquisition and Cleansing
+
After data cleansing, the data set comprised 94,671 posts from 46,077 unique users. These
+
posts used altogether 20,716,133 words, of which 71,353 were unique. The ‘title’ and ‘selftext’ fields
+
were combined, and the features of the posts were derived from the words used in these fields. The ‘from’ and
+
‘from_id’ columns were empty, so they were removed from the dataset.
+
+
+
+
+
Variable Name | Data Type | Definition & Comments
title | Text | Title of the posts
created_utc | Numeric | UTC Time Zone (Coordinated Universal Time)
ups | Numeric | Number of users that liked this post
downs | Numeric | Number of users that don't like this post
num_comments | Numeric | Number of comments for the post
name | Text | A fullname which indicates a combination of a thing's type (t3_) and its unique ID, forming a compact encoding of a globally unique ID on reddit
id | Text | Unique post Id
from | Float | No data available (all blanks)
from_id | Float | No data available (all blanks)
selftext | Text | Body of posts
subreddit | Text | Indicates the Reddit sub-site, which is "SuicideWatch"
score | Numeric | Reddit voting system that evaluates the popularity of a post
author | Text | It seems to not contain valid information
url | Text | Link to post
permalink | Text | URL link to a single comment in a thread. Example: url: /r/SuicideWatch/comments/3fcrkb/i_have_no_idea_what_to_do/ ; permalink: http://www.reddit.com/r/SuicideWatch/comments/3fcrkb/i_have_no_idea_what_to_do/
+
Table 3: Definition of Attributes
+
+
+
+
+
5.2 Data Sampling
+
Word Tokenization
+
After acquiring and cleansing the data, the words were tokenized in an effort to improve
+
the results of clustering the posts. The posts contained 76,674 unique words before tokenization.
+
After applying the stopping function, the number of words dropped to 76,362, just a minor
+
reduction. After applying the stemming function, the number of unique words dropped to
+
53,024. This represents a 31 percent reduction in the unique words.
+
Clustering and Outlier Removal
+
As discussed, the clustering function was executed to generate 50 clusters. The KMeans
+
clustering function was executed four times in total and outlier removal was performed after each
+
round. Clusters that did not contain 40 posts were evaluated to determine if the cluster
+
represented outliers, and if so, the outlier posts were removed, and the clustering repeated after
+
taking samples from each cluster. The objective of this exercise was to ensure broad
+
representation of the posts in the original data set while maximizing the number of posts in the
+
working data set. Among the clusters with fewer than 40 posts, some were obscure with very low
+
population, while others were nearly filled with what appeared to be relevant posts. Due to this,
+
the decision was made to keep the posts in clusters that contained close to 40 posts but remove
+
the nearly empty clusters as outliers.
+
Sample Extraction
+
The initial clustering and sampling exercise produced 1447 sample posts; significantly
+
fewer than the desired 2,000 posts. Each of the first (1763) and second (1874) iterations of outlier
+
removal improved the number of samples, but the third iteration (1852) saw the number of
+
samples fall from the peak. The following chart shows how the removal of outliers affected the
+
number of samples after each run.
+
+
+
+
+
Fig. 8: Number of Posts in each iteration after Outlier removal
+
Since the third iteration contained fewer posts than the second iteration, and due to resource
+
constraints, the decision was made to use the data sampled during the second iteration.
+
+
+
+
+
5.3 Data Labeling - Human annotation
+
Fig. 9: Distribution of stage label
+
Fig. 10: Distribution of risk label
+
In the distribution of risk (Fig. 9), it is clear that the concerning category takes up most of the
+
proportion. It’s also consistent with the label definition of concerning being the default category.
+
While in the distribution of stage (Fig. 10), depression and suicide ideation take up most of the
+
proportion, followed by suicide plan and suicide attempt in descending order. This implies that most of the
+
users on this subreddit are here to express their issues and depressed feelings. Some of them may
+
show the signs of suicide but they mostly want help to keep going. Social media platforms are
+
working as an outlet from their real daily life.
+
The shade of blue of the bars indicates the average word count in each category. Except for
+
irrelevant, most of the categories do not differ much in terms of word count, as far as
+
we can tell from the bar graphs.
+
+
+
+
+
Fig. 11: Figure showing distribution of risk within different stages
+
Table 4: Confusion matrix between risk and stage
+
Fig. 11 shows the distribution of risk categories within each stage. Depression posts are only safe to
+
ignore or concerning, and concerning is almost twice the amount of safe to ignore. No
+
depression post is strongly concerning. Considering that the depression category here stands for
+
depressive feelings without suicide ideation, this is consistent with the distribution. A few
+
suicide ideation posts are annotated as safe to ignore, because some users may talk about their
+
thoughts of committing suicide a lot, but the content of the posts is mostly just venting about
+
issues that are less severe and commonly seen, such as a teenager complaining about parents or
+
relationships. Some of the suicide ideation posts are annotated as strongly concerning because
+
some users might not disclose their previous suicide attempt or reveal any of their suicide plan
+
+
+
22
+
but are actually in a dangerous state based on the wording and emotions shown in their posts. As
+
for suicide plan and attempt, most of them are strongly concerning, which is consistent with the
+
definition of the labels.
+
Table 4 is the confusion matrix between risk labels and stage labels. It shows the precise values
+
of the distribution of each category.
+
Table 5: Confusion matrix showing relation between two rounds of annotations on Risk
+
Risk category | Labeled Data (n=1025) | Category frequencies | Agreement count | Rates of agreement
Strongly Concerning | 209 | 20.4% | 97 | 46.4%
Concerning | 617 | 60.2% | 384 | 62.2%
Safe to Ignore | 147 | 14.3% | 50 | 34.0%
Irrelevant | 52 | 5.1% | 27 | 51.9%
Overall | 1025 | 100% | 558 | 54.4%
+
Table 6: Table showing category frequency and agreement rate for risk label
+
Table 7: Confusion matrix showing relation between two rounds of annotations on Stage
+
+
+
+
+
Stage category | Labeled Data (n=1025) | Category frequencies | Agreement count | Rates of agreement
Depression | 400 | 39.0% | 225 | 56.3%
Suicide Ideation | 385 | 37.6% | 202 | 52.5%
Suicide Plan | 109 | 10.6% | 46 | 42.2%
Suicide Attempt | 79 | 7.7% | 30 | 38.0%
Irrelevant | 52 | 5.1% | 27 | 51.9%
Overall | 1025 | 100% | 530 | 51.7%
+
Table 8: Table showing category frequency and agreement rate for stage label
+
Tables 5 and 7 are confusion matrices between the two rounds of annotations on 1025 posts for the two
+
sets of labels. They show that depression and suicide ideation can be confused, and that suicide plan and
+
attempt are hard to differentiate, which is also consistent with the annotating process.
+
Concerning as the default label takes up the most proportion among all categories, while
+
safe to ignore and strongly concerning have comparatively fewer elements. This could lead to a
+
mediocre performance of the classifiers when the distribution of categories is not balanced. It’s
+
the same situation with depression and suicide ideation.
+
Among all 1025 posts, 20.4% (n = 209/1025) were coded as ‘strongly concerning’,
+
60.2% (n = 617/1025) ‘concerning’, 14.3% (n = 147/1025) were coded as ‘safe to ignore’.
+
Overall, the rate of agreement on the risk label for the data set was 54.4% (n = 558/1025), 46.4%
+
(n = 97/209) for ‘strongly concerning’, 62.2% (n = 384/617) for ‘concerning’, and 34.0% (n =
+
50/147) agreement for ‘safe to ignore’.
+
For the stage label, among all 1025 posts, 39% (n = 400/1025) were coded as
+
‘depression’, 37.6% (n = 385/1025) ‘suicide ideation’, 10.6% (n = 109/1025) were coded as
+
‘suicide plan’, and 7.7% (n=79/1025) ‘suicide attempt’. Overall, the rate of agreement on the
+
stage label for the data set was 51.7% (n = 530/1025), 56.3% (n = 225/400) for ‘depression’,
+
52.5% (n = 202/385) for ‘suicide ideation’, 42.2% (n = 46/109) for ‘suicide plan’, and 38.0% (n
+
= 30/79) agreement for ‘suicide attempt’.
+
The agreement rates might not be at the best level, but we believe they are comparatively good
+
considering that all 5 annotators come from different backgrounds and hold different judgements.
+
+
+
+
+
5.4 Effectiveness of Machine Classifiers
+
1025 posts were labeled by the human annotators and 52 of those posts were marked
+
‘Irrelevant’. These ‘Irrelevant’ posts were removed before classification to reduce the noisy
+
data. After the initial run of the classification models and evaluation of the misclassification matrix, 2
+
‘Strongly Concerning’ posts that were deemed to be irrelevant were removed. The total number
+
of posts used in the final run of classification models was 971 (training: 679, testing: 292). For
+
the Risk Labels, only 17% of the total data set and 9% of the test data were labeled as 'Strongly
+
Concerning'. Similarly, for the Stage Labels, only 6% of the test data set was labeled as 'Suicide
+
Plan' and 11% was labeled as 'Suicide Attempt'. For the two sets of labels, we focus on
+
improving results for these class labels since they are the posts that carry the highest risk of suicide.
+
Risk Label | Data Set Count (Percent) | Test Data Count (Percent)
Strongly Concerning | 162 (17%) | 26 (9%)
Concerning | 663 (68%) | 220 (75%)
Safe to Ignore | 146 (15%) | 46 (16%)
Total | 971 (100%) | 292 (100%)
+
Table 9: Distribution of Risk Labels
+
Stage Label | Data Set Count (Percent) | Test Data Count (Percent)
Suicide Attempt | 78 (8%) | 31 (11%)
Suicide Plan | 108 (11%) | 19 (6%)
Suicide Ideation | 386 (40%) | 122 (42%)
Depression | 399 (41%) | 120 (41%)
Total | 971 (100%) | 292 (100%)
+
Table 10: Distribution of Stage Labels
+
Table 11 and Table 13 record the performance results of each classifier on the two sets of
+
labels. Each classifier was trained with 10-fold cross validation for a stable model, and the test
+
accuracy, sensitivity, and specificity metrics are based on running the model on the test data.
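A sketch of that evaluation protocol, reusing the fitted Naive Bayes model and the train/test split from the earlier sketches (the same pattern applies to the other classifiers):

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation accuracy on the training data, for model stability.
cv_scores = cross_val_score(nb, X_train, y_train, cv=10)
print("Mean 10-fold training accuracy:", cv_scores.mean())

# Test accuracy; sensitivity and specificity are derived from the test predictions.
print("Test accuracy:", nb.score(X_test, y_test))
```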
+
For the Risk labels, based on the overall accuracies, the ensemble classifier performed the
+
best with 72.6% training accuracy and 74.3% test accuracy. However, for the extreme classes that we are
+
most concerned with, the NB classifier yielded the best result. For the NB classifier, the sensitivity for
+
'Strongly Concerning' was highest which means this model can predict more instances of
+
'Strongly Concerning' correctly. The confusion matrix tables show how different predictions are
+
distributed for each classifier. The NB model was able to predict 11 instances of ‘Strongly
+
Concerning' posts out of 26 ‘Strongly Concerning' posts from the test data set. Compared to this,
+
DT and Ensemble models were only able to predict 7 instances of ‘Strongly Concerning' posts
+
correctly. NB and DT both predicted
+
6 instances of
+
‘Safe to Ignore' posts correctly but
+
+
+
+
+
comparatively, NB seems the better model since it was able to perform better on the other two
+
labels as well.
+
Running the classifiers on the Stage Labels produced similar results as observed on the
+
risk labels since we used the same dataset with a different set of labels. Overall accuracies
+
dropped for stage labels compared to risk labels. For the overall accuracy, Ensemble classifier
+
performed the best with 42.8% training and 44.2% test accuracy. Decision tree also had the
+
44.2% test accuracy, but this model was not as stable as ensemble model since it had a lower
+
training accuracy of 37.7%. The NB classifier performed the best for these individual classes. Overall
+
accuracy for NB classifier is 33.7% which is not as good as the other models, but other models
+
are able to achieve higher accuracy by improving the majority classes. NB classifier was able to
+
achieve the highest sensitivity for the 'Suicide Attempt' and 'Suicide Plan' classes.
+
Furthermore, when producing the ROC curves for the NB model on both sets of labels, we see
+
that the classes that we are most concerned with are doing better than the random chance line
+
that is denoted by the diagonal dotted line. For Risk labels, ‘Strongly Concerning’ class has the
+
highest AUC among the 3 classes which means it can predict both positive cases and negative
+
cases correctly for the ‘Strongly Concerning’ class. In case of Stage labels, ‘Depression’ class
+
has the highest AUC of 0.67, and the ‘Suicide Plan’ class has an AUC of 0.55, which is better than the
+
random chance line (AUC =0.5).
+
Fig. 12: ROC curve for NB Classifier on Risk Labels shows the performance of individual class
+
+
+
+
+
Fig. 13: ROC curve for NB Classifier on Stage Labels shows the performance of individual class
+
Classifier | Measure | Result
Naive Bayes | Training Accuracy | 0.564
Naive Bayes | Test Accuracy | 0.664
Naive Bayes | Strongly Concerning - Sensitivity / Specificity | 0.423 / 0.898
Naive Bayes | Concerning - Sensitivity / Specificity | 0.804 / 0.319
Naive Bayes | Safe to Ignore - Sensitivity / Specificity | 0.130 / 0.911
Decision Tree | Training Accuracy | 0.471
Decision Tree | Test Accuracy | 0.561
Decision Tree | Strongly Concerning - Sensitivity / Specificity | 0.269 / 0.823
Decision Tree | Concerning - Sensitivity / Specificity | 0.686 / 0.375
Decision Tree | Safe to Ignore - Sensitivity / Specificity | 0.130 / 0.854
Ensemble Classifier | Training Accuracy | 0.726
Ensemble Classifier | Test Accuracy | 0.743
Ensemble Classifier | Strongly Concerning - Sensitivity / Specificity | 0.269 / 0.955
Ensemble Classifier | Concerning - Sensitivity / Specificity | 0.955 / 0.167
Ensemble Classifier | Safe to Ignore - Sensitivity / Specificity | 0.000 / 0.988
+
Table 11: Results of Classifiers on Risk Labels
+
+
+
+
+
Actual \ Predicted | Safe to Ignore | Concerning | Strongly Concerning
Safe to Ignore | 6 | 36 | 4
Concerning | 20 | 177 | 23
Strongly Concerning | 2 | 13 | 11
+
Table 12: Confusion Matrix - Naive Bayes - Suicide Risk
+
Classifier | Measure | Result
Naive Bayes | Training Accuracy | 0.327
Naive Bayes | Test Accuracy | 0.337
Naive Bayes | Suicide Attempt - Sensitivity / Specificity | 0.194 / 0.877
Naive Bayes | Suicide Plan - Sensitivity / Specificity | 0.421 / 0.788
Naive Bayes | Suicide Ideation - Sensitivity / Specificity | 0.393 / 0.629
Naive Bayes | Depression - Sensitivity / Specificity | 0.300 / 0.762
Decision Tree | Training Accuracy | 0.377
Decision Tree | Test Accuracy | 0.442
Decision Tree | Suicide Attempt - Sensitivity / Specificity | 0.133 / 0.954
Decision Tree | Suicide Plan - Sensitivity / Specificity | 0.105 / 0.816
Decision Tree | Suicide Ideation - Sensitivity / Specificity | 0.492 / 0.722
Decision Tree | Depression - Sensitivity / Specificity | 0.517 / 0.684
Ensemble Classifier | Training Accuracy | 0.428
Ensemble Classifier | Test Accuracy | 0.442
Ensemble Classifier | Suicide Attempt - Sensitivity / Specificity | 0.129 / 0.966
Ensemble Classifier | Suicide Plan - Sensitivity / Specificity | 0.105 / 0.934
Ensemble Classifier | Suicide Ideation - Sensitivity / Specificity | 0.418 / 0.682
Ensemble Classifier | Depression - Sensitivity / Specificity | 0.600 / 0.523
+
Table 13: Results of Classifiers on Stage Labels
+
Actual \ Predicted | Depression | Suicide Ideation | Suicide Plan | Suicide Attempt
Depression | 36 | 48 | 19 | 17
Suicide Ideation | 33 | 48 | 29 | 12
Suicide Plan | 4 | 4 | 8 | 3
Suicide Attempt | 4 | 11 | 10 | 6
+
Table 14: Confusion Matrix - Naive Bayes - Suicide Stage
+
+
+
+
+
6. Discussion
+
6.1 Human Annotation
+
In our assessment of the multi-variant classification model, we first created sampled data
+
using k-means clustering. To ensure our sampled data is representative of all posts, we clustered
+
the posts into 50 clusters and used the 20 posts closest to the cluster centroid and 20 posts
+
furthest from the cluster centroid. We then aimed to label the 1874 sampled posts manually. To
+
ensure the quality of the labeling, two of the team members labeled the same post independently
+
and the third member judged the posts where there were disagreements. By the end of this
+
research, the team was able to manually label 1025 posts.
+
For the 1025 labeled posts, the agreement rate was 54.4% and 51.7% for risk and stage
+
labels respectively. There are several reasons why the agreement rates are not at their best. Firstly,
+
the posts from r/SuicideWatch are rather long compared to posts from other social media
+
platforms such as Twitter. Longer text makes it harder to read and annotate the stage because the
+
post itself might cover a long time span which includes different stages. Secondly, all
+
5 annotators come from different backgrounds and experiences without any prior mental health
+
training, which leads to different judgements. Thirdly, limited time and human resources did not
+
allow the annotators to spend more time reading each post; instead, they sometimes had to scan
+
long posts to achieve better efficiency. Fourthly, the definitions of the labels were not validated by
+
a domain expert, which might also add to the difficulty of human annotation. To sum up, the
+
agreement rates of the human annotations show that there is still some disagreement between
+
annotators, and it might have an impact on the quality of the classification. However, considering
+
all factors and the difficulty of manual labeling, the agreement rates are acceptable and hopefully
+
will get better with possible help from domain experts. Then we built classification models using
+
different algorithms. We applied Decision Trees, Naive Bayes, K-Nearest Neighbor, Ensemble
+
method, Support Vector Machine (SVM), and Neural Network (NN) techniques.
+
6.2 Human in the Loop Review
+
After labeling 1025 posts and removing those deemed irrelevant, the best performing
+
classification model was Naive Bayes, but it still performed poorly on the Risk labels. Upon
+
closer inspection, there were 54 posts that were labeled as ‘Strongly Concerning’ but the model
+
classified the posts as ‘Concerning’ or ‘Safe to Ignore’. These posts were represented in the grey
+
boxes seen in Table 11. If these were in fact strongly concerning posts, the failure to identify
+
these posts represents missed opportunities to intervene in the most critical cases.
+
Acknowledging the difficulty of assigning labels, these posts could have been mislabeled.
+
Therefore, a final review was performed on these posts with respect to the risk labels to justify
+
the assigned labels. If these posts were mislabeled, they would impact the training of the models
+
and the results.
+
The review focused on critical aspects of the posts relative to the heuristic. Most of the posts
+
in question contained compelling emotional content, which the team noticed tends to lead to
+
+
+
+
+
higher risk assignment. But a careful review revealed that certain critical aspects were not found.
+
For example, some posters indicate they have a plan, but it is for some future date, not imminent,
+
or that it is dependent on something yet to happen, or not to happen. Other posts were not
+
authored by the person at risk, but by someone trying to help someone at risk. This was one criterion
+
for marking posts irrelevant in the labeling process. Some posts were actually strongly
+
concerning.
+
After the review, 7 posts were left as ‘Strongly Concerning’, 45 were relabeled as
+
‘Concerning’, and 2 were relabeled as ‘Irrelevant’. It is important to note that these posts are
+
some of the hardest to label because how the label is assigned is somewhat dependent on how
+
much emphasis the reader places on different aspects of the post. When focused on the
+
emotional aspects it is easier to understand the higher risk assessment, while focusing on the
+
language used tends to allow a more realistic risk assessment.
+
6.3 Evaluating the Machine Learning Models

Typically, machine learning models are evaluated on their ability to accurately classify data that was not used to train them. The accuracy of a model is calculated by dividing the total number of correct predictions by the total number of instances predicted, across all classes of the data. Accuracy may be an appropriate measure when the data is balanced over all classes. However, the data used in this research is not balanced across the classes: there are far fewer ‘Strongly Concerning’ posts than ‘Concerning’ posts. In fact, the best model by accuracy did not identify a single minority-class instance. For this reason, the team decided that the sensitivity of a model on the most severe minority classes is the best way to evaluate the models. Sensitivity measures the ability of the model to identify true positives versus false negatives within a class; a high sensitivity, approaching 1, indicates that the model identifies most of the instances within the class.

Another measure that helps identify the most effective model is specificity. Specificity measures the model’s ability to correctly classify true negative instances; a high specificity indicates that there are few false positive predictions. In the case of the risk labels, it indicates that fewer posts identified as ‘Strongly Concerning’ are actually ‘Concerning’ or ‘Safe to Ignore’. Specificity is important here because false positives could drain resources away from those who need them most.
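As a concrete illustration of the two measures just defined, the snippet below computes per-class sensitivity and specificity (plus overall accuracy) from a confusion matrix whose rows are actual classes and whose columns are predicted classes. It is a generic sketch rather than the team's own evaluation code, and the example matrix uses made-up counts, not values from the paper's tables.

import numpy as np

def per_class_sensitivity_specificity(cm):
    """cm[i, j] = number of instances whose actual class is i and predicted class is j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # correctly predicted instances of each class
    fn = cm.sum(axis=1) - tp               # actual class i, predicted as something else
    fp = cm.sum(axis=0) - tp               # predicted class i, actually something else
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)           # within-class true positive rate
    specificity = tn / (tn + fp)           # true negative rate outside the class
    return sensitivity, specificity

# Illustrative 3-class layout (Safe to Ignore, Concerning, Strongly Concerning).
cm = [[30,  10,  2],
      [12, 150, 15],
      [ 1,  18,  9]]
sens, spec = per_class_sensitivity_specificity(cm)
accuracy = np.trace(np.asarray(cm, dtype=float)) / np.asarray(cm).sum()
print("sensitivity per class:", sens)
print("specificity per class:", spec)
print("overall accuracy:", accuracy)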
Based on these measures, the model with the best sensitivity on the ‘Strongly Concerning’ risk label, and the model with the highest sensitivity on the ‘Suicide Plan’ and ‘Suicide Attempt’ stage labels, would be the preferred models respectively, assuming they also had reasonable specificity. Upon review of the results, the Naive Bayes model performed best for both sets of labels, followed by the Ensemble method and then the Decision Tree; the other models performed worse than, or only marginally better than, random chance. The Ensemble Classifier and the Decision Tree actually have the same sensitivity (26.9%), but the specificity of the Ensemble Classifier is significantly better than that of the Decision Tree (95.5% vs 82.3%) on the risk labels.

As the best model, Naive Bayes performed around four times better than random chance for the ‘Strongly Concerning’ class, with a sensitivity of 42.3 percent versus the prior probability of 9 percent. On the stage labels, Naive Bayes again outperformed the prior probabilities: sensitivity for ‘Suicide Plan’ was 42 percent versus a prior probability of 6 percent, and sensitivity for ‘Suicide Attempt’ was 19 percent versus a prior probability of 11 percent.

The research team expected the Support Vector Machine to perform better, since other research indicated it had been successful. After re-reading the literature, we realized that the SVM performed well when the data was balanced, which explains why it performed poorly on this unbalanced data set.
7. Conclusion

The purpose of this research was to answer the question: “Can machine learning techniques be used to detect the risk of suicide based on informal posts made in online forums?” Two sets of labels were assessed: the risk labels (Safe to Ignore, Concerning, and Strongly Concerning) and the stage labels (Depression, Suicide Ideation, Suicide Plan, Suicide Attempt). Based on the results presented, the research indicates that machine learning techniques can indeed detect the risk of suicide from informal online posts. This paper has demonstrated that at least two models perform substantially better than random chance at identifying the posts that present the most serious and immediate risk of suicide. Naive Bayes performed best on both sets of labels, while the Decision Tree performed better than random chance.

It is important to realize that there are limitations that affect the effectiveness of machine learning techniques on this data.

The quality of the labels appears to have a huge impact on the results of the models. This is clearly demonstrated by the fact that the best model performed worse than random chance on the data prior to the Human in the Loop review, but after relabeling less than five percent of the posts, the sensitivity of the model increased to more than four times better than random chance (42% vs 9%).

Labeling posts is time consuming and prone to variability due to:

▪ Misunderstanding caused by poor spelling and slang.
▪ The annotator’s ability to focus effectively when reading long, rambling posts.
▪ The difficulty of assessing risk consistently, because each annotator’s assessment is filtered through personal experience.
▪ Highly emotional content, which tends to lead the annotator to assign a higher risk unless care is taken to dampen its effect.

Assigning the risk of suicide based on one anonymous post is difficult, particularly since the actual outcomes are unknown. This research demonstrates that there is frequent disagreement between annotators regarding the level of risk of specific posts. In fact, there were posts that an annotator read twice and then changed their risk assessment. This demonstrates how difficult it is to assign labels consistently without professional experience in this domain.
8. Future Work

Several directions for future work follow from this research.

For the data set, data cleansing techniques such as normalizing misspellings and slang could be applied to make both human annotation and machine learning more efficient and accurate. Further research could also use the comments attached to each post, either as an addition to the current features or as a separate study.

For the human annotation, consider engaging domain experts to consult on multiple parts of the annotation process. More labeled data could be added to the classification models to enrich the sampled data set, and a better labeling interface would make the annotation more efficient. The risk levels themselves could be improved by using an evidence-based evaluation method. Since some long posts contain multiple paragraphs spanning different suicide stages, consider assessing the risk or stage of each paragraph instead of the whole post.

For the classification approaches, other methods such as Doc2Vec and LDA could be applied, and different feature sets could be used to represent the posts, for example the 300 latent features extracted in previous Word2Vec research [5]. To address the class-imbalance issue, oversampling or under-sampling techniques could help balance the data.
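One simple way to act on the last suggestion is random oversampling of the minority classes on the training split only, so that the evaluation data stays untouched. The sketch below uses scikit-learn's resample utility on a dense feature matrix; it illustrates the general technique rather than code from the study, and dedicated libraries such as imbalanced-learn offer more sophisticated options (e.g. SMOTE).

import numpy as np
from sklearn.utils import resample

def oversample_to_majority(X, y, random_state=123):
    """Duplicate minority-class rows at random until every class matches the largest class.

    X: dense 2-D feature array (e.g. per-post cluster term frequencies).
    y: 1-D array of class labels. Apply to the training split only.
    """
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls in classes:
        mask = (y == cls)
        X_cls, y_cls = resample(X[mask], y[mask], replace=True,
                                n_samples=target, random_state=random_state)
        X_parts.append(X_cls)
        y_parts.append(y_cls)
    return np.vstack(X_parts), np.concatenate(y_parts)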
9. References

[1] “Suicide Statistics.” AFSP, afsp.org/about-suicide/suicide-statistics/.

[2] Tavernise, Sabrina. “U.S. Suicide Rate Surges to a 30-Year High.” The New York Times, The New York Times, … a-30-year-high.html.

[3] Nock, Matthew K., et al. “Cross-National Prevalence and Risk Factors for Suicidal Ideation, Plans and Attempts.” The British Journal of Psychiatry, The Royal College of Psychiatrists, 1 Feb. 2008, bjp.rcpsych.org/content/192/2/98.

[4] Deboer, Fredrik. “America's Suicide Epidemic Is a National Security Crisis.” Foreign Policy, Foreign Policy, 29 Apr. 2016, foreignpolicy.com/2016/04/28/americas-suicide-epidemic-is-a-national-security-crisis/.

[5] Grant, Reilly, et al. “Automatic Extraction of Informal Topics from Online Suicidal Ideation.” ACM 11th International Workshop on Data and Text Mining in Biomedical Informatics, 2017.

[6] Grant, Reilly, et al. “Discovery of Informal Topics from Post Traumatic Stress Disorder Forums.” 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017, doi:10.1109/icdmw.2017.65.

[7] Odea, Bridianne, et al. “Detecting Suicidality on Twitter.” Internet Interventions, vol. 2, no. 2, 2015, pp. 183-188, doi:10.1016/j.invent.2015.03.005.

[8] Burnap, Pete, et al. “Multi-Class Machine Classification of Suicide-Related Communication on Twitter.” Online Social Networks and Media, vol. 2, 2017, pp. 32-44, doi:10.1016/j.osnem.2017.08.001.

[9] Chen, Xin, et al. “Mining Social Media Data for Understanding Students’ Learning Experiences.” IEEE Transactions on Learning Technologies, vol. 7, no. 3, 2014, pp. 246-259, doi:10.1109/tlt.2013.2296520.

[10] Anshary, Muhammad Adi Khairul, and Bambang Riyanto Trilaksono. “Tweet-Based Target Market Classification Using Ensemble Method.” Journal of ICT Research and Applications, vol. 10, no. 2, 2016, pp. 123-139, doi:10.5614/itbj.ict.res.appl.2016.10.2.3.

[12] “Suicide Prevention.” National Institute of Mental Health, U.S. Department of Health and Human Services.

[13] Patodkar, Vaibhavi N., and Sheikh I.R. “Twitter as a Corpus for Sentiment Analysis and Opinion Mining.” IJARCCE, vol. 5, no. 12, 2016, pp. 320-322, doi:10.17148/ijarcce.2016.51274.

[14] Spasić, Irena, et al. “A Naïve Bayes Approach to Classifying Topics in Suicide Notes.” Biomedical Informatics Insights, vol. 5s1, 2012, doi:10.4137/bii.s8945.

[15] Wang, Zhe, and Xiangyang Xue. “Multi-Class Support Vector Machine.” Support Vector Machines Applications, 2014, pp. 23-48, doi:10.1007/978-3-319-02300-7_2.

[16] “Support Vector Machine.” Wikipedia, Wikimedia Foundation, 16 Feb. 2018, en.wikipedia.org/wiki/Support_vector_machine.

[17] “Universal Approximation Theorem.” Wikipedia, Wikimedia Foundation, 13 Feb. 2018, en.wikipedia.org/wiki/Universal_approximation_theorem.

[18] “The Number of Hidden Layers.” Heaton Research, 19 Jan. 2018.

[19] Nielsen, Michael A. “Neural Networks and Deep Learning.” Determination Press, neuralnetworksanddeeplearning.com/index.html.

[20] Srivastava, Nitish, et al. “A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research, 2014.

[21] Qualtrics survey website, used for human annotation.
Appendix

mean     4.75    0.00    5.13    4.75
std      8.10    0.00    7.90    8.10
min         0       0      -3       0
25%         1       0       1       1
50%         3       0       3       3
75%         5       0       6       5
max       483       0     473     483

Table 15: Table describing the numeric features of the data set
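Table 15 has the shape of a standard descriptive-statistics summary over the numeric post fields. A minimal sketch of producing such a table with pandas is shown below; the file name is a placeholder, and the column names are taken from the variable table later in this document (ups, downs, score, num_comments), which may not exactly match the columns behind Table 15.

import pandas as pd

# Placeholder path; the paper does not give the location of the Reddit dump.
posts = pd.read_csv("suicidewatch_posts.csv")

# mean / std / min / quartiles / max for the numeric fields, analogous to Table 15.
summary = posts[["ups", "downs", "score", "num_comments"]].describe()
print(summary.round(2))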
Actual \ Predicted    Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                    36                 48             19                17
Suicide Ideation              33                 48             29                12
Suicide Plan                   4                  4              8                 3
Suicide Attempt                4                 11             10                 6

Table 16: Confusion Matrix - Naive Bayes - Suicide Stage

Actual \ Predicted    Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                    62                 30             22                 6
Suicide Ideation              34                 60             22                 6
Suicide Plan                   8                  9              2                 0
Suicide Attempt               12                  8              6                 4

Table 17: Confusion Matrix - Decision Tree - Suicide Stage

Actual \ Predicted    Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                    52                 56             11                 1
Suicide Ideation              52                 56             11                 3
Suicide Plan                  10                  6              3                 0
Suicide Attempt               10                 19              2                 0

Table 18: Confusion Matrix - K Nearest Neighbors - Suicide Stage
Actual \ Predicted    Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                    72                 38              6                 4
Suicide Ideation              57                 51             10                 4
Suicide Plan                  11                  5              2                 1
Suicide Attempt               14                 11              2                 4

Table 19: Confusion Matrix - Ensemble Classifier - Suicide Stage

Actual \ Predicted    Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                   114                  6              0                 0
Suicide Ideation             119                  3              0                 0
Suicide Plan                  19                  0              0                 0
Suicide Attempt               31                  0              0                 0

Table 20: Confusion Matrix - Support Vector Machines - Suicide Stage

Actual \ Predicted    Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                   118                  0              2                 0
Suicide Ideation             119                  0              3                 0
Suicide Plan                  19                  0              0                 0
Suicide Attempt               31                  0              0                 0

Table 21: Confusion Matrix - Neural Network - Suicide Stage
Actual \ Predicted      Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                       6           28                    12
Concerning                          34          151                    35
Strongly Concerning                  2           17                     7

Table 22: Confusion Matrix - Decision Tree - Suicide Risk

Actual \ Predicted      Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                       1           37                     8
Concerning                          11          192                    17
Strongly Concerning                  2           21                     3

Table 23: Confusion Matrix - K Nearest Neighbors - Suicide Risk

Actual \ Predicted      Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                       0           42                     4
Concerning                           2          210                     8
Strongly Concerning                  1           18                     7

Table 24: Confusion Matrix - Ensemble Classifier - Suicide Risk

Actual \ Predicted      Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                       0           46                     0
Concerning                           0          220                     0
Strongly Concerning                  0           26                     0

Table 25: Confusion Matrix - Support Vector Machine - Suicide Risk

Actual \ Predicted      Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                       4           42                     0
Concerning                          12          208                     0
Strongly Concerning                  0           26                     0

Table 26: Confusion Matrix - Neural Network - Suicide Risk
Fig 14: Conceptual diagram of the labeling interface
Assessing the Risk of Suicide Based on Informal Online Discussions Using Machine Learning

RAS Team
DOLORES KE DING (IN-CLASS)
UTSAB SHRESTHA (ONLINE)
ROYA PASHMINEH (ONLINE)
DAVID MARKS (ONLINE)
BISWAJIT DHAR (ONLINE)

Can machine learning techniques be used to detect the risk of suicide based on informal posts made in online forums?
❖ Odea et al. manually coded Twitter data with three levels of severity. [3]
• Found a 76% overall agreement rate with Twitter data.
❖ Burnap et al. used base and ensemble classifiers to identify suicide ideation features for machine learning. [4]
• The ensemble classifier performed better than the base classifiers.
• One of the challenges was to differentiate between suicide ideation and flippant references to it, due to the sarcasm used.
❖ Anshary et al. built a machine learning classifier to analyze Twitter data for target market classification. [5]
• Posts were classified solely on keyword frequency, leaving the possibility of not understanding the post in its right context.
❖ Grant et al. used a Word2Vec language model with word embeddings on the same Reddit data being used in this project. [2]
• They were able to capture much of the semantic information in the corpus relevant to the suicidal ideation topic.
• Number of posts: 94671
• Number of users: 46077
• Number of words used in the posts: 20716133
• Number of unique words in the posts: 71353

Challenge:
• Lack of ground truth
• Huge data size
• Incomplete data w/o labels

Variable Name   Data Type   Definition & Comments
title           Text        Title of the post
created_utc     Numeric     UTC time zone (Coordinated Universal Time)
ups             Numeric     Number of users that liked this post
downs           Numeric     Number of users that don't like this post
Num_comments    Numeric     Number of comments for the post
Name            Text        A fullname: a combination of a thing's type (t3_) and its unique ID, which forms a compact encoding of a globally unique ID on Reddit
Id              Text        Unique post id
from            Float       No data available (all blanks)
From_id         Float       No data available (all blanks)
selftext        Text        Body of the post
subreddit       Text        The Reddit sub-site, which is "SuicideWatch"
score           Numeric     Reddit voting score that evaluates the popularity of a post
author          Text        It seems to not contain valid information
url             Text        Link to the post
permalink       Text        URL link to a single comment in a thread. Example: /r/SuicideWatch/comments/3fcrkb/i_have_no_idea_what_to_do/
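The corpus-level counts above could be reproduced from the raw dump roughly as follows. This is a hedged sketch: the file name is a placeholder, whitespace tokenization is an assumption, and the author field is noted above as possibly unreliable, so the user count in particular may not have been computed this way in the original work.

import pandas as pd

posts = pd.read_csv("suicidewatch_posts.csv")   # placeholder path for the Reddit dump

n_posts = len(posts)                             # ~94671 in the overview above
n_users = posts["author"].nunique()              # only meaningful if the author field is usable

# Crude whitespace tokenization over title + body.
tokens = (posts["title"].fillna("") + " " + posts["selftext"].fillna("")).str.lower().str.split()
n_words = int(tokens.map(len).sum())
n_unique_words = len({w for doc in tokens for w in doc})

print(n_posts, n_users, n_words, n_unique_words)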
The project followed a five-step pipeline:

• Data Gathering & Pre-processing
• Data Sampling
• Data Labeling
• Data Classification
• Data Evaluation & Prediction
DATA SAMPLING - OVERVIEW

Total Records: 1874
Similarity Measure: Cosine

Clusters           Count
> 40 Posts         35
10 to 39 Posts     3
< 10 Posts         2
Total Clusters     50

Column       Definition
Similarity   Cosine similarity
Label        Label number
id           Unique id of post
Post         Text of post
word-count   Count of words
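The sampling relied on cosine similarity over post clusters taken from the earlier Word2Vec study. As a generic illustration of cosine-based grouping (not the team's actual sampling code), the sketch below builds TF-IDF vectors, normalizes them so that Euclidean k-means behaves like cosine clustering, and computes cosine similarities of the kind recorded in the Similarity column above; load_posts is a hypothetical helper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

posts = load_posts()  # hypothetical helper returning the post bodies

X = TfidfVectorizer(stop_words="english").fit_transform(posts)
X = normalize(X)                                   # unit-length rows: Euclidean distance ~ cosine
clusters = KMeans(n_clusters=50, random_state=123).fit_predict(X)

# Cosine similarity of one post against the whole corpus; values like these can feed
# a stratified sample drawn across the 50 clusters.
sims = cosine_similarity(X[:1], X)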
LABELING - OVERVIEW

** Paper presentation provided by Dr. Gammell

Risk Category (labeled data, n=874)

Category                Frequency        Agreement Rate
Safe to Ignore          129 (14.9%)      47 (36.4%)
Concerning              528 (60.8%)      309 (58.5%)
Strongly Concerning     176 (20.3%)      89 (50.6%)
Irrelevant              41 (4.7%)        37 (90.2%)
Overall                 874 (100%)       482 (55.1%)

Stage Category (labeled data, n=874)

Category                Frequency        Agreement Rate
Depression              345 (39.7%)      187 (54.2%)
Suicide Ideation        328 (37.8%)      166 (50.6%)
Suicide Plan            95 (10.9%)       40 (42.1%)
Suicide Attempt         65 (7.5%)        27 (41.5%)
Irrelevant              41 (4.7%)        37 (90.2%)
Overall                 874 (100%)       457 (52.3%)
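The agreement-rate column is simply the number of posts on which the annotators agreed divided by the category frequency (for example, 309 of the 528 'Concerning' posts, about 58.5%). A minimal check of the risk-label half of the table:

# (frequency, number of posts with annotator agreement) per risk category, from the table above.
risk = {
    "Safe to Ignore":      (129, 47),
    "Concerning":          (528, 309),
    "Strongly Concerning": (176, 89),
    "Irrelevant":          (41, 37),
}
for category, (frequency, agreed) in risk.items():
    print(f"{category}: {agreed / frequency:.1%}")

total = sum(freq for freq, _ in risk.values())        # 874
agreed_total = sum(agr for _, agr in risk.values())   # 482
print(f"Overall: {agreed_total / total:.1%}")         # ~55.1%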
CLASSIFICATION - OVERVIEW

• Use the clusters identified in the previous study as a vector space model.
• Compute the aggregate term frequencies for the words within each cluster, per post.
• 874 records; 833 records fed to the classifiers; 41 irrelevant.

Each post P1 … Pn is represented as a row of cluster features C1, C2, C3, …, C100 plus its label.

Classifier   Parameters
NB           Alpha (smoothing parameter), fit_prior, class_prior (prior class probability)
DT           criterion, splitter, max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split, min_impurity_split, max_features, class_weight
KNN          n_neighbors, weights, algorithm, leaf_size, metric, metric_params, p
SVM          C (penalty of error term), kernel, degree (degree of kernel), gamma (kernel coefficient), class_weight, probability, decision_function_shape (ovo, ovr)
NN           Number of epochs, batch size; optimizers: rmsprop, adam, SGD; SGD optimizer: learning rate, momentum; dropout rate
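The cluster-feature representation described above can be sketched as follows: each post becomes a 100-dimensional vector whose j-th entry is the aggregate term frequency of the post's words that belong to cluster j. The word_to_cluster mapping is assumed to come from the earlier Word2Vec study [2]; the helper below is illustrative, not the team's code.

from collections import Counter

def cluster_term_frequencies(post_tokens, word_to_cluster, n_clusters=100):
    """Return [C1, ..., C100]: counts of the post's tokens falling in each word cluster."""
    features = [0] * n_clusters
    for word, count in Counter(post_tokens).items():
        cluster = word_to_cluster.get(word)   # None for out-of-vocabulary words
        if cluster is not None:
            features[cluster] += count
    return features

# Example usage (word_to_cluster maps vocabulary words to cluster indices 0..99):
# row = cluster_term_frequencies(post_text.lower().split(), word_to_cluster)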
RESULT - RISK LABEL

Model      Category              Sensitivity   Specificity   Overall Accuracy (Training / Testing)   Parameters
DT         Safe to Ignore        0.431         0.773         0.477 / 0.509                           Default
           Concerning            0.634         0.494
           Strongly Concerning   0.116         0.846
NB         Safe to Ignore        0.250         0.801         0.499 / 0.454                           'alpha': 0.2, 'fit_prior': True, 'class_prior': None
           Concerning            0.530         0.620
           Strongly Concerning   2             0.697
KNN        Safe to Ignore        0.046         0.918         0.543 / 0.573                           K=5
           Concerning            0.823         0.195
           Strongly Concerning   0.159         0.903
SVM        Safe to Ignore        0             0.995         0.573 / 0.649                           'kernel': 'rbf', 'C': 10, 'probability': False, 'degree': 3, 'shrinking': True, 'max_iter': -1, 'random_state': 123, 'tol': 0.001, 'cache_size': 200, 'coef0': 0.0, 'gamma': 0.1, 'class_weight': 'balanced'
           Concerning            0.993         0
           Strongly Concerning   0             1
Ensemble   Safe to Ignore        0.069         0.913         0.628 / 0.581
           Concerning            0.823         0.264
           Strongly Concerning   0.181         0.888
NN         Safe to Ignore        0             0.990         0.610 / 0.642                           2 hidden layers, optimizer=adam, Epochs=1000, Batch=10, Activation=ReLU/softmax
           Concerning            0.981         0.514
           Strongly Concerning   0.023         0.99
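The neural-network configuration listed in this table (two hidden layers, adam, 1000 epochs, batch size 10, ReLU hidden activations, softmax output, dropout) can be written in Keras roughly as below. The hidden-layer widths, the dropout rate, and the 100-feature input are assumptions for illustration; the slides do not state them exactly.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(100,)),   # 100 cluster features per post (assumed)
    Dropout(0.5),                                       # dropout rate not given; 0.5 assumed
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(3, activation="softmax"),                     # Safe to Ignore / Concerning / Strongly Concerning
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=1000, batch_size=10, validation_split=0.2)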
RESULT - CONFUSION MATRIX - RISK LABEL

NB - Actual \ Predicted         Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                              16           13                    14
Concerning                                  50           87                    27
Strongly Concerning                         13           20                    11

Ensemble - Actual \ Predicted   Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                               3           29                    11
Concerning                                  17          135                    12
Strongly Concerning                          1           35                     8

DT - Actual \ Predicted         Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                               5           24                    14
Concerning                                  27          104                    33
Strongly Concerning                          5           20                    19

SVM - Actual \ Predicted        Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                               0           43                     0
Concerning                                   1          163                     0
Strongly Concerning                          0           44                     0

KNN - Actual \ Predicted        Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                               2           34                     7
Concerning                                  16          135                    13
Strongly Concerning                          1           36                     7

NN - Actual \ Predicted         Safe to Ignore   Concerning   Strongly Concerning
Safe to Ignore                               1           42                     0
Concerning                                   1          161                     2
Strongly Concerning                          1           43                     0
RESULT - STAGE LABEL

Model      Category           Sensitivity   Specificity   Overall Accuracy (Training / Testing)
DT         Depression         0.381         0.603         0.368 / 0.362
           Suicide Ideation   0.460         0.609
           Suicide Plan       0.111         0.879
           Suicide Attempt    0.105         0.936
NB         Depression         0.257         0.829         0.339 / 0.342
           Suicide Ideation   0.550         0.563
           Suicide Plan       0             0.866
           Suicide Attempt    8             0.849
KNN        Depression         0.485         0.452         0.371 / 0.378
           Suicide Ideation   0.33          0.596
           Suicide Plan       0.07          0.9375
           Suicide Attempt    0             0.956
SVM        Depression         0.99          0.020         0.422 / 0.390
           Suicide Ideation   0.02          0.986
           Suicide Plan       0             1
           Suicide Attempt    0             1
Ensemble   Depression         0.523         0.486         0.396 / 0.422
           Suicide Ideation   0.39          0.609
           Suicide Plan       0.148         0.955
           Suicide Attempt    0             0.961
NN         Depression         0.914         0.095         0.426 / 0.401
           Suicide Ideation   0.04          0.880
           Suicide Plan       0.037         0.980
           Suicide Attempt    0             0.990
RESULT - CONFUSION MATRIX - STAGE LABEL

NB - Actual \ Predicted         Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                              27                 46             12                20
Suicide Ideation                        18                 55             13                14
Suicide Plan                             6                 10             10                 1
Suicide Attempt                          1                 10              5                 3

Ensemble - Actual \ Predicted   Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                              55                 39              5                 6
Suicide Ideation                        55                 39              3                 3
Suicide Plan                            13                 10              4                 0
Suicide Attempt                          7                 10              2                 0

DT - Actual \ Predicted         Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                              40                 42             16                 7
Suicide Ideation                        38                 46              9                 7
Suicide Plan                            13                  9              3                 2
Suicide Attempt                          7                  8              2                 2

SVM - Actual \ Predicted        Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                             104                  1              0                 0
Suicide Ideation                        98                  2              0                 0
Suicide Plan                            27                  0              0                 0
Suicide Attempt                         18                  1              0                 0

KNN - Actual \ Predicted        Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                              51                 41             10                 3
Suicide Ideation                        59                 33              3                 5
Suicide Plan                            12                 11              2                 2
Suicide Attempt                         12                  7              0                 0

NN - Actual \ Predicted         Depression   Suicide Ideation   Suicide Plan   Suicide Attempt
Depression                              96                  9              0                 0
Suicide Ideation                        96                  4              0                 0
Suicide Plan                            37                  0              0                 0
Suicide Attempt                          9                  9              1                 0
RESULT

• Hoped that accuracy improved as the number of annotated posts increased - Inconclusive
• Hoped that the 95% confidence intervals when training would narrow as the number of posts increased - Inconclusive

3 Random Sample Results on Risk Labels
EVALUATION

Model                    Predictability          Stability   Computability                                                                                                 Interpretability
Naïve Bayes              Low                     High        Simple, fast, works great with large data sets                                                                Easy
Decision Tree            Low                     Medium      Quick to train, quick to predict                                                                              Easy
K-Nearest Neighbors      Low                     High        Easy to build the model; takes longer when adding new data                                                   Easy
Ensemble                 Low                     High        Depends on the component classifiers; the ensemble of NB, DT, KNN was relatively fast                         Since the classifiers were simple, relatively easy to interpret
Support Vector Machine   Low on minority class   High        Depends on the kernel and the number of support vectors; faster than NN, but training takes longer than NB   The complexity is usually high, and interpretability is also impaired
Neural Network           Low                     Medium      Longer training times, quick to predict                                                                       Not easy
● Generating high quality labels is difficult, time consuming, and subject to wide variation in the interpretation of the posts.
● Agreement among annotators was relatively low for the highest risk categories (‘Strongly Concerning’ - 51%, ‘Suicide Attempt’ - 42%).
● Accuracy is not the best measure of success because the data is not balanced; sensitivity and specificity are better measures.

RELATED WORK

In [1], the authors show experimentally that NB is a good classifier for sentiment analysis on informal social media data (Twitter). The NB classifier was also the best performing classifier in our case.

LIMITATIONS

• Lack of expertise in the research team for annotation of the posts.
• The Reddit data used in this study does not provide demographic features such as age, gender, or location; results are generalized based solely on the textual content.
• The current study was not able to validate the actual risk of suicide; the labels were based on analysis of the posts.
• Computational system limitations for the Neural Network and SVM: we did not have a proper environment to leverage the parallel processing capabilities of the TensorFlow framework on a GPU, or to effectively grid search for the optimal parameters for the SVM.
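The SVM parameter search mentioned in the last limitation would typically be done with scikit-learn's GridSearchCV. The sketch below is illustrative only: the grid values are assumptions, and macro-averaged recall is chosen as the scoring metric because the study cares most about per-class sensitivity.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1],
    "kernel": ["rbf", "linear"],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(SVC(random_state=123), param_grid,
                      scoring="recall_macro", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)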
CHALLENGE

• Difficulty in labeling the posts due to strong emotions and the different ways each author expresses themselves; this caused a lot of disagreement between the annotators.
• Due to the length of the posts, it was difficult to use word features; strong words were used even in posts that were not severe.
• The class labels were unbalanced for both sets of labels, which affected the classification results.

[1] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of the LREC, 2010.

Research Question

Can machine learning techniques be used to detect the risk of suicide based on informal posts made in online forums?

Conclusion

❖ Machine learning techniques can perform better than random chance at identifying the risk of suicide, with the following limitations:
• The quality of the labels appears to have a huge impact on the results of the models.
• Labeling posts is time consuming and prone to variability due to:
  • Misunderstanding due to poor spelling / slang.
  • The annotator's ability to focus effectively.
  • Qualitative assessment of risk based on personal experience.
• Assigning the risk of suicide based on one post is difficult.
❖ Naive Bayes had the best all-around performance of the classifiers tried.
❖ Data set ---
• Cleanse posts by translating misspellings and slang.
• Consider using the comments component of each post.
❖ Human annotation ---
• Consider engaging a subject matter expert for human annotation of the posts.
• Include more labelled data in the classification models.
• Apply a better interface for labeling the data.
• Use an evidence-based evaluation method to determine the level of risk.
• Assess the risk of each paragraph instead of the whole post.
❖ Classification approaches ---
• Apply other classification approaches such as Doc2Vec (see the sketch after this list).
• Use different sets of features as the representation of the posts, for example the 300 latent features extracted from previous Word2Vec research.
• Adapt oversampling/undersampling techniques to help balance the data.
• Human in the loop or active learning labeling of posts.
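As a starting point for the Doc2Vec suggestion above, gensim's implementation could be used roughly as follows. This is a hedged sketch: tokenized_posts is a hypothetical list of tokenized post bodies, the 300-dimension choice simply mirrors the Word2Vec feature size mentioned above, and the parameter names assume a reasonably recent gensim release.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_posts: hypothetical list of token lists, one per post.
documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_posts)]
model = Doc2Vec(documents, vector_size=300, window=5, min_count=2, epochs=20)

vector = model.infer_vector(tokenized_posts[0])   # 300-dim embedding usable as classifier features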
• [1] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of the LREC, 2010.
• [2] Grant, Reilly, et al. “Automatic extraction of informal topics from online suicidal ideation.” ACM 11th International Workshop on Data and Text Mining in Biomedical Informatics, 2017.
• [3] Odea, Bridianne, et al. “Detecting suicidality on Twitter.” Internet Interventions, vol. 2, no. 2, 2015, pp. 183-188, doi:10.1016/j.invent.2015.03.005.
• [4] Burnap, Pete, et al. “Multi-Class machine classification of suicide-related communication on Twitter.” Online Social Networks and Media, vol. 2, 2017, pp. 32-44, doi:10.1016/j.osnem.2017.08.001.
• [5] Anshary, Muhammad Adi Khairul, and Bambang Riyanto Trilaksono. “Tweet-Based Target Market Classification Using Ensemble Method.” Journal of ICT Research and Applications, vol. 10, no. 2, 2016, pp. 123-139, doi:10.5614/itbj.ict.res.appl.2016.10.2.3.
diff --git a/docs/site_libs/highlightjs-1.1/default.css b/docs/site_libs/highlightjs-1.1/default.css
new file mode 100644
index 000000000..17d80de9c
--- /dev/null
+++ b/docs/site_libs/highlightjs-1.1/default.css
@@ -0,0 +1,30 @@
+pre .operator,
+pre .paren {
+ color: rgb(104, 118, 135)
+}
+
+pre .literal {
+ color: #990073
+}
+
+pre .number {
+ color: #099;
+}
+
+pre .comment {
+ color: #998;
+ font-style: italic
+}
+
+pre .keyword {
+ color: #900;
+ font-weight: bold
+}
+
+pre .identifier {
+ color: rgb(0, 0, 0);
+}
+
+pre .string {
+ color: #d14;
+}
diff --git a/docs/site_libs/highlightjs-9.12.0/textmate.css b/docs/site_libs/highlightjs-9.12.0/textmate.css
deleted file mode 100644
index 6efd43560..000000000
--- a/docs/site_libs/highlightjs-9.12.0/textmate.css
+++ /dev/null
@@ -1,19 +0,0 @@
-.hljs-literal {
- color: rgb(88, 72, 246);
-}
-
-.hljs-number {
- color: rgb(0, 0, 205);
-}
-
-.hljs-comment {
- color: rgb(76, 136, 107);
-}
-
-.hljs-keyword {
- color: rgb(0, 0, 255);
-}
-
-.hljs-string {
- color: rgb(3, 106, 7);
-}
diff --git a/docs/site_libs/navigation-1.1/FileSaver.min.js b/docs/site_libs/navigation-1.1/FileSaver.min.js
new file mode 100644
index 000000000..6268ec99d
--- /dev/null
+++ b/docs/site_libs/navigation-1.1/FileSaver.min.js
@@ -0,0 +1,2 @@
+/*! @source http://purl.eligrey.com/github/FileSaver.js/blob/master/FileSaver.js */
+var saveAs=saveAs||function(e){"use strict";if("undefined"==typeof navigator||!/MSIE [1-9]\./.test(navigator.userAgent)){var t=e.document,n=function(){return e.URL||e.webkitURL||e},o=t.createElementNS("http://www.w3.org/1999/xhtml","a"),r="download"in o,i=function(e){var t=new MouseEvent("click");e.dispatchEvent(t)},a=/Version\/[\d\.]+.*Safari/.test(navigator.userAgent),c=e.webkitRequestFileSystem,f=e.requestFileSystem||c||e.mozRequestFileSystem,u=function(t){(e.setImmediate||e.setTimeout)(function(){throw t},0)},d="application/octet-stream",s=0,l=4e4,v=function(e){var t=function(){"string"==typeof e?n().revokeObjectURL(e):e.remove()};setTimeout(t,l)},p=function(e,t,n){t=[].concat(t);for(var o=t.length;o--;){var r=e["on"+t[o]];if("function"==typeof r)try{r.call(e,n||e)}catch(i){u(i)}}},w=function(e){return/^\s*(?:text\/\S*|application\/xml|\S*\/\S*\+xml)\s*;.*charset\s*=\s*utf-8/i.test(e.type)?new Blob(["\ufeff",e],{type:e.type}):e},y=function(t,u,l){l||(t=w(t));var y,m,S,h=this,R=t.type,O=!1,g=function(){p(h,"writestart progress write writeend".split(" "))},b=function(){if(m&&a&&"undefined"!=typeof FileReader){var o=new FileReader;return o.onloadend=function(){var e=o.result;m.location.href="data:attachment/file"+e.slice(e.search(/[,;]/)),h.readyState=h.DONE,g()},o.readAsDataURL(t),void(h.readyState=h.INIT)}if((O||!y)&&(y=n().createObjectURL(t)),m)m.location.href=y;else{var r=e.open(y,"_blank");void 0===r&&a&&(e.location.href=y)}h.readyState=h.DONE,g(),v(y)},E=function(e){return function(){return h.readyState!==h.DONE?e.apply(this,arguments):void 0}},N={create:!0,exclusive:!1};return h.readyState=h.INIT,u||(u="download"),r?(y=n().createObjectURL(t),void setTimeout(function(){o.href=y,o.download=u,i(o),g(),v(y),h.readyState=h.DONE})):(e.chrome&&R&&R!==d&&(S=t.slice||t.webkitSlice,t=S.call(t,0,t.size,d),O=!0),c&&"download"!==u&&(u+=".download"),(R===d||c)&&(m=e),f?(s+=t.size,void f(e.TEMPORARY,s,E(function(e){e.root.getDirectory("saved",N,E(function(e){var n=function(){e.getFile(u,N,E(function(e){e.createWriter(E(function(n){n.onwriteend=function(t){m.location.href=e.toURL(),h.readyState=h.DONE,p(h,"writeend",t),v(e)},n.onerror=function(){var e=n.error;e.code!==e.ABORT_ERR&&b()},"writestart progress write abort".split(" ").forEach(function(e){n["on"+e]=h["on"+e]}),n.write(t),h.abort=function(){n.abort(),h.readyState=h.DONE},h.readyState=h.WRITING}),b)}),b)};e.getFile(u,{create:!1},E(function(e){e.remove(),n()}),E(function(e){e.code===e.NOT_FOUND_ERR?n():b()}))}),b)}),b)):void b())},m=y.prototype,S=function(e,t,n){return new y(e,t,n)};return"undefined"!=typeof navigator&&navigator.msSaveOrOpenBlob?function(e,t,n){return n||(e=w(e)),navigator.msSaveOrOpenBlob(e,t||"download")}:(m.abort=function(){var e=this;e.readyState=e.DONE,p(e,"abort")},m.readyState=m.INIT=0,m.WRITING=1,m.DONE=2,m.error=m.onwritestart=m.onprogress=m.onwrite=m.onabort=m.onerror=m.onwriteend=null,S)}}("undefined"!=typeof self&&self||"undefined"!=typeof window&&window||this.content);"undefined"!=typeof module&&module.exports?module.exports.saveAs=saveAs:"undefined"!=typeof define&&null!==define&&null!==define.amd&&define([],function(){return saveAs});
\ No newline at end of file
diff --git a/docs/site_libs/navigation-1.1/sourceembed.js b/docs/site_libs/navigation-1.1/sourceembed.js
index 8464b0c88..e1b2cf852 100644
--- a/docs/site_libs/navigation-1.1/sourceembed.js
+++ b/docs/site_libs/navigation-1.1/sourceembed.js
@@ -1,12 +1,9 @@
+
window.initializeSourceEmbed = function(filename) {
$("#rmd-download-source").click(function() {
- var src = $("#rmd-source-code").html();
- var a = document.createElement('a');
- a.href = "data:text/x-r-markdown;base64," + src;
- a.download = filename;
- document.body.appendChild(a);
- a.click();
- document.body.removeChild(a);
+ var src = window.atob($("#rmd-source-code").html());
+ var blob = new Blob([src], {type: "text/x-r-markdown"});
+ saveAs(blob, filename);
});
};
diff --git a/images/profile.png b/images/profile.png
new file mode 100644
index 000000000..7e9ffbbd1
Binary files /dev/null and b/images/profile.png differ
diff --git a/images/utsab_profile.jpg b/images/utsab_profile.jpg
new file mode 100644
index 000000000..4d15a4510
Binary files /dev/null and b/images/utsab_profile.jpg differ
diff --git a/index.Rmd b/index.Rmd
index 12fa1cb9f..2d1a5ccbd 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -1,43 +1,70 @@
---
-title: "MyLabJournal"
+title: "Utsab"
output: html_document
---
-# **My Lab Journal**
-This is a template example for lab journaling. Students in Matt Crump's Human Cognition and Performance Lab will use this template to learn R, and other things. Students can replace this text with more fun things. People not in my lab can use this too.
+
+Hi! I am Utsab. I am a proactive, fast-learning individual with strong analytical and quantitative skills. I am currently looking to work in a dynamic data analytics role where I can apply my expertise to help the company achieve its business goals while staying true to its vision, mission, and values.
+
+
+BACKGROUND
-## How to use
+I have always been a numbers person, with exceptional mathematics and computer skills. I have spent the past 5 years analyzing complex business requirements, compiling market and trend data, and designing enterprise-level solutions to accelerate efficiency and revenue growth. I am fluent in several data management systems and software, including Excel, Microsoft SQL Server, Python, R, and Power BI. Statistical significance, A/B testing, and data-driven optimization are among my areas of expertise.
-1. fork the repo for this website and follow instructions on read me to get set up. [https://github.com/CrumpLab/LabJournalWebsite](https://github.com/CrumpLab/LabJournalWebsite)
+As much as I’m into data manipulation, it’s the analysis of data that really gets me going. I enjoy exploring relationships in data and translating them into stories. In the age of big data, these stories become actionable solutions and strategies for businesses, and I take pride in making data accessible to both executive decision-makers and frontline sales staff.
-2. Blog/journal what you are doing in R, by editing the Journal.Rmd. See the [Journal page](https://crumplab.github.io/LabJournalWebsite/Journal.html) for an example of what to do to get started learning R.
+In my last role, as a business intelligence analyst for an online travel agency, I worked with cross-functional teams to structure problems, identify appropriate data sources, extract data, and develop integrated information delivery solutions. I led the analysis of pricing opportunities, customer behavior, and competitive positioning to build a long-term pricing strategy for the firm.
-3. See the [links page](https://crumplab.github.io/LabJournalWebsite/Links.html) for lots of helpful links on learning R.
+On a personal level, I am detail-oriented, organized, and precise in my work. I have strong communication skills and a knack for clear, illuminating presentations. I’m comfortable working through the numbers on my own, but I really enjoy being part of a motivated team of smart people.
+
+I have completed a graduate program in Predictive Analytics and hold a bachelor’s degree in computer information systems with a business concentration.
+
+Besides spreadsheets and charts, I am passionate about nature and the mountains, and I live an active lifestyle.
+
+---
+
+
+PROJECTS
+
+ 1. Study of Interaction Patterns in Primary School Children Network
+
+ 2. Dating Compatibility and Chemistry: Analysis of a Speed-Dating Experiment
+
+ 3. Risk Assessment of Suicide Using Machine Learning
+
-4. Change everything to make it your own.