diff --git a/README.md b/README.md
index 8b9b8b7..42bf058 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,12 @@
# PolNeAR v.1.0.0 - Political News Attribution Relations Corpus

PolNeAR is a corpus of news articles in which _attributions_ have been
-annotated. An attribution occurrs when an article cites statements, or
+annotated. An attribution occurs when an article cites statements, or
describes the internal state (thoughts, intentions, etc.) of some person or
group. A direct verbatim quote is an example of attribution, as is the
paraphrasing of a source's intentions or beliefs.

## Benefits
+
As of 2018, PolNeAR is the largest attribution dataset by total number of
annotated attribution relations. It is also, based on analysis described in
[1], the most _complete_ attribution corpus, in the sense of having high
@@ -24,6 +25,7 @@ See the section entitled "Accompanying software" at the end of this README for details.

## News Publishers
+
PolNeAR consists of news articles from 7 US national news publishers \*:

- Huffington Post (`huff-post`)
@@ -49,8 +51,8 @@ publisher, and candidate of focus in the article, as follows.

 1. **Publisher**: 144 articles were sampled from each publisher.

- 2. **Time**: 84 articles were sampled uniformly from each 12 month-long
-    period between 8-Nov-2015 to 8-Nov-2016.
+ 2. **Time**: 84 articles were sampled uniformly from each of 12 separate
+    month-long periods between 8-Nov-2015 and 8-Nov-2016.

 3. **Focal Candidate**: 504 articles were respectively sampled from
    articles mentioning Trump or Clinton the weak majority of the time. A
@@ -67,6 +69,7 @@ total of 1008 articles.

## Genre
+
We endeavored to include only the hard news genre, and to exclude soft news
and other genres such as editorials, real estate, travel, advice, letters,
obituaries, reviews, essays, etc.
@@ -81,14 +84,16 @@ Breitbart.

## Train, Dev, Test splits
-PolNeAR is split into training, development, testing subsets. The analyst
+
+PolNeAR is split into training, development, and testing subsets. The analyst
should avoid viewing the dev and test subsets, and should only test a model
architecture once on the test set. The train subset includes all articles from
-the first 10 month-long periods of coverage. The dev and test subsets include
-respectively articles drawn from the 11th and 12th month.
+the first 10 month-long periods of coverage. The dev and test subsets include,
+respectively, articles drawn from the 11th and 12th month.

## Statistics

+
==========================================================
# Articles, core dataset | 1008 |
@@ -109,10 +114,10 @@ the corpus
## Data File Structure
-The PolNeAR data resides under the /data directory. There is one subdirectory
+The PolNeAR data resides under the [`PolNeAR/data`](data) directory. There is one subdirectory
for each _compartment_ of the dataset. There are 5 compartments. Three of the
compartments correspond to the core dataset's train/test/dev subsets. The
-other two relate to quality control during annotation. The /data directory
+other two relate to quality control during annotation. The [`PolNeAR/data`](data) directory
also contains a file called `metadata.tsv`, which provides a listing of all the
news articles along with metadata, including which annotators have annotated
each one.
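
For example, you can confirm this layout from Python (a minimal sketch that
assumes only that it is run from the repository root; it uses nothing beyond
the standard library):

    >>> import os
    >>> sorted(os.listdir('data'))  # five compartment directories plus metadata.tsv
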
@@ -166,24 +171,27 @@ original text files, which should be obtained from the Penn Treebank 2 corpus.
## Preprocessing
## Annotation
+
The annotation of attributions was performed manually by 6 trained annotators,
who each annotated approximately 168 articles in the core dataset, 4 articles
for assessing training, and 54 articles for comparison to PARC3.
-To provide core NLP annotations, such as tokenization, sentnece splitting,
+To provide core NLP annotations, such as tokenization, sentence splitting,
part-of-speech tagging, constituency and dependency parsing, named entity
recognition, and coreference resolution, we provide annotations produced
automatically by the CoreNLP software in parallel with the manual attribution
annotations. See _Automated Annotation by CoreNLP_ below.
## Manual Annotation
+
### Training
+
All annotators were trained in two 2-hour periods, in which they reviewed the
-the guidelines (see /annotation-guidelines/guidelines.pdf). after each major
+guidelines (see [`PolNeAR/annotation-guidelines/guidelines.pdf`](annotation-guidelines/guidelines.pdf)). After each major
section in the guidelines, we conducted a group discussion amongst the
annotators to answer any questions and rectify any misconceptions. Annotators
-were provided practice 2 practice articles as practice annotation.
+were provided 2 articles for practice annotation.
Annotators were then provided the templates document
-(/annotation-guidelines/templates.pdf), which was designed to provide quick
+([`PolNeAR/annotation-guidelines/templates.pdf`](annotation-guidelines/templates.pdf)), which was designed to provide quick
reference and examples to guide annotation.
After annotating the practice articles, we discussed the annotations as a
@@ -191,17 +199,19 @@ group, using the existing language in the guidelines to resolve disagreements
or misconceptions.
Near the end of the second training session, annotators were shown examples in
-/annotation-guidelines/guidelines-training-interactive.pdf, and asked to
+[`PolNeAR/annotation-guidelines/guidelines-training-interactive.pdf`](annotation-guidelines/guidelines-training-interactive.pdf), and asked to
describe how they would annotate them. The examples were designed to be
difficult, but to have a correct answer according to the guidelines.
### Training Articles
+
After training was complete, annotators annotated 4 articles to measure their
initial agreement and verify that training had been successful. These articles
provide an indication of agreement level immediately after the training
process.
### Ongoing Monitoring of Annotation Quality
+
Each annotator annotated approximately 18 articles every week. As a quality
control measure, weekly group meetings were held with all annotators in which
-which we reviewed two articles that had been annotated by all annotators.
+we reviewed two articles that had been annotated by all annotators.
During the meeting, the annotations that each annotator made in the two shared
articles were aligned to clearly show the cases where annotators had agreed or
disagreed on how to perform the annotation. The discussions were conducted to
@@ -209,6 +219,7 @@ encourage consensus by appealing to the existing guidelines and, especially,
the templates.
## Automated Annotation by CoreNLP
+
Automated annotations within directories named "corenlp" were produced by
running the CoreNLP software [2], using the following annotators: `tokenize`,
`ssplit`, `pos`, `lemma`, `ner`, `parse`, and `dcoref`; and with the output
@@ -217,16 +228,18 @@ format 'xml' chosen. The following was set in the properties file:
ner.model = 'edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz'
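
For orientation, a properties file consistent with this description would look
roughly like the following sketch. It is an illustrative reconstruction rather
than the exact file that was used: the annotator list, output format, and
`ner.model` path are taken from the description above, and the property names
are the standard CoreNLP ones.

    # hypothetical properties file, reconstructed from the description above
    annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
    outputFormat = xml
    ner.model = edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz
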
## Annotation Quality
+
The quality of annotations was assessed using various agreement-based metrics.
Please see the associated paper for results [1].
## Article Metadata
-The file `PolNeAR/data/metadat.tsv` lists every article in PolNeAR and provides
+The file [`PolNeAR/data/metadata.tsv`](data/metadata.tsv) lists every article in PolNeAR and provides
several metadata fields containing information about the article itself, and how it was annotated.
### Metadata about the articles
-The following fields are hopefully self-explanatory:
+
+The following fields are, hopefully, self-explanatory:
`filename`, `publisher`, `publication_date`, `author`, `title`.
The fields `trump_count` and `clinton_count` indicate the number of times
@@ -239,6 +252,7 @@ publisher has given credit for a story to another news publisher, or to a
wire service, such as AP or Reuters.
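
As a quick illustration, these fields can be inspected directly (a minimal
sketch; it assumes only that `metadata.tsv` is a standard tab-separated table
whose header row names the fields described in this section):

    >>> import csv
    >>> from collections import Counter
    >>> with open('data/metadata.tsv') as f:
    ...     rows = list(csv.DictReader(f, delimiter='\t'))
    >>> Counter(row['publisher'] for row in rows)  # articles per publisher
    >>> # articles mentioning Trump more often than Clinton
    >>> sum(int(row['trump_count']) > int(row['clinton_count']) for row in rows)
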
### Metadata about annotation
+
The fields `compartment`, `level`, and `annotators` indicate how the article
was annotated. First, `compartment` indicates the compartment into which the
annotation falls:
- `annotator-training` indicates that the articles were used during
training of the annotators, to test their interannotator agreement and
@@ -276,14 +290,19 @@ PARC3 approach to annotation.
## Accompanying software
+
If you are a Python user, the easiest way to work with this dataset is to
install the `polnear` module and import it into your programs.
-Go to /data/software and do:
+To install the module into your current environment, navigate to the [`PolNeAR/software`](software) subdirectory and execute:
$ python setup.py install
-Then, in your Python program, import the dataset as follows:
+(**NOTE:** If you manage Python virtual environments with [`conda`](https://docs.continuum.io/anaconda/),
+be aware that the `PolNeAR` dependencies will be installed by [`pip`](https://pip.pypa.io/en/stable/installing/).
+To avoid dependency conflicts, it is recommended to give the conda environment
+its own `pip` at creation time, e.g. `conda create --name custom_venv_name pip`.)
+
+Following installation, you can import the dataset in your Python programs as follows:
from polnear import data
@@ -363,7 +382,7 @@ an article as a `unicode`:
More interestingly, you can get a representation of the article with
annotations:
- >>> annotated_article = data[0].annotated()
+ >>> annotated_article = article.annotated()
Here, `annotated_article` is an `AnnotatedText` object which is modelled
after the `corenlp_xml_reader.AnnotatedText` object. It
@@ -375,7 +394,7 @@ documentation](http://corenlp-xml-reader.readthedocs.io/en/latest/). Here, we
document the access of attribution annotations.
First, let's suppose you want to iterate over the sentences of a document,
-and then do something every time you encounter an attribution.
+and then do something each time you encounter an attribution.
>>> for sentence in annotated_article.sentences:
... for attribution_id in sentence['attributions']:
@@ -479,7 +498,7 @@ So in all, there are three ways to access attribution information:
2. Starting from a sentence, look to the value of `sentence['attributions']`,
3. Starting from a token, look to the value of `token['attributions']`.
-Again, for more information on how to navigate the AnnotatedText object and access other annotations such as coreference resolution, dependency and constituency parses, etc., refer to documentation for [`corenlp_xml_reader.AnnotatedText`](http://corenlp-xml-reader.readthedocs.io/en/latest/).
+Again, for more information on how to navigate the AnnotatedText object and access other annotations (such as coreference resolution, dependency and constituency parses, etc.), refer to the documentation for [`corenlp_xml_reader.AnnotatedText`](http://corenlp-xml-reader.readthedocs.io/en/latest/).
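
As a minimal sketch of how these access paths fit together (it assumes,
following the `corenlp_xml_reader` data model, that each sentence also exposes
its tokens as `sentence['tokens']`, and that `token['attributions']` contains
the ids of the attributions in which the token participates):

    >>> for sentence in annotated_article.sentences:
    ...     for attribution_id in sentence['attributions']:
    ...         # tokens in this sentence that take part in the attribution
    ...         members = [
    ...             token for token in sentence['tokens']
    ...             if attribution_id in token['attributions']
    ...         ]
    ...         print(attribution_id, len(members))
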
[1] _An attribution relations corpus for political news_,
diff --git a/software/.gitignore b/software/.gitignore
new file mode 100644
index 0000000..ff703be
--- /dev/null
+++ b/software/.gitignore
@@ -0,0 +1,91 @@
+## *** Ignores courtesy of (https://github.com/kennethreitz/samplemod/blob/master/.gitignore) ***
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# IPython Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# dotenv
+.env
+
+# virtualenv
+.venv/
+venv/
+ENV/
+
+# Spyder project settings
+.spyderproject
+
+# Rope project settings
+.ropeproject
\ No newline at end of file
diff --git a/software/setup.py b/software/setup.py
index 09f838b..49206d4 100644
--- a/software/setup.py
+++ b/software/setup.py
@@ -60,7 +60,7 @@
# What does your project relate to?
keywords= (
- 'NLP natrual language processing computational linguistics ',
+ 'NLP natural language processing computational linguistics ',
'PolNeAR Political News Attribution Relations Corpus'
),
@@ -69,5 +69,5 @@
packages=['polnear'],
#include_package_data=True,
install_requires=[
- 'parc-reader==0.1.5', 't4k', 'corenlp-xml-reader', 'brat-reader']
+ 'parc-reader==0.1.5', 't4k>=0.6.4', 'corenlp-xml-reader>=0.1.3', 'brat-reader>=0.0.0']
)