Some thoughts and suggestions #4

@talolard

Hey!
Thanks for putting these together. I'm the founder of LightTag. I have some suggestions which I'll try to keep as unbiased as possible.

I think the repo content currently focuses on DIY annotation, e.g. how to put together your own ad-hoc tool. There are many good tools out there, including a few that are open source. I imagine that most readers don't want to roll their own, but are instead looking for a way to solve their problem quickly.

The division into binary, multi-class, and span annotations is very good. These are a progression in terms of complexity and are, to a degree, distinct use cases. Perhaps consider adding an introductory section that describes what each of these is and how they relate, possibly with pictures or GIFs (we're happy to provide some).

Aside from the actual annotation, the typical user has non-functional requirements. Notably: how do I bring in a team, can I use active learning (and how), how do I evaluate the quality of annotations, and what is a good output format for my use case? Another common consideration is the tokenization scheme used by the annotation interface, e.g. can I annotate subwords and phrases, or am I pre-committed to tokens as defined by some external system? All of these are worth discussing.
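To make the tokenization point concrete, here is a minimal sketch of the difference between a free character-offset span and one that is forced onto token boundaries. It assumes a plain whitespace tokenizer; the function names `tokenize` and `snap_to_tokens` and the sample sentence are made up for illustration and don't come from any particular tool.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def tokenize(text: str) -> List[Span]:
    """Hypothetical whitespace tokenizer returning character offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def snap_to_tokens(span: Span, tokens: List[Span]) -> Optional[Span]:
    """Expand a character span to the smallest run of whole tokens covering it.

    Returns None if the span touches no token (e.g. it covers only whitespace).
    """
    start, end = span
    covered = [(s, e) for s, e in tokens if s < end and e > start]
    if not covered:
        return None
    return covered[0][0], covered[-1][1]


text = "Annotators label subword spans"
tokens = tokenize(text)

free_span = (17, 20)                      # the annotator selected "sub"
snapped = snap_to_tokens(free_span, tokens)

print(text[free_span[0]:free_span[1]])    # sub
print(text[snapped[0]:snapped[1]])        # subword
```

A tool that stores character offsets can always be snapped to tokens later, while a tool that only stores token indices can't recover the subword selection, which is why this is worth asking about before picking an interface.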

Some users will be interested in outsourcing their annotations to Mechanical Turk or similar services. It's worth discussing the tradeoffs here (quality vs. price, expertise, and language limitations). MTurk has external questions, which let the user provide their own annotation interface; this is often very handy.

It's worth talking about why you annotate data. People do this for different reasons. The obvious one is to collect labeled data to train machine learning models on. Often, companies will realize in hindsight that they also need to collect a test set using different procedures (checking inter-annotator agreement, and avoiding active learning so the test set isn't biased).
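As a rough illustration of the inter-annotator agreement check mentioned above, here is a minimal sketch of Cohen's kappa for two annotators labeling the same items. Kappa is just one common agreement measure, picked here as an example; the function name `cohen_kappa` and the sample labels are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)

    # Fraction of items on which the two annotators gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

In this toy example the annotators agree on 3 of 4 items (0.75 raw agreement), but half of that would be expected by chance given their label frequencies, so kappa comes out at 0.5.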

Sometimes, often in fact, people annotate for non-ML reasons. Linguists do this all the time, annotating treebanks and the like to study linguistic phenomena. This is also common in the "computational social sciences", where the point of annotation is to learn something about society, not to train a model. It would be useful for the repo to talk about these distinctions, if only to help potential readers articulate what they are trying to do.

Hope this helps. I'm happy to submit some PRs here, but I'm hesitant because that might bias the repo towards LightTag. We'd love that, but it would undermine what you are trying to do here.
