Tagger Cache

WinnowTag edited this page Sep 14, 2010 · 2 revisions

The Tagger Cache provides a cache of the taggers that the classifier can use to classify items. Taggers are stored in the cache using the URL of their training document as the key.

The classification worker threads will access the get_tagger function of the tagger cache. This function operates as described in the following diagram:

Tagger Retrieval

When a tagger is retrieved from the tagger cache, the cache attempts to update the tagger definition by fetching the document from the tagger’s URL. The request includes an If-Modified-Since header, and the tagger is updated only if the response indicates that the tag definition has changed since it was last retrieved.
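The revalidation step can be sketched as follows. This is a minimal illustration, not the classifier's actual code: `CachedTagger`, `refresh`, and the injected `fetch` callable are hypothetical names, and the sketch assumes the server answers a conditional request with 304 Not Modified when the definition is unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher: (url, request headers) -> (status, body, Last-Modified value)
Fetcher = Callable[[str, Dict[str, str]], Tuple[int, str, Optional[str]]]


@dataclass
class CachedTagger:
    url: str
    definition: str = ""
    last_modified: Optional[str] = None  # Last-Modified value from the previous fetch


def refresh(tagger: CachedTagger, fetch: Fetcher) -> bool:
    """Revalidate the tagger definition; return True if it was replaced."""
    headers: Dict[str, str] = {}
    if tagger.last_modified is not None:
        # Ask the server to send the document only if it has changed.
        headers["If-Modified-Since"] = tagger.last_modified

    status, body, last_modified = fetch(tagger.url, headers)
    if status == 304:
        # Not Modified: keep the cached definition as it is.
        return False
    if status == 200:
        tagger.definition = body
        tagger.last_modified = last_modified
        return True
    raise RuntimeError(f"unexpected HTTP status {status} for {tagger.url}")
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP library; a real implementation would issue the request itself.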

Building Taggers

The tagger cache is also responsible for building taggers from the definition documents. This involves fetching all the items defined as examples in the document, training the tagger with those examples, and pre-calculating the probabilities for all the tokens in the trained tagger. The taggers returned by the tagger cache are always in the precomputed state, which means they are ready to be used in classification without any further processing.

In some cases, examples defined in a tagger document may not exist in the classifier’s item cache. When this occurs, the tagger cache will add these items to the item cache, place the tagger in the “partially trained” state and return an error. Each subsequent call to get a tagger in this state will attempt to train the tagger with the missing items. Once the items have been successfully added to the item cache, the tagger’s training is completed and the tagger is checked out and returned to the next caller. This allows jobs for partially trained taggers to simply be added to the back of the job queue and processed again when another worker takes the job off the queue.
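The state handling described above might look roughly like the sketch below. All names here (`State`, `Tagger`, `MissingItemsError`, `train`) are illustrative assumptions, the item cache is modelled as a plain dictionary, and "adding items to the item cache" is modelled as recording their ids in a `pending` set. The actual training and probability precomputation are reduced to a state change.

```python
from enum import Enum


class State(Enum):
    LOADED = "loaded"
    PARTIALLY_TRAINED = "partially trained"
    PRECOMPUTED = "precomputed"


class MissingItemsError(Exception):
    """Raised when some example items are not yet in the item cache."""


class Tagger:
    def __init__(self, example_ids):
        self.example_ids = list(example_ids)
        self.trained_ids = set()  # examples already used for training
        self.state = State.LOADED


def train(tagger, item_cache, pending):
    """Train with every available example; request the ones that are missing.

    item_cache: dict mapping item id -> item (hypothetical stand-in)
    pending:    set collecting ids that should be added to the item cache
    """
    missing = []
    for item_id in tagger.example_ids:
        if item_id in tagger.trained_ids:
            continue  # already trained on this example in an earlier call
        if item_id in item_cache:
            tagger.trained_ids.add(item_id)
        else:
            pending.add(item_id)
            missing.append(item_id)

    if missing:
        tagger.state = State.PARTIALLY_TRAINED
        raise MissingItemsError(f"{len(missing)} example item(s) not in item cache")

    # Stand-in for pre-calculating the token probabilities.
    tagger.state = State.PRECOMPUTED
    return tagger
```

A worker that catches `MissingItemsError` can put the job back on the queue; a later call to `train` picks up where the previous one stopped.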

Locking

The Tagger Cache includes a locking mechanism that ensures each tagger can only be used by one thread at a time. This is implemented as a checkout mechanism. When a thread gets a tagger from the cache, the cache first checks that the tagger has not been checked out by another thread. If the tagger is available, it is marked as checked out and returned to the caller. If the tagger is already checked out, an error is returned to the caller. When a thread is finished with the tagger, it must call the release_tagger function, which checks the tagger back in and makes it available to other threads.
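A minimal sketch of the checkout mechanism, assuming a single mutex guards the cache's bookkeeping. The method names `get_tagger` and `release_tagger` come from this page; the class layout, the `add` helper, and `TaggerCheckedOutError` are assumptions made for illustration.

```python
import threading


class TaggerCheckedOutError(Exception):
    """Raised when a tagger is already in use by another thread."""


class TaggerCache:
    def __init__(self):
        self._lock = threading.Lock()  # protects the two structures below
        self._taggers = {}             # training document URL -> tagger
        self._checked_out = set()      # URLs of taggers currently in use

    def add(self, url, tagger):
        with self._lock:
            self._taggers[url] = tagger

    def get_tagger(self, url):
        """Check the tagger out, or fail if another thread holds it."""
        with self._lock:
            if url in self._checked_out:
                raise TaggerCheckedOutError(url)
            tagger = self._taggers[url]  # KeyError if the URL is unknown
            self._checked_out.add(url)
            return tagger

    def release_tagger(self, url):
        """Check the tagger back in so other threads can use it."""
        with self._lock:
            self._checked_out.discard(url)
```

Callers would typically pair the two calls with `try`/`finally` so that a tagger is released even when classification fails. Returning an error instead of blocking means a worker is never stalled waiting for a busy tagger.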

Persistence

The Tagger Cache does not currently support any form of persistence.

Tag Index

The Tagger Cache also provides a facility for fetching an index of tag URLs. The index is expected to be an Atom document containing a list of tags as entries, with each entry containing a link to the tag’s training document. The cache applies the same If-Modified-Since HTTP caching to this index. The index allows the system to get a list of all the tags defined in another system that may need classifying, so that it can generate classification jobs automatically. For example, when new items are published to the classifier’s item cache, the classifier uses this facility to generate new item classification jobs for each tag in Winnow.
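Extracting the training document URLs from such an index could be sketched as below. The function name `tag_urls` is hypothetical, and because this page does not say which link relation identifies the training document, the sketch simply takes the first `link` element with an `href` in each entry.

```python
import xml.etree.ElementTree as ET
from typing import List

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def tag_urls(index_xml: str) -> List[str]:
    """Return one training document URL per entry in an Atom tag index."""
    root = ET.fromstring(index_xml)
    urls = []
    for entry in root.iter(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            href = link.get("href")
            if href:
                # Assumption: the first link with an href is the training document.
                urls.append(href)
                break
    return urls
```

Each returned URL can then serve as the key for looking up the corresponding tagger in the cache.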