This repository was archived by the owner on Apr 8, 2021. It is now read-only.

[WIP] Feature encoders#80

Open
tixxit wants to merge 8 commits into master from thomas-feature-encoder
tixxit commented Jan 27, 2016

This is super early work, but we have a CsvTrainerJob, which can run on ~arbitrary CSVs, with the labels provided by the user. The actual types of the values will be inferred before training.

Review comment from a contributor:

20 seems pretty small, I'd think we'd want to allow at least 100 or 200 here.

tixxit commented Feb 12, 2016

@avibryant So, this has some of the stuff I've been doing (still very WIP), but the important bits are:

  • FValue ADT, which is basically just Number | Text | Boolean right now
  • FeatureParser[-A] that is basically just A => Map[String, FValue] (e.g. A could be CsvRow)
    • there is also a TrainingDataFeatureParser[-A] which includes ID/timestamp/target parsers
  • FeatureEncoder[K, +V] that can convert some Map[String, FValue] => Map[K, V]
    • the expectation is that FeatureEncoders will be (de)serialized and stored alongside the model
  • FeatureEncoding[K, V, T] that holds a FeatureEncoder[K, V] and Splitter[V, T] (we can get rid of the T after your PR lands)
    • the splitter is used for training and then thrown away, but we serialize the encoder
  • TrainingPlatform#Trainer[-A, +B] which describes some series of passes over data of type A to end up with a B, while passing some state between each of the passes
    • this looks suspiciously like an Iteratee and is pretty much a type Trainer[A, B] = Free[({ type f[x] = Aggregator[A, _, x] })#f, B] currently, though we tag along some platform-specific context with the aggregator
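
As a rough illustration of the first few bullets, here is a minimal sketch of how the types might fit together. It is not the code in this PR: the `Bool` name (used to avoid shadowing `scala.Boolean`), the method names `parse`/`encode`/`id`/`timestamp`/`target`, the `CsvRow` alias, and the column names `id`, `ts` and `label` are all assumptions made for the example.

```scala
object FeatureTypesSketch {
  // Simplified stand-in for the FValue ADT (Number | Text | Boolean).
  sealed trait FValue
  object FValue {
    final case class Number(value: Double) extends FValue
    final case class Text(value: String) extends FValue
    final case class Bool(value: Boolean) extends FValue // hypothetical name
  }

  // Contravariant in A: essentially A => Map[String, FValue].
  trait FeatureParser[-A] {
    def parse(a: A): Map[String, FValue]
  }

  // Adds ID/timestamp/target parsing; the result types here are guesses.
  trait TrainingDataFeatureParser[-A] extends FeatureParser[A] {
    def id(a: A): String
    def timestamp(a: A): Long
    def target(a: A): String
  }

  // Covariant in V: Map[String, FValue] => Map[K, V].
  trait FeatureEncoder[K, +V] {
    def encode(features: Map[String, FValue]): Map[K, V]
  }

  // Hypothetical CSV row: column name -> raw cell.
  type CsvRow = Map[String, String]

  // Naive value parsing, purely for illustration.
  def fromRaw(raw: String): FValue = raw.trim.toLowerCase match {
    case "true"  => FValue.Bool(true)
    case "false" => FValue.Bool(false)
    case s =>
      scala.util.Try(s.toDouble).toOption match {
        case Some(d) => FValue.Number(d)
        case None    => FValue.Text(raw)
      }
  }

  val parser: TrainingDataFeatureParser[CsvRow] =
    new TrainingDataFeatureParser[CsvRow] {
      private val reserved = Set("id", "ts", "label") // assumed column names
      def id(row: CsvRow): String = row("id")
      def timestamp(row: CsvRow): Long = row("ts").toLong
      def target(row: CsvRow): String = row("label")
      def parse(row: CsvRow): Map[String, FValue] =
        row
          .filter { case (k, _) => !reserved(k) }
          .map { case (k, v) => k -> fromRaw(v) }
    }

  // Keeps numeric and boolean features as doubles and drops text.
  val encoder: FeatureEncoder[String, Double] =
    new FeatureEncoder[String, Double] {
      def encode(features: Map[String, FValue]): Map[String, Double] =
        features.collect {
          case (k, FValue.Number(n)) => k -> n
          case (k, FValue.Bool(b))   => k -> (if (b) 1.0 else 0.0)
        }
    }
}
```

In this sketch only the encoder would need to be serialized with the model; the parser is tied to the input format and can be swapped out.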

There is an implementation, DispatchedFeatureEncoding, of a FeatureEncoding for the dispatched type, which has a "trainer" that does a pass over the data and attempts to infer the sub-type of Dispatched to use for it.
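
The inference pass could look roughly like the sketch below. It only distinguishes numeric from categorical columns, whereas the real Dispatched type has more cases, and every name in it (`ColumnStats`, `Inferred`, `observe`, `infer`) is made up for the example rather than taken from the PR.

```scala
object TypeInferenceSketch {
  // Hypothetical, reduced set of inferred column types.
  sealed trait Inferred
  case object Numeric extends Inferred
  case object Categorical extends Inferred

  // Per-column summary that can be combined associatively, so it could be
  // computed in a single distributed pass.
  final case class ColumnStats(total: Long, numeric: Long) {
    def +(that: ColumnStats): ColumnStats =
      ColumnStats(total + that.total, numeric + that.numeric)

    // A column is numeric only if every observed value parsed as a number.
    def inferred: Inferred =
      if (total > 0 && numeric == total) Numeric else Categorical
  }

  // Summary for a single raw cell.
  def observe(raw: String): ColumnStats =
    ColumnStats(1L, if (scala.util.Try(raw.trim.toDouble).isSuccess) 1L else 0L)

  // One pass over the rows, merging per-column summaries as we go.
  def infer(rows: Iterable[Map[String, String]]): Map[String, Inferred] = {
    val stats = rows.foldLeft(Map.empty[String, ColumnStats]) { (acc, row) =>
      row.foldLeft(acc) { case (m, (col, raw)) =>
        val s = observe(raw)
        m.updated(col, m.get(col).fold(s)(_ + s))
      }
    }
    stats.map { case (col, s) => col -> s.inferred }
  }
}
```

Because `ColumnStats` combines with a plain sum, the same summary could presumably be expressed as an Algebird Aggregator, but that wiring is left out here.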

The main goal of the FeatureParser vs FeatureEncoder split is to keep three things separate: the input type, the input-type-agnostic feature encoding, and the tree's K/V type. That way we can train off CSV data or Thrift and still write a web service that accepts JSON.
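
To make the split concrete, here is a small sketch in which two different input types feed the same encoder. `CsvRow`, `JsonRequest`, and the method names are hypothetical stand-ins, and the JSON side is modelled as a plain map to avoid pulling in a JSON library.

```scala
object ParserEncoderSplit {
  // Reduced FValue, enough for the example.
  sealed trait FValue
  final case class Number(value: Double) extends FValue
  final case class Text(value: String) extends FValue

  trait FeatureParser[-A] { def parse(a: A): Map[String, FValue] }
  trait FeatureEncoder[K, +V] { def encode(fs: Map[String, FValue]): Map[K, V] }

  // Hypothetical input types for training (CSV) and serving (JSON-like).
  final case class CsvRow(header: Vector[String], cells: Vector[String])
  final case class JsonRequest(fields: Map[String, Either[Double, String]])

  private def fromRaw(raw: String): FValue =
    scala.util.Try(raw.trim.toDouble).toOption match {
      case Some(d) => Number(d)
      case None    => Text(raw)
    }

  val csvParser: FeatureParser[CsvRow] = new FeatureParser[CsvRow] {
    def parse(row: CsvRow): Map[String, FValue] =
      row.header.zip(row.cells).map { case (k, v) => k -> fromRaw(v) }.toMap
  }

  val jsonParser: FeatureParser[JsonRequest] = new FeatureParser[JsonRequest] {
    def parse(req: JsonRequest): Map[String, FValue] =
      req.fields.map {
        case (k, Left(d))  => k -> (Number(d): FValue)
        case (k, Right(s)) => k -> (Text(s): FValue)
      }
  }

  // The encoder never sees the input type, only Map[String, FValue].
  val encoder: FeatureEncoder[String, Double] =
    new FeatureEncoder[String, Double] {
      def encode(fs: Map[String, FValue]): Map[String, Double] =
        fs.collect { case (k, Number(n)) => k -> n }
    }
}
```

Both `encoder.encode(csvParser.parse(row))` and `encoder.encode(jsonParser.parse(request))` go through the same encoder, which is the part that would be stored alongside the model.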

a <- this.uniques
b <- that.uniques
c = a ++ b
if (c.size < 20)
Review comment from the author:

As @avibryant mentioned earlier, this should probably be bumped up or made configurable.
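
One way the limit might be made configurable is sketched below. `UniqueTracker` and `maxUniques` are hypothetical names, not identifiers from this PR, and the strict `<` comparison simply mirrors the diff above.

```scala
object UniquesSketch {
  // Tracks the distinct values seen so far, or None once there are too many.
  final case class UniqueTracker(uniques: Option[Set[String]], maxUniques: Int) {
    def ++(that: UniqueTracker): UniqueTracker = {
      // If the two sides disagree, use the stricter limit.
      val limit = math.min(maxUniques, that.maxUniques)
      val merged = for {
        a <- this.uniques
        b <- that.uniques
        c = a ++ b
        if c.size < limit
      } yield c
      UniqueTracker(merged, limit)
    }
  }

  object UniqueTracker {
    // A default of 200 follows the suggestion earlier in this thread.
    def of(value: String, maxUniques: Int = 200): UniqueTracker =
      UniqueTracker(Some(Set(value)), maxUniques)
  }
}
```

Once either side has overflowed to None, the merged result stays None, so the operation remains associative and can still be used as a semigroup.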

import com.twitter.algebird.{ Aggregator, Semigroup, Monoid }

case class DispatchedFeatureEncoding[L](encoder: DispatchedFeatureEncoder)
extends FeatureEncoding[String, FeatureValue, Map[L, Long]] with Defaults {
Review comment from the author:

Sorry for the terrible name collisions here, but FeatureValue is a type alias for Dispatched[Double, String, Double, String].
