[WIP] Feature encoders by tixxit · Pull Request #80 · stripe-archive/brushfire

tixxit · 2016-01-27T15:53:29Z

This is super early work, but we have a CsvTrainerJob, which can run on ~arbitrary CSVs, with the labels provided by the user. The actual types of the values will be inferred before training.

avibryant · 2016-01-29T05:22:18Z

brushfire-features/src/main/scala/com/stripe/brushfire/features/CsvRowFeatureMapping.scala

20 seems pretty small; I'd think we'd want to allow at least 100 or 200 here.

tixxit · 2016-02-12T14:43:25Z

@avibryant So, this has some of the stuff I've been doing (still very WIP), but the important bits are:

- FValue ADT, which is basically just Number | Text | Boolean right now
- FeatureParser[-A] that is basically just A => Map[String, FValue] (e.g. A could be CsvRow)
  - there is also a TrainingDataFeatureParser[-A] which includes ID/timestamp/target parsers
- FeatureEncoder[K, +V] that can convert some Map[String, FValue] => Map[K, V]
  - the expectation is that FeatureEncoders will be (de)serialized and stored alongside the model
- FeatureEncoding[K, V, T] that holds a FeatureEncoder[K, V] and Splitter[V, T] (we can get rid of the T after your PR lands)
  - the splitter is used for training and then thrown away, but we serialize the encoder
- TrainingPlatform#Trainer[-A, +B] which describes some series of passes over data of type A to end up with a B, while passing some state between each of the passes
  - this looks suspiciously like an Iteratee and is pretty much a type Trainer[A, B] = Free[({ type f[x] = Aggregator[A, _, x] })#f, B] currently, though we tag along some platform-specific context with the aggregator

There is an implementation, DispatchedFeatureEncoding, of a FeatureEncoding for the dispatched type, which has a "trainer" that does a pass over the data and attempts to infer the sub-type of Dispatched to use for it.
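To make the "series of passes with state handed between them" idea concrete, here is a minimal sketch. It is not the PR's Trainer: it swaps the Free-over-Aggregator structure for a plain fold, drops the variance annotations and the platform-specific context, and every name in it (Done, Pass, ColumnSummary, InferColumn) is hypothetical. The first pass decides whether a column looks numeric, and its result picks which second pass to run.

```scala
import scala.util.Try

// Simplified stand-in for a multi-pass trainer: either finished, or one
// fold over the data whose final state chooses the next trainer.
sealed trait Trainer[A, B] { def run(data: Iterable[A]): B }

final case class Done[A, B](result: B) extends Trainer[A, B] {
  def run(data: Iterable[A]): B = result
}

final case class Pass[A, S, B](zero: S, step: (S, A) => S, next: S => Trainer[A, B])
    extends Trainer[A, B] {
  // Fold once over the data, then run whatever the resulting state selects.
  def run(data: Iterable[A]): B = next(data.foldLeft(zero)(step)).run(data)
}

// Hypothetical result of inference over one column of raw cells.
sealed trait ColumnSummary
final case class NumericRange(min: Double, max: Double) extends ColumnSummary
final case class Categories(values: Set[String]) extends ColumnSummary

object InferColumn {
  private def isNumber(cell: String): Boolean = Try(cell.toDouble).isSuccess

  // Second pass for numeric columns: track the observed range.
  private val numericPass: Trainer[String, ColumnSummary] =
    Pass[String, (Double, Double), ColumnSummary](
      (Double.PositiveInfinity, Double.NegativeInfinity),
      { case ((lo, hi), cell) => (math.min(lo, cell.toDouble), math.max(hi, cell.toDouble)) },
      { case (lo, hi) => Done[String, ColumnSummary](NumericRange(lo, hi)) }
    )

  // Second pass for everything else: collect the distinct values.
  private val textPass: Trainer[String, ColumnSummary] =
    Pass[String, Set[String], ColumnSummary](
      Set.empty[String],
      (seen, cell) => seen + cell,
      seen => Done[String, ColumnSummary](Categories(seen))
    )

  // First pass: is every cell parseable as a number?
  val trainer: Trainer[String, ColumnSummary] =
    Pass[String, Boolean, ColumnSummary](
      true,
      (allNumeric, cell) => allNumeric && isNumber(cell),
      allNumeric => if (allNumeric) numericPass else textPass
    )
}
```

The numeric-versus-text rule here is only an illustration; the actual inference in DispatchedFeatureEncoding picks among the Dispatched sub-types and is not reproduced.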

The main goal of the FeatureParser vs FeatureEncoder split is to separate three things: the input type, the input-type-agnostic feature encoding bits, and the tree's K/V types. That way we can train off CSV data or Thrift and still write a web service that accepts JSON.
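A rough sketch of how that split could look, assuming simplified versions of the types described above. The trait bodies, the CsvRow shape, the use of Map[String, Any] as a stand-in for a decoded JSON object, and the NumericEncoder are all assumptions made for illustration (and the Boolean case is named Bool here only to avoid shadowing scala.Boolean). The point is that two parsers for different inputs can feed one shared encoder.

```scala
import scala.util.Try

sealed trait FValue
object FValue {
  final case class Number(value: Double) extends FValue
  final case class Text(value: String) extends FValue
  final case class Bool(value: Boolean) extends FValue
}

// Input-specific: turns some input type A into named feature values.
trait FeatureParser[-A] {
  def parse(a: A): Map[String, FValue]
}

// Input-agnostic: turns named feature values into the tree's K/V pairs.
trait FeatureEncoder[K, +V] {
  def encode(features: Map[String, FValue]): Map[K, V]
}

// Hypothetical CSV row: header and cells are assumed to have equal length.
final case class CsvRow(header: Vector[String], cells: Vector[String])

object CsvRowParser extends FeatureParser[CsvRow] {
  private def parseCell(cell: String): FValue = cell.toLowerCase match {
    case "true"  => FValue.Bool(true)
    case "false" => FValue.Bool(false)
    case _ =>
      Try(cell.toDouble).toOption match {
        case Some(d) => FValue.Number(d)
        case None    => FValue.Text(cell)
      }
  }

  def parse(row: CsvRow): Map[String, FValue] =
    row.header.zip(row.cells).map { case (name, cell) => name -> parseCell(cell) }.toMap
}

// Stand-in for a decoded JSON object; unsupported value types are dropped.
object JsonObjectParser extends FeatureParser[Map[String, Any]] {
  private def toFValue(v: Any): Option[FValue] = v match {
    case d: Double  => Some(FValue.Number(d))
    case i: Int     => Some(FValue.Number(i.toDouble))
    case b: Boolean => Some(FValue.Bool(b))
    case s: String  => Some(FValue.Text(s))
    case _          => None
  }

  def parse(obj: Map[String, Any]): Map[String, FValue] =
    for { (k, v) <- obj; fv <- toFValue(v) } yield k -> fv
}

// One encoder shared by both inputs: keeps only numeric features.
object NumericEncoder extends FeatureEncoder[String, Double] {
  def encode(features: Map[String, FValue]): Map[String, Double] =
    features.collect { case (k, FValue.Number(d)) => k -> d }
}
```

With this shape, training can go through CsvRowParser while a service goes through JsonObjectParser, and both hand the same Map[String, FValue] to whichever encoder was stored with the model.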

tixxit · 2016-02-12T15:02:56Z

brushfire-features/src/main/scala/com/stripe/brushfire/features/DispatchedFeatureEncoder.scala

+        a <- this.uniques
+        b <- that.uniques
+        c = a ++ b
+        if (c.size < 20)


As @avibryant mentioned earlier, this should probably be bumped up or made configurable.
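One way the hard-coded 20 could become configurable, as a sketch only: CappedUniques and maxUniques are hypothetical names, and the merge mirrors the for-comprehension in the diff rather than the actual DispatchedFeatureEncoder code. Once the merged set reaches the cap, the uniques are dropped (None) and stay dropped through later merges.

```scala
// Tracks distinct values up to a configurable cap. None means the cap was
// reached and the column should no longer be treated as low-cardinality.
final case class CappedUniques(uniques: Option[Set[String]], maxUniques: Int) {
  def ++(that: CappedUniques): CappedUniques = {
    // If the two sides disagree on the cap, use the stricter one.
    val limit = math.min(maxUniques, that.maxUniques)
    val merged = for {
      a <- this.uniques
      b <- that.uniques
      c = a ++ b
      if c.size < limit
    } yield c
    CappedUniques(merged, limit)
  }
}

object CappedUniques {
  def of(value: String, maxUniques: Int): CappedUniques =
    CappedUniques(Some(Set(value)), maxUniques)
}
```

A job could then take the cap as an argument, for example 200, instead of compiling it in.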

tixxit · 2016-02-12T15:05:24Z

brushfire-features/src/main/scala/com/stripe/brushfire/features/DispatchedFeatureEncoder.scala

+import com.twitter.algebird.{ Aggregator, Semigroup, Monoid }
+
+case class DispatchedFeatureEncoding[L](encoder: DispatchedFeatureEncoder)
+extends FeatureEncoding[String, FeatureValue, Map[L, Long]] with Defaults {


Sorry for the terrible name collisions here, but FeatureValue is a type alias for Dispatched[Double, String, Double, String].

thomas-stripe added 4 commits January 26, 2016 14:41

WIP.

fd56359

First pass at TSV feature encoder.

5510dd7

Add CsvTrainerJob.

4740d1a

Update iris script.

102338b

avibryant reviewed Jan 29, 2016

thomas-stripe added 2 commits February 9, 2016 14:57

Get Iris example job working with CsvTrainer.

5969fb1

Add FeatureParser and FeatureEncoding.

3042705

Use splitters from encoding in CsvTrainer.

bf8a5ce

tixxit reviewed Feb 12, 2016

Remove old FeatureMapping crud.

903eeb0

tixxit reviewed Feb 12, 2016

tixxit wants to merge 8 commits into master from thomas-feature-encoder
