Hi,
As far as I understand weight-norm initialization, you calculate the mean/std of the preactivations (i.e. before the nonlinearity is applied) and use these statistics to initialize the weights/biases.
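For context, here is my understanding of that data-dependent init as a NumPy sketch for a single dense layer (function and variable names are mine, not from weightnorm.py): the weights get scaled by 1/std and the bias shifted by -mean/std, so the preactivations on the init batch come out zero-mean / unit-variance.

```python
import numpy as np

def data_based_init_sketch(x, W, b, init_scale=1.0):
    """Sketch of weight-norm data-dependent init for one dense layer.

    x: init batch, shape (batch, in_dim)
    W: weights, shape (in_dim, out_dim)
    b: bias, shape (out_dim,)
    Returns rescaled (W, b) so that x.dot(W) + b has ~zero mean and
    ~init_scale std per output unit on this batch.
    """
    t = x.dot(W) + b                    # preactivations, shape (batch, out_dim)
    m = t.mean(axis=0)                  # per-unit mean
    s = np.sqrt(t.var(axis=0) + 1e-10)  # per-unit std (epsilon as in the snippet below)
    scale = init_scale / s
    W_new = W * scale                   # broadcasts over the output axis
    b_new = (b - m) * scale
    return W_new, b_new
```

With these new parameters the preactivation becomes (t - m) * scale, which is exactly the normalized statistic the init is after.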
In weightnorm.data_based_init(), it is implicitly assumed that the nonlinearity is applied in a separate layer, since we collect all layers that have a W and a b attribute and use their output to calculate the mean/std of the preactivations:
```python
layer_output_weight_bias = []
for l in model.layers:
    if hasattr(l, 'W') and hasattr(l, 'b'):
        assert(l.built)
        layer_output_weight_bias.append((l.name, l.get_output_at(0), l.W, l.b))
...
for l, o, W, b in layer_output_weight_bias:
    print('Performing data dependent initialization for layer ' + l)
    m, v = tf.nn.moments(o, [i for i in range(len(o.get_shape())-1)])
    s = tf.sqrt(v + 1e-10)
```
However, if the layer has the nonlinearity built into it, e.g. via

```python
fc = Dense(output_dim=50, activation='relu')
```

the above approach will pick up the postactivations (i.e. after the ReLU) in .get_output_at(0) and calculate the mean/std of those postactivations to rescale W and b, which I think is technically incorrect.
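To illustrate why this matters, a small NumPy demonstration (the numbers are made up): the mean/std measured after a ReLU differ substantially from the preactivation statistics that the init is supposed to normalize.

```python
import numpy as np

rng = np.random.RandomState(0)
pre = rng.randn(10000) * 3.0 + 1.0   # hypothetical preactivations of one unit
post = np.maximum(pre, 0.0)          # what a fused-ReLU layer would expose as output

# Statistics data_based_init would measure on the fused layer (postactivations):
m_post, s_post = post.mean(), post.std()
# Statistics it actually needs (preactivations):
m_pre, s_pre = pre.mean(), pre.std()
print('post:', m_post, s_post, 'pre:', m_pre, s_pre)
```

Rescaling W and b by the postactivation moments therefore does not make the preactivations zero-mean / unit-variance.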
Unfortunately, I don't see a straightforward workaround (other than forcing the nonlinearities into separate Activation layers); I have no idea how to get the preactivations out of a layer that applies its nonlinearity internally.