
Training failure when using non-default number of threads #3

@rszeto

Description


Hi Emily,

I ran into a problem running the train_drnet.lua script with the --nThreads flag set to a value other than 0 (I tried 1, 2, and 5). I get the following output:

Found Environment variable CUDNN_PATH = /usr/local/cudnn-v5.1/lib64/libcudnn.so.5{
  contentDim : 64
  seed : 1
  beta1 : 0.9
  name : "default"
  learningRate : 0.002
  movingDigits : 1
  batchSize : 100
  imageSize : 64
  optimizer : "adam"
  model : "dcgan"
  save : "logs//moving_mnist/default"
  gpu : 0
  dataRoot : "data"
  depth : 18
  dataWarmup : 10
  advWeight : 0
  dataset : "moving_mnist"
  epochSize : 50000
  cropSize : 227
  maxStep : 12
  normalize : false
  nEpochs : 200
  poseDim : 5
  decoder : "dcgan"
  dataPool : 200
  nThreads : 1
  nShare : 1
}
<torch> set nb of threads to 1	
<gpu> using device 0	
Loaded models from file	
/home/szetor/build/torch/install/bin/luajit: .../szetor/build/torch/install/share/lua/5.1/trepl/init.lua:389: ...or/build/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] ./data/moving_mnist.lua:12: attempt to index local 'opt' (a nil value)
stack traceback:
	./data/moving_mnist.lua:12: in function 'getData'
	./data/moving_mnist.lua:32: in function '__init'
	.../szetor/build/torch/install/share/lua/5.1/torch/init.lua:91: in function <.../szetor/build/torch/install/share/lua/5.1/torch/init.lua:87>
	[C]: in function 'MovingMNISTLoader'
	./data/moving_mnist.lua:150: in main chunk
	[C]: in function 'require'
	./data/data.lua:10: in function 'getDatasourceFun'
	./data/threads.lua:28: in function <./data/threads.lua:18>
	[C]: in function 'xpcall'
	...or/build/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
	...etor/build/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...etor/build/torch/install/share/lua/5.1/threads/queue.lua:41>
	[C]: in function 'pcall'
	...etor/build/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
	[string "  local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
	[C]: in function 'error'
	.../szetor/build/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
	train_drnet.lua:321: in main chunk
	[C]: in function 'dofile'
	...uild/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50

My guess is that the MovingMNISTLoader class doesn't have access to the global opt variable when data loading runs in worker threads, unlike when opt.nThreads is 0 and everything runs in the main thread.

I would appreciate your help in fixing this issue. Thank you.
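In case it helps, here is a rough sketch of the kind of workaround I have in mind (untested, and the names `opt_main` and the `dofile` path are just placeholders): with the torch `threads` library, globals defined in the main thread are not visible inside worker threads, so `opt` would need to be captured as an upvalue and re-exposed in each worker's init callback.

```lua
-- Hypothetical sketch: pass the main thread's options table into each
-- worker thread explicitly, rather than relying on the global `opt`,
-- which exists only in the main thread.
local Threads = require 'threads'

local opt_main = opt  -- capture the main-thread options as an upvalue

local pool = Threads(
   opt_main.nThreads,
   function(threadid)
      -- Re-create the global `opt` inside this worker thread so that
      -- code such as data/moving_mnist.lua, which indexes the global
      -- `opt`, works the same as in the single-threaded path.
      opt = opt_main
      dofile('data/data.lua')  -- placeholder for the per-thread setup
   end
)
```

The key point is just that anything the worker callbacks need (like `opt`) has to be serialized into the thread via an upvalue of the init function, not looked up as a global.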
