Restoring session: dataholder not found in checkpoint #13
Thanks for pointing this out. The problem has been identified here https://github.com/GPflow/GPflow/blob/monitor/doc/source/notebooks/monitor-tensorboard.ipynb,
Thanks for your reply. Did you incorporate this yet? I did not see any remark on how to circumvent this in the above link. What is the temporary workaround?
The workaround is to pass
I think they solved the issue in GPflow, but they have not yet released a new version, so I prefer to use the current GPflow release rather than manually cloning the GPflow git repository. I solved the problem by adding the name attribute to the Model (in the DGPModel class) and to the RBF and White kernels. Before doing this, I tried the following to control random names:
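To illustrate why fixing the names helps, here is a toy sketch in plain Python. It does not use TensorFlow or GPflow: the "checkpoint" is just a dict keyed by variable name, and the function names, the `dgp` model name and the `layers/0/q_mu` variable are made up for illustration. It only assumes what the error message suggests, namely that an unnamed model gets a fresh random prefix (like `DGP-2ea4f2f5-27`) each time it is constructed.

```python
import uuid


def build_variable_names(model_name=None):
    """Return the variable names a toy model would register.

    Without an explicit name, a new random prefix is generated on every
    call, loosely imitating the auto-generated prefix in the error message.
    """
    prefix = model_name if model_name is not None else "DGP-" + uuid.uuid4().hex[:8]
    return [prefix + "/X/dataholder", prefix + "/layers/0/q_mu"]


def save(names):
    """Toy checkpoint: a dict mapping each variable name to a value."""
    return {name: 0.0 for name in names}


def restore(checkpoint, names):
    """Look every requested name up in the checkpoint, failing on the first miss."""
    for name in names:
        if name not in checkpoint:
            raise KeyError("Key {} not found in checkpoint".format(name))
    return {name: checkpoint[name] for name in names}


# Named model: the second construction asks for the same keys, so restore works.
named_checkpoint = save(build_variable_names("dgp"))
restored = restore(named_checkpoint, build_variable_names("dgp"))

# Unnamed model: the second construction gets a different random prefix.
unnamed_checkpoint = save(build_variable_names())
try:
    restore(unnamed_checkpoint, build_variable_names())
    unnamed_restore_failed = False
except KeyError:
    unnamed_restore_failed = True
```

In this toy setting, the run that saves and the run that restores only agree on the keys when the name is passed explicitly, which is consistent with the workaround described above.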
Hi, sorry for the slow reply: to run the restore using
By that I mean use
Hello,
# Initialize data, parameters
X = ...
Y = ...
Xs = ...
Ys = ...
mm = ...
...
Z = kmeans2(X, mm, minit='points')[0]

def make_dgp_models(X, Y, Z):
    models, names = [], []
    for L in range(1, 4):
        D = X.shape[1]
        # the layer shapes are defined by the kernel dims, so here all hidden layers are D dimensional
        kernels = []
        for l in range(L):
            if l==0:
                kernels.append(RBF(D, name = str(l) + '_kernel'))
                # between layer noise (doesn't actually make much difference but we include it anyway)
                for kernel in kernels[:-1]:
                    kernel += White(D, variance=1e-5)
            else:
                kernels.append(RBF(5,name = str(l) + '_kernel'))
                # between layer noise (doesn't actually make much difference but we include it anyway)
                for kernel in kernels[:-1]:
                    kernel += White(5, variance=1e-5)
        mb = 10000 if X.shape[0] > 10000 else None
        print(mb)
        model = DGP(X, Y, Z, kernels, Gaussian(name = str(L)), num_samples=5, minibatch_size=mb)
        # start the inner layers almost deterministically
        for layer in model.layers[:-1]:
            layer.q_sqrt = layer.q_sqrt.value * 1e-3
        models.append(model)
        names.append('DGP{} {}'.format(L, len(Z)))
    return models, names

models_dgp, names_dgp = make_dgp_models(X, Y, Z)

def batch_assess(model, assess_model, X, Y, SIGMA_y):
    n_batches = max(int(X.shape[0]/1000.), 1)
    lik, sq_diff = [], []
    for X_batch, Y_batch in zip(np.array_split(X, n_batches), np.array_split(Y, n_batches)):
        l, sq = assess_model(model, X_batch, Y_batch)
        lik.append(l)
        sq_diff.append(sq)
    lik = np.concatenate(lik, 0)
    sq_diff = np.array(np.concatenate(sq_diff, 0), dtype=float)
    sq_diff = (sq_diff**0.5 * SIGMA_y)**2
    return np.average(lik), np.average(sq_diff)**0.5

S = 50

def assess_sampled(model, X_batch, Y_batch):
    m, v = model.predict_y(X_batch, S)
    S_lik = np.sum(norm.logpdf(Y_batch*Y_std, loc=m*Y_std, scale=Y_std*v**0.5), 2)
    lik = logsumexp(S_lik, 0, b=1/float(S))
    mean = np.average(m, 0)
    sq_diff = Y_std**2*((mean - Y_batch)**2)
    return lik, sq_diff

iterations_few = 50
s = '{:<16} lik: {:.4f}, rmse: {:.4f}'
for iterations in [iterations_few]:
    print('after {} iterations'.format(iterations))
    for m, name in zip(models_dgp, names_dgp):
        ng_vars = [[m.layers[-1].q_mu, m.layers[-1].q_sqrt]]
        for v in ng_vars[0]:
            v.set_trainable(False)
        tic = time.time()
        # tf.local_variables_initializer()
        # tf.global_variables_initializer()
        tf_graph = m.enquire_graph()
        tf_session = m.enquire_session()
        m.compile(tf_session)
        # ng_action = NatGradOptimizer(gamma=0.1).make_optimize_action(m, var_list=ng_vars)
        # adam_action = AdamOptimizer(0.1).make_optimize_action(m)
        # Loop([ng_action, adam_action], stop=iterations)()
        # lik, rmse = batch_assess(m, assess_sampled, Xs, Ys,SIGMA_y)
        toc = time.time()
        print('training-time:',toc-tic)
        saver = tf.train.Saver()
        # save_path = saver.save(tf_session, "/Doubly-Stochastic-DGP-master/model.ckpt")
        # print("Model saved")
        save_path = saver.restore(tf_session, "/Doubly-Stochastic-DGP-master/model.ckpt")
        print("Model loaded")
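For a checkpoint that was already written under an auto-generated prefix, one possible route is to restore through a name mapping. In TensorFlow 1.x, `tf.train.Saver` accepts a dict as `var_list`, whose keys are the names stored in the checkpoint and whose values are the variables in the current graph. The sketch below only builds the name-to-name part of such a mapping from plain strings. The helper name `build_restore_map`, the prefixes `DGP-old` and `DGP-new`, and the `layers/0/q_mu` variable are hypothetical, and it assumes the random part is always the first path component, as in the key from the error message.

```python
def build_restore_map(current_names, saved_prefix):
    """Map checkpoint names (old prefix) to current graph names (new prefix).

    Assumes each name looks like '<model prefix>/<rest>'; names without a
    '/' are mapped to themselves.
    """
    restore_map = {}
    for name in current_names:
        _, separator, rest = name.partition("/")
        if separator:
            # Swap the current (random) prefix for the one used when saving.
            restore_map[saved_prefix + "/" + rest] = name
        else:
            restore_map[name] = name
    return restore_map


# Hypothetical names: the prefix differs between the saving and restoring run.
current_names = ["DGP-new/X/dataholder", "DGP-new/layers/0/q_mu"]
restore_map = build_restore_map(current_names, "DGP-old")
```

With real variables, each value in the mapping would be replaced by the corresponding variable object before handing the dict to `tf.train.Saver(var_list=...)`. This is untested against GPflow itself, so treat it as a starting point rather than a fix.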
What is the error you get? I'm not familiar with this approach, but I suspect you need to pass a name when you create the model.
I have tried to run the script run_regression.py in the demos folder. When I run the program a second time, I get the following exception while the session is being restored. It appears that not all variables are found in the checkpoint. Did you encounter this problem as well?
############################ kin8nm L=2 split=0
N: 7372, D: 8, Ns: 820
Restoring session from /home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/Doubly-Stochastic-DGP/demos/Results/tmp_results_maximilian-p50s/kin8nm_L2_split0/checkpoints-5.
Traceback (most recent call last):
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
  [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_regression.py", line 99, in <module>
    model.enquire_session(), (s+'/checkpoints').format(dataset_name, L))
  File "/home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/gpflow-monitor/gpflow_monitor/opt_tools.py", line 88, in __init__
    self.saver.restore(session, restore_path)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1686, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
  [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
  File "run_regression.py", line 99, in <module>
    model.enquire_session(), (s+'/checkpoints').format(dataset_name, L))
  File "/home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/gpflow-monitor/gpflow_monitor/opt_tools.py", line 75, in __init__
    self.saver = tf.train.Saver(max_to_keep=3) if saver is None else saver
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1239, in __init__
    self.build()
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1248, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1284, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 765, in _build_internal
    restore_sequentially, reshape)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 428, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 268, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1031, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
  [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
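One way to check whether the key is truly absent or merely stored under a different model prefix is to compare the key from the error with the names in the checkpoint. In TensorFlow 1.x those names can be listed with `tf.train.NewCheckpointReader(path).get_variable_to_shape_map()`. The sketch below works on plain strings so that it runs without TensorFlow. The helper names and the checkpoint keys in the example (`DGP-1b3c5d7e-4/...`) are hypothetical, and it assumes the model prefix is the first path component of each name.

```python
import re


def missing_key(error_message):
    """Extract the variable name from a 'Key ... not found in checkpoint' message."""
    match = re.search(r"Key (\S+) not found in checkpoint", error_message)
    return match.group(1) if match else None


def same_suffix_candidates(key, checkpoint_keys):
    """Return checkpoint keys that match `key` once the first path component is dropped."""
    suffix = key.partition("/")[2]
    return sorted(
        candidate
        for candidate in checkpoint_keys
        if candidate != key and candidate.partition("/")[2] == suffix
    )


message = (
    "tensorflow.python.framework.errors_impl.NotFoundError: "
    "Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint"
)
key = missing_key(message)

# Hypothetical checkpoint contents written by an earlier run.
checkpoint_keys = ["DGP-1b3c5d7e-4/X/dataholder", "DGP-1b3c5d7e-4/Y/dataholder"]
candidates = same_suffix_candidates(key, checkpoint_keys)
```

If `candidates` is non-empty, the variable was most likely saved, only under a prefix generated in a previous run, which would point to the naming problem rather than to a variable missing from the checkpoint.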