Restoring session: dataholder not found in checkpoint #13

maximilianmordig · 2018-04-05T08:32:28Z

I ran the script run_regression.py in the demos folder. When I run the program a second time, it raises the following exception while restoring the session. It seems that not all variables are stored in the checkpoint. Did you encounter this problem as well?

############################ kin8nm L=2 split=0
N: 7372, D: 8, Ns: 820
Restoring session from /home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/Doubly-Stochastic-DGP/demos/Results/tmp_results_maximilian-p50s/kin8nm_L2_split0/checkpoints-5.
Traceback (most recent call last):
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run_regression.py", line 99, in
model.enquire_session(), (s+'/checkpoints').format(dataset_name, L))
File "/home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/gpflow-monitor/gpflow_monitor/opt_tools.py", line 88, in init
self.saver.restore(session, restore_path)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1686, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
File "run_regression.py", line 99, in
model.enquire_session(), (s+'/checkpoints').format(dataset_name, L))
File "/home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/gpflow-monitor/gpflow_monitor/opt_tools.py", line 75, in init
self.saver = tf.train.Saver(max_to_keep=3) if saver is None else saver
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1239, in init
self.build()
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1248, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1284, in _build
build_save=build_save, build_restore=build_restore)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 765, in _build_internal
restore_sequentially, reshape)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 428, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 268, in restore_op
[spec.tensor.dtype])[0])
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1031, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1625, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]


hughsalimbeni · 2018-04-05T10:09:00Z

Thanks for pointing this out. The problem has been identified in this notebook: https://github.com/GPflow/GPflow/blob/monitor/doc/source/notebooks/monitor-tensorboard.ipynb
It is due to a new GPflow naming convention that was introduced to avoid graph collisions. There is a workaround in the notebook. Very soon (O(days)) the Actions class will be updated to include all the gpflow-monitor functionality, so things should work more smoothly.

maximilianmordig · 2018-04-13T16:01:53Z

Thanks for your reply. Did you incorporate this yet? I did not see any remark on how to circumvent this in the above link. What is the temporary workaround?

hughsalimbeni · 2018-04-13T17:30:47Z

The workaround is to pass name to the model constructor, though I haven't tried this myself. I'm eagerly awaiting GPflow/GPflow#660 to be sorted, then it should run smoothly.

maximilianmordig · 2018-04-23T19:44:39Z

I think they solved the issue in GPflow, but they have not yet released a new version, so I prefer using the current GPflow release rather than manually cloning the GPflow git repository.

I solved the problem by adding the name attribute to the model (in the DGPModel class) and to the RBF and White kernels.

Before doing this, I tried the following to control random names:
tf.reset_default_graph()
np.random.seed(0)
tf.set_random_seed(0)
Indeed, the naming was then deterministic, but the restore still complained about a name not being found. I inspected the checkpoint file with inspect_checkpoint.py and, indeed, the name (e.g. Key RBF-992f3d23-7) was not in it. Strangely, however, it appears in the TensorBoard visualization tool (in the browser). I have no idea where TensorBoard gets it from when it is not in the checkpoint.
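For anyone wanting to repeat the comparison described above, here is a small sketch in plain Python that lists which graph variable names have no matching key in a checkpoint. The function name and the example names (apart from the key taken from the traceback) are made up for illustration. In a real session the two lists would have to come from TensorFlow itself, for instance from the `name` attribute of the entries in `tf.global_variables()` and from `tf.train.list_variables(path)` if your TensorFlow version provides it; that part is an assumption and is not shown here.

```python
def missing_from_checkpoint(graph_var_names, checkpoint_keys):
    """Return the graph variable names that have no matching checkpoint key."""
    keys = set(checkpoint_keys)
    missing = []
    for name in graph_var_names:
        # TensorFlow tensor names carry an output suffix such as ":0";
        # checkpoint keys do not, so strip it before comparing.
        key = name.split(":")[0]
        if key not in keys:
            missing.append(key)
    return sorted(missing)


# Illustrative names only. The first graph name is the key from the traceback,
# the others are hypothetical.
graph_names = ["DGP-2ea4f2f5-27/X/dataholder:0", "layer0_kernel/variance:0"]
ckpt_keys = ["layer0_kernel/variance", "DGP-11111111-3/X/dataholder"]

print(missing_from_checkpoint(graph_names, ckpt_keys))
```

Any name this reports is one the saver would fail on with "Key ... not found in checkpoint".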

hughsalimbeni · 2018-05-03T08:06:36Z

Hi, sorry for the slow reply. To run the restore using gpflow_monitor you need to add a name to every class (the kernel, the likelihood, etc.).

hughsalimbeni · 2018-05-03T08:07:45Z

By that I mean: use name='something distinct' in the init of every Parameterized object you create.
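To make the reasoning behind this advice concrete, here is a toy sketch in plain Python. It is not GPflow or TensorFlow code: `auto_name`, `save` and `restore` are hypothetical stand-ins, and the "checkpoint" is just a dict keyed by variable name. It only illustrates the simplified picture suggested in this thread, namely that an auto-generated name with a random-looking suffix differs between runs, while an explicit name gives the same key every time. It does not explain every observation reported above.

```python
import uuid


def auto_name(cls_name):
    # Hypothetical stand-in for an auto-generated name such as "RBF-992f3d23-7":
    # class name plus a random suffix, so it differs between runs.
    return "{}-{}".format(cls_name, uuid.uuid4().hex[:8])


def save(variables):
    # Toy "checkpoint": a plain dict keyed by variable name.
    return dict(variables)


def restore(checkpoint, names):
    # Mimics a saver looking up every graph variable by name.
    for name in names:
        if name not in checkpoint:
            raise KeyError("Key {} not found in checkpoint".format(name))
    return {name: checkpoint[name] for name in names}


# First run: the variable is saved under an auto-generated name.
first_run_name = auto_name("RBF") + "/variance"
checkpoint = save({first_run_name: 1.0})

# Second run: a freshly generated name does not match the saved key.
second_run_name = auto_name("RBF") + "/variance"
try:
    restore(checkpoint, [second_run_name])
    auto_restore_ok = True
except KeyError:
    auto_restore_ok = False

# With an explicit, distinct name the key is identical in both runs.
explicit_name = "layer0_kernel/variance"
checkpoint = save({explicit_name: 1.0})
restored = restore(checkpoint, [explicit_name])
```

The same logic is why each name has to be distinct: two objects sharing a name would map to the same key.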

RomanFoell · 2018-11-01T17:48:14Z

Hello,
could you briefly point out which specific objects in my code below need a name? I have already named the kernels and likelihoods, but not the White kernels, because I got an error when initializing them. What else is missing? Thanks for your help.

# Initialize data, parameters

X = ...
Y = ...
Xs = ...
Ys = ...
mm = ...
...

Z = kmeans2(X, mm, minit='points')[0]

def make_dgp_models(X, Y, Z):
    models, names = [], []
    
    for L in range(1, 4):
        D = X.shape[1]

        # the layer shapes are defined by the kernel dims, so here all hidden layers are D dimensional 
        kernels = []
        for l in range(L):
            if l==0:
                kernels.append(RBF(D, name = str(l) + '_kernel'))
                # between layer noise (doesn't actually make much difference but we include it anyway)
                for kernel in kernels[:-1]:
                    kernel += White(D, variance=1e-5)
            else:
                kernels.append(RBF(5,name = str(l) + '_kernel'))
                # between layer noise (doesn't actually make much difference but we include it anyway)
                for kernel in kernels[:-1]:
                    kernel += White(5, variance=1e-5)

        mb = 10000 if X.shape[0] > 10000 else None
        print(mb)
        model = DGP(X, Y, Z, kernels, Gaussian(name = str(L)), num_samples=5, minibatch_size=mb)

        # start the inner layers almost deterministically 
        for layer in model.layers[:-1]:
            layer.q_sqrt = layer.q_sqrt.value * 1e-3

        models.append(model)
        names.append('DGP{} {}'.format(L, len(Z)))

    return models, names

models_dgp, names_dgp = make_dgp_models(X, Y, Z)

def batch_assess(model, assess_model, X, Y, SIGMA_y):
    n_batches = max(int(X.shape[0]/1000.), 1)
    lik, sq_diff = [], []
    for X_batch, Y_batch in zip(np.array_split(X, n_batches), np.array_split(Y, n_batches)):
        l, sq = assess_model(model, X_batch, Y_batch)
        lik.append(l)
        sq_diff.append(sq)
    lik = np.concatenate(lik, 0)
    sq_diff = np.array(np.concatenate(sq_diff, 0), dtype=float)
    sq_diff = (sq_diff**0.5 * SIGMA_y)**2
    return np.average(lik), np.average(sq_diff)**0.5

S = 50
def assess_sampled(model, X_batch, Y_batch):
    m, v = model.predict_y(X_batch, S)
    S_lik = np.sum(norm.logpdf(Y_batch*Y_std, loc=m*Y_std, scale=Y_std*v**0.5), 2)
    lik = logsumexp(S_lik, 0, b=1/float(S))

    mean = np.average(m, 0)
    sq_diff = Y_std**2*((mean - Y_batch)**2)
    return lik, sq_diff

iterations_few = 50
s = '{:<16}  lik: {:.4f}, rmse: {:.4f}'

for iterations in [iterations_few]:
    print('after {} iterations'.format(iterations))
    for m, name in zip(models_dgp, names_dgp):
        ng_vars = [[m.layers[-1].q_mu, m.layers[-1].q_sqrt]]
        for v in ng_vars[0]:
            v.set_trainable(False)
        tic = time.time()

#        tf.local_variables_initializer()
#        tf.global_variables_initializer()
        tf_graph = m.enquire_graph()
        tf_session = m.enquire_session()
        m.compile(tf_session)

#        ng_action = NatGradOptimizer(gamma=0.1).make_optimize_action(m, var_list=ng_vars)
#        adam_action = AdamOptimizer(0.1).make_optimize_action(m)

#        Loop([ng_action, adam_action], stop=iterations)()
#        lik, rmse = batch_assess(m, assess_sampled, Xs, Ys,SIGMA_y)
        toc = time.time()
        print('training-time:',toc-tic)
        saver = tf.train.Saver()

#        save_path = saver.save(tf_session, "/Doubly-Stochastic-DGP-master/model.ckpt")
#        print("Model saved")
        save_path = saver.restore(tf_session, "/Doubly-Stochastic-DGP-master/model.ckpt")
        print("Model loaded")

hughsalimbeni · 2018-11-03T14:05:37Z

What is the error you get? I'm not familiar with this approach, but I suspect you need to pass a name when you create the model.

maximilianmordig changed the title ~~Restoring session~~ Restoring session: dataholder not found in checkpoint Apr 5, 2018
