Restoring session: dataholder not found in checkpoint #13

Open
maximilianmordig opened this issue Apr 5, 2018 · 8 comments
@maximilianmordig

maximilianmordig commented Apr 5, 2018

I tried to run the script run_regression.py in the demos folder. When I run the program a second time, I get the following exception while the session is being restored. It seems that not all variables were stored in the checkpoint. Have you encountered this problem as well?

############################ kin8nm L=2 split=0
N: 7372, D: 8, Ns: 820
Restoring session from /home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/Doubly-Stochastic-DGP/demos/Results/tmp_results_maximilian-p50s/kin8nm_L2_split0/checkpoints-5.
Traceback (most recent call last):
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run_regression.py", line 99, in <module>
model.enquire_session(), (s+'/checkpoints').format(dataset_name, L))
File "/home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/gpflow-monitor/gpflow_monitor/opt_tools.py", line 88, in __init__
self.saver.restore(session, restore_path)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1686, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
File "run_regression.py", line 99, in <module>
model.enquire_session(), (s+'/checkpoints').format(dataset_name, L))
File "/home/maximilian/Desktop/FisyMat/TrabajoMaster/Coding/gpflow-monitor/gpflow_monitor/opt_tools.py", line 75, in __init__
self.saver = tf.train.Saver(max_to_keep=3) if saver is None else saver
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1239, in __init__
self.build()
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1248, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1284, in _build
build_save=build_save, build_restore=build_restore)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 765, in _build_internal
restore_sequentially, reshape)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 428, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 268, in restore_op
[spec.tensor.dtype])[0])
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1031, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/home/maximilian/anaconda3/envs/testNew/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key DGP-2ea4f2f5-27/X/dataholder not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

@maximilianmordig changed the title from "Restoring session" to "Restoring session: dataholder not found in checkpoint" on Apr 5, 2018
@hughsalimbeni
Collaborator

Thanks for pointing this out. The problem has been identified here: https://github.com/GPflow/GPflow/blob/monitor/doc/source/notebooks/monitor-tensorboard.ipynb
It is due to a new GPflow naming convention that avoids graph collisions. There is a workaround in the notebook. Very soon (O(days)) the Actions class will be updated to include all the gpflow-monitor functionality, so things should work more smoothly.
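
To make the failure mode concrete, here is a minimal, library-free sketch of a name-based restore. It assumes, based only on the key in the traceback (DGP-2ea4f2f5-27/X/dataholder), that an unnamed object gets a scope built from its class name, a random hex fragment, and a running index. The helpers auto_name, build_variables, save, and restore are hypothetical stand-ins for illustration, not GPflow or TensorFlow functions.

```python
import itertools
import uuid

_counter = itertools.count()


def auto_name(cls_name):
    # Assumed scheme: class name + random hex fragment + running index.
    return "{}-{}-{}".format(cls_name, uuid.uuid4().hex[:8], next(_counter))


def build_variables(model_name=None):
    # Stand-in for graph construction: returns {variable name: value}.
    scope = model_name if model_name is not None else auto_name("DGP")
    return {scope + "/X/dataholder": 0.0}


def save(variables):
    # A "checkpoint" here is just a copy of the name -> value mapping.
    return dict(variables)


def restore(variables, checkpoint):
    # Mimics a name-based restore: every graph key must exist in the checkpoint.
    for key in variables:
        if key not in checkpoint:
            raise KeyError("Key {} not found in checkpoint".format(key))
        variables[key] = checkpoint[key]


# First run: unnamed model, checkpoint written under a random scope.
ckpt_unnamed = save(build_variables())

# Second run: a fresh unnamed model gets a different scope, so restore fails.
try:
    restore(build_variables(), ckpt_unnamed)
    unnamed_ok = True
except KeyError:
    unnamed_ok = False

# With an explicit, stable name both runs agree on the key.
ckpt_named = save(build_variables(model_name="dgp_kin8nm_L2"))
restore(build_variables(model_name="dgp_kin8nm_L2"), ckpt_named)
named_ok = True
```

In this toy model the unnamed restore fails and the named one succeeds, which is the behaviour that passing an explicit name is meant to achieve.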

@maximilianmordig
Author

Thanks for your reply. Did you incorporate this yet? I did not see any remark on how to circumvent this in the above link. What is the temporary workaround?

@hughsalimbeni
Collaborator

The workaround is to pass name to the model constructor, though I haven't tried this myself. I'm eagerly awaiting GPflow/GPflow#660 to be sorted, then it should run smoothly.

@maximilianmordig
Author

I think they solved the issue in GPflow, but they have not yet released a new version, so I prefer to keep using the current GPflow release rather than manually cloning the GPflow repository.

I solved the problem by adding the name attribute to the Model (in the DGPModel class) and to the RBF and White kernels.

Before doing this, I tried the following to control random names:
tf.reset_default_graph()
np.random.seed(0)
tf.set_random_seed(0)
Indeed, the naming was then deterministic, but the restore still complained about the name not being found. I inspected the checkpoint file with inspect_checkpoint.py and, indeed, the name (e.g. the key RBF-992f3d23-7) was not in it. Strangely, however, it appears in the TensorBoard visualization tool (in the browser). I have no idea where TensorBoard gets it from when it is not in the checkpoint.
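
One way to see exactly which scopes disagree is to diff the names the graph asks for against the names stored in the checkpoint. The sketch below works on plain lists of strings. The two name lists are made up for illustration, and the helper names missing_keys and top_level_scopes are hypothetical.

```python
def missing_keys(graph_names, checkpoint_names):
    """Return graph variable names that have no entry in the checkpoint, sorted."""
    return sorted(set(graph_names) - set(checkpoint_names))


def top_level_scopes(names):
    """Collect the first path component of each name, e.g. 'RBF-992f3d23-7'."""
    return sorted({name.split("/", 1)[0] for name in names})


# Hypothetical name lists, for illustration only. In TensorFlow 1.x the
# checkpoint side could come from tf.train.list_variables(checkpoint_path)
# (if your version provides it) and the graph side from
# [v.op.name for v in tf.global_variables()].
checkpoint_names = ["RBF-11111111-7/variance", "RBF-11111111-7/lengthscales"]
graph_names = ["RBF-992f3d23-7/variance", "RBF-992f3d23-7/lengthscales"]

print(missing_keys(graph_names, checkpoint_names))
print(top_level_scopes(checkpoint_names), top_level_scopes(graph_names))
```

Comparing only the top-level scopes is usually enough to tell whether the mismatch is in the generated object names or in individual parameters.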

@hughsalimbeni
Collaborator

Hi, sorry for the slow reply. To run the restore using gpflow_monitor you need to add a name to every class (the kernel, the likelihood, etc.).

@hughsalimbeni
Collaborator

By that I mean: pass name='something distinct' to the __init__ of every Parameterized object you create.
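
A small helper can keep such names distinct and reproducible across runs. The following is only a sketch: NameRegistry and the naming pattern are hypothetical, and the strings it produces would still have to be passed as name=... to each constructor (model, kernels, likelihood) by hand.

```python
class NameRegistry:
    """Hands out explicit, deterministic names and refuses duplicates."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._used = set()

    def make(self, *parts):
        # Join the prefix and all parts with underscores, e.g. dgp_L2_layer_0_rbf.
        name = "_".join([self.prefix] + [str(p) for p in parts])
        if name in self._used:
            raise ValueError("duplicate name: {}".format(name))
        self._used.add(name)
        return name


names = NameRegistry("dgp_L2")
model_name = names.make("model")
kernel_names = [names.make("layer", l, "rbf") for l in range(2)]
white_names = [names.make("layer", l, "white") for l in range(1)]
likelihood_name = names.make("gaussian")

print(model_name, kernel_names, white_names, likelihood_name)
```

Because the names depend only on the prefix and the layer index, a second run of the same script produces the same strings, which is what a name-based restore needs.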

@RomanFoell

RomanFoell commented Nov 1, 2018

Hello,
Could you briefly explain which specific objects need a name in my code below? I have already named the kernels and the likelihoods, but not the White kernels, because I got an error when initializing them. What else needs one? Thanks for your help.

# Initialize data, parameters

X = ...
Y = ...
Xs = ...
Ys = ...
mm = ...
...

Z = kmeans2(X, mm, minit='points')[0]

def make_dgp_models(X, Y, Z):
    models, names = [], []
    
    for L in range(1, 4):
        D = X.shape[1]

        # the layer shapes are defined by the kernel dims, so here all hidden layers are D dimensional 
        kernels = []
        for l in range(L):
            if l==0:
                kernels.append(RBF(D, name = str(l) + '_kernel'))
                # between layer noise (doesn't actually make much difference but we include it anyway)
                for kernel in kernels[:-1]:
                    kernel += White(D, variance=1e-5)
            else:
                kernels.append(RBF(5,name = str(l) + '_kernel'))
                # between layer noise (doesn't actually make much difference but we include it anyway)
                for kernel in kernels[:-1]:
                    kernel += White(5, variance=1e-5)

        mb = 10000 if X.shape[0] > 10000 else None
        print(mb)
        model = DGP(X, Y, Z, kernels, Gaussian(name = str(L)), num_samples=5, minibatch_size=mb)

        # start the inner layers almost deterministically 
        for layer in model.layers[:-1]:
            layer.q_sqrt = layer.q_sqrt.value * 1e-3

        models.append(model)
        names.append('DGP{} {}'.format(L, len(Z)))

    return models, names

models_dgp, names_dgp = make_dgp_models(X, Y, Z)

def batch_assess(model, assess_model, X, Y, SIGMA_y):
    n_batches = max(int(X.shape[0]/1000.), 1)
    lik, sq_diff = [], []
    for X_batch, Y_batch in zip(np.array_split(X, n_batches), np.array_split(Y, n_batches)):
        l, sq = assess_model(model, X_batch, Y_batch)
        lik.append(l)
        sq_diff.append(sq)
    lik = np.concatenate(lik, 0)
    sq_diff = np.array(np.concatenate(sq_diff, 0), dtype=float)
    sq_diff = (sq_diff**0.5 * SIGMA_y)**2
    return np.average(lik), np.average(sq_diff)**0.5

S = 50
def assess_sampled(model, X_batch, Y_batch):
    m, v = model.predict_y(X_batch, S)
    S_lik = np.sum(norm.logpdf(Y_batch*Y_std, loc=m*Y_std, scale=Y_std*v**0.5), 2)
    lik = logsumexp(S_lik, 0, b=1/float(S))

    mean = np.average(m, 0)
    sq_diff = Y_std**2*((mean - Y_batch)**2)
    return lik, sq_diff

iterations_few = 50
s = '{:<16}  lik: {:.4f}, rmse: {:.4f}'

for iterations in [iterations_few]:
    print('after {} iterations'.format(iterations))
    for m, name in zip(models_dgp, names_dgp):
        ng_vars = [[m.layers[-1].q_mu, m.layers[-1].q_sqrt]]
        for v in ng_vars[0]:
            v.set_trainable(False)
        tic = time.time()

#        tf.local_variables_initializer()
#        tf.global_variables_initializer()
        tf_graph = m.enquire_graph()
        tf_session = m.enquire_session()
        m.compile(tf_session)

#        ng_action = NatGradOptimizer(gamma=0.1).make_optimize_action(m, var_list=ng_vars)
#        adam_action = AdamOptimizer(0.1).make_optimize_action(m)

#        Loop([ng_action, adam_action], stop=iterations)()
#        lik, rmse = batch_assess(m, assess_sampled, Xs, Ys,SIGMA_y)
        toc = time.time()
        print('training-time:',toc-tic)
        saver = tf.train.Saver()

#        save_path = saver.save(tf_session, "/Doubly-Stochastic-DGP-master/model.ckpt")
#        print("Model saved")
        save_path = saver.restore(tf_session, "/Doubly-Stochastic-DGP-master/model.ckpt")
        print("Model loaded")
     

@hughsalimbeni
Collaborator

What is the error you get? I'm not familiar with this approach, but I suspect you need to pass a name when you create the model.
