regression: output is nonsense with latest commit and CUDA support enabled #7451

Closed
enolan opened this issue May 22, 2024 · 7 comments · Fixed by #7452

enolan commented May 22, 2024

On 201cc11, I get gibberish output when sampling from Llama-3-8B quantized with Q5_K_M (same behavior with Q8_0, F16, F32, and Q4_K_M). This happens when llama.cpp is built with CUDA support, but not without. I'm building these with Nix. Here's an example output:

enolan@chonk ~/j/llama.cpp (master)> ./result-cuda-201cc11a/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64                                                                                                                                         

Log start                                                                                                                                                                                                                                     
main: build = 0 (unknown)                                                                                                                                                                                                                     
main: built with gcc (GCC) 12.3.0 for x86_64-unknown-linux-gnu                                                         
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))                                                                               
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                      
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: n_ff             = 14336                                                                          
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                                                                                                                                                     
llm_load_print_meta: causal attn      = 1                                                                              
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear                                                                         
llm_load_print_meta: freq_base_train  = 500000.0                                                                       
llm_load_print_meta: freq_scale_train = 1                                                                              
llm_load_print_meta: n_yarn_orig_ctx  = 8192                                                                           
llm_load_print_meta: rope_finetuned   = unknown                                                                        
llm_load_print_meta: ssm_d_conv       = 0                                                                              
llm_load_print_meta: ssm_d_inner      = 0                                                                              
llm_load_print_meta: ssm_d_state      = 0                                                                              
llm_load_print_meta: ssm_dt_rank      = 0                                                                              
llm_load_print_meta: model type       = 8B                                                                             
llm_load_print_meta: model ftype      = Q5_K - Medium                                                                  
llm_load_print_meta: model params     = 8.03 B                                                                         
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)                                          
llm_load_print_meta: general.name     = Meta-Llama-3-8B                                                                                                                                                                                       
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'                                                                                                                                                                            
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'                                                                                                                                                                              
llm_load_print_meta: LF token         = 128 'Ä'                                                                        
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'                                       
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no                                                                              
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes   
ggml_cuda_init: found 1 CUDA devices:        
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB                                                                          
llm_load_tensors: offloading 0 repeating layers to GPU  
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356
                                                           
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0


<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather. Annapolis, a town in Maryland, has the highest concentration of naval officers in the US. It was once home to the US Naval Academy, the most prominent naval academy in the country. In 2011, the academy was moved to Washington, DC, and the Naval Academy has since been renamed as the Naval
llama_print_timings:        load time =     453.35 ms
llama_print_timings:      sample time =       5.42 ms /    64 runs   (    0.08 ms per token, 11816.84 tokens per second)
llama_print_timings: prompt eval time =     562.33 ms /    48 tokens (   11.72 ms per token,    85.36 tokens per second)
llama_print_timings:        eval time =    9444.97 ms /    63 runs   (  149.92 ms per token,     6.67 tokens per second)
llama_print_timings:       total time =   10052.76 ms /   111 tokens
Log end

The output starts talking about Annapolis, Maryland for some reason, instead of fabric. Other seeds also produce nonsense, either gibberish or a nonsensical change of topic. In contrast, the CPU-only build is fine:

enolan@chonk ~/j/llama.cpp (master)> ./result-cpuonly-201cc11a/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64               
Log start                                                                                                                                                                                                                                     
main: build = 0 (unknown)                                                                                                                                                                                                                     
main: built with gcc (GCC) 13.2.0 for x86_64-unknown-linux-gnu                                                                                                                                                                                
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))                                                                               
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                      
llama_model_loader: - kv   0:                       general.architecture str              = llama                      
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B            
llama_model_loader: - kv   2:                          llama.block_count u32              = 32                         
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192                       
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096                       
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336                      
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32                         
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8                          
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000              
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                   
llama_model_loader: - kv  10:                          general.file_type u32              = 17                         
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256                     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128                        
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2                       
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe                  
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...                                                                                                          
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...                                                                                                          
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...                                                                                                                    
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000                     
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00                                                                        
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: n_ff             = 14336                                                                                                                                                                                                 
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                                                                                                                                                     
llm_load_print_meta: causal attn      = 1                                                                              
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear                                                                         
llm_load_print_meta: freq_base_train  = 500000.0                                                                       
llm_load_print_meta: freq_scale_train = 1                                                                              
llm_load_print_meta: n_yarn_orig_ctx  = 8192                                                                           
llm_load_print_meta: rope_finetuned   = unknown                                                                        
llm_load_print_meta: ssm_d_conv       = 0                                                                              
llm_load_print_meta: ssm_d_inner      = 0                                                                              
llm_load_print_meta: ssm_d_state      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_dt_rank      = 0                                                                                                                                                                                                     
llm_load_print_meta: model type       = 8B                                                                                                                                                                                                    
llm_load_print_meta: model ftype      = Q5_K - Medium                                                                  
llm_load_print_meta: model params     = 8.03 B                                                                                                                                                                                                
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)                                                                                                                                                                                   
llm_load_print_meta: general.name     = Meta-Llama-3-8B                                                                                                                                                                                       
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'                                                                                                                                                                            
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'                                                                                                                                                                              
llm_load_print_meta: LF token         = 128 'Ä'                                                                                                                                                                                               
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'                                                            
llm_load_tensors: ggml ctx size =    0.15 MiB                                                                          
llm_load_tensors:        CPU buffer size =  5459.93 MiB                                                                                                                                                                                       
.........................................................................................                                                                                                                                                     
llama_new_context_with_model: n_ctx      = 512       
llama_new_context_with_model: n_batch    = 512                                                                         
llama_new_context_with_model: n_ubatch   = 512                                                                         
llama_new_context_with_model: flash_attn = 0                                                                           
llama_new_context_with_model: freq_base  = 500000.0                                                                                                                                                                                           
llama_new_context_with_model: freq_scale = 1                                                                           
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB                                                          
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB                  
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB                                            
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB                                            
llama_new_context_with_model: graph nodes  = 1030                                                                                                                                                                                             
llama_new_context_with_model: graph splits = 1                                                                                                                                                                                                
                                                                                                                       
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |                                                                                                                                                                                          
sampling:                                                                                                                                                                                                                                     
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000                                                                                                                                       
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800                       
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000                                                        
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature                                       
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0                                                                                                                                                                             
                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                              
<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.  '''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.                                                                           
+ Seersucker fabrics are woven with extra threads of yarn, which are left                                              
llama_print_timings:        load time =     380.38 ms                                                                  
llama_print_timings:      sample time =       4.86 ms /    64 runs   (    0.08 ms per token, 13179.57 tokens per second)
llama_print_timings: prompt eval time =    1803.44 ms /    48 tokens (   37.57 ms per token,    26.62 tokens per second)
llama_print_timings:        eval time =    9420.80 ms /    63 runs   (  149.54 ms per token,     6.69 tokens per second)
llama_print_timings:       total time =   11269.28 ms /   111 tokens                                                   
Log end                                                                                                                
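The CUDA versus CPU-only comparison above can be scripted so that other seeds or commits are checked the same way. What follows is a minimal sketch rather than anything from the report itself: the helper names (`build_command`, `continuation`, `first_divergence`, `compare_builds`) are hypothetical, only the `-m`, `-p`, `-s` and `-n` flags seen in the commands above are used, and it assumes the generated text is written to stdout while the log lines go elsewhere.

```python
import subprocess


def build_command(binary, model, prompt, seed=1, n_predict=64):
    """Build the argument list using only flags that appear in the report."""
    return [binary, "-m", model, "-p", prompt, "-s", str(seed), "-n", str(n_predict)]


def continuation(stdout_text, prompt):
    """Return the text after the echoed prompt, or everything if no echo is found."""
    idx = stdout_text.find(prompt)
    if idx == -1:
        return stdout_text.strip()
    return stdout_text[idx + len(prompt):].strip()


def first_divergence(a, b):
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the common prefix: either equal, or one string is longer.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def run_build(binary, model, prompt, seed=1, n_predict=64):
    """Run one build and return its continuation (assumes output text is on stdout)."""
    result = subprocess.run(
        build_command(binary, model, prompt, seed, n_predict),
        capture_output=True,
        text=True,
        check=True,
    )
    return continuation(result.stdout, prompt)


def compare_builds(builds, model, prompt, seed=1, n_predict=64):
    """Run every build in `builds` (label -> binary path) and print each continuation."""
    outputs = {}
    for label, binary in builds.items():
        outputs[label] = run_build(binary, model, prompt, seed, n_predict)
        print(f"[{label}] {outputs[label]!r}")
    return outputs
```

For the two runs shown so far, `builds` would map labels such as `"cuda"` and `"cpu"` to `./result-cuda-201cc11a/bin/llama` and `./result-cpuonly-201cc11a/bin/llama`. Small floating-point differences between backends may make the continuations diverge even when nothing is broken, so `first_divergence` only points at where to start reading; judging whether the text still makes sense remains a manual step.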

The CPU-only output repeats itself, but it at least makes sense. 6369bf0 (the previous commit) is fine for CUDA (and CPU):

enolan@chonk ~/j/llama.cpp (master)> ./result-cuda-6369bf04/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64
Log start
main: build = 0 (unknown)                      
main: built with gcc (GCC) 12.3.0 for x86_64-unknown-linux-gnu                                                         
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama                      
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B            
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000                                                                                                                                     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                                                                                                                                          
llama_model_loader: - kv  10:                          general.file_type u32              = 17                                                                                                                                                
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256                     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2                                                                                                                                              
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336                                                                          
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                              
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown           
llm_load_print_meta: ssm_d_conv       = 0            
llm_load_print_meta: ssm_d_inner      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_d_state      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_dt_rank      = 0                                                                                                                                                                                                     
llm_load_print_meta: model type       = 8B                                                                             
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 8.03 B                                                                                                                                                                                                
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)  
llm_load_print_meta: general.name     = Meta-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>' 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0


<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.  '''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.
+ Seersucker fabrics are woven with extra "bunching" yarns
llama_print_timings:        load time =     453.15 ms
llama_print_timings:      sample time =       5.00 ms /    64 runs   (    0.08 ms per token, 12812.81 tokens per second)
llama_print_timings: prompt eval time =     560.83 ms /    48 tokens (   11.68 ms per token,    85.59 tokens per second)
llama_print_timings:        eval time =    9432.22 ms /    63 runs   (  149.72 ms per token,     6.68 tokens per second)
llama_print_timings:       total time =   10037.80 ms /   111 tokens
Log end
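
As a sanity check on the log above, the reported KV cache size can be reproduced from the printed hyperparameters. The sketch below assumes that K and V each store `n_ctx * n_layer * n_embd_{k,v}_gqa` elements at 2 bytes per element (f16). That formula is inferred from the logged numbers, not taken from the llama.cpp source, and the function name `kv_cache_mib` is made up for illustration.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Return (K, V, total) KV cache sizes in MiB.

    Assumption: one K row and one V row per layer per context position,
    stored as f16 (2 bytes per element).
    """
    mib = 1024 * 1024
    k = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem / mib
    v = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem / mib
    return k, v, k + v


# Values copied from the log above: n_ctx = 512, n_layer = 32,
# n_embd_k_gqa = n_embd_v_gqa = 1024.
k, v, total = kv_cache_mib(n_ctx=512, n_layer=32, n_embd_k_gqa=1024, n_embd_v_gqa=1024)
print(f"K: {k:.2f} MiB, V: {v:.2f} MiB, total: {total:.2f} MiB")
# prints "K: 32.00 MiB, V: 32.00 MiB, total: 64.00 MiB"
```

These figures agree with the `KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB` line in the log.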
@justinsteven

I can reproduce on 201cc11

Startup
Log start
main: build = 2961 (201cc11a)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 1716356810
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/bartowski/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/Meta-Llama-3-8B-Instruct-GGUF...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/groups_merged.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 88
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' = llama3
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.33 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '<|eot_id|>'
Reverse prompt: '### Instruction:

'
Input prefix: '
<|start_header_id|>user<|end_header_id|>

'
Input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
<|begin_of_text|>
>
<|start_header_id|>user<|end_header_id|>

hello llama
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello, I am i am looking for a little help me<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>

I know llama. I know.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello, what can you want to help me<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>
<|begin_of_text|>
>
<|start_header_id|>user<|end_header_id|>

hello llama
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'm not found this is a friendly assistant

I can help me
What are you'request


Hello there.<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>

what can you help with
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'do<|eot_id|>

>
<|start_header_id|>user<|end_header_id|>

@YannFollet
Contributor

Same for me, as described in #7450.

@duynt575

Yes, I also get gibberish output. An older version such as b2953 works normally with the same GGUF file. The latest CUDA build generates gibberish, while Vulkan works fine.

@wcde

wcde commented May 22, 2024

I can confirm: after #7225, generation is completely broken. I checked it on CPU, a 4090, and a P40, with different models. I tried b2961 and tried FORCE_MMQ, with and without FA. Nothing works. It's sad that we don't have proper automated tests.
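
One way such an automated check could look is sketched below: run the same fixed-seed prompt through a reference build and a candidate build, then require the two outputs to share a long common prefix. This is only a sketch under stated assumptions. The flags `-m`, `-p`, `-s`, and `-n` are the ones used in the commands quoted in this issue; the helper names, the binary paths, and the `min_prefix` threshold are hypothetical. CPU and CUDA builds are not guaranteed to produce identical text even when both are correct, so the threshold would need tuning.

```python
import subprocess


def build_command(binary, model, prompt, seed=1, n_predict=64):
    """Build a CLI invocation using the flags quoted in this issue."""
    return [binary, "-m", model, "-p", prompt, "-s", str(seed), "-n", str(n_predict)]


def run_build(cmd):
    """Run one build and return its stdout (the generated text)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def outputs_agree(reference, candidate, min_prefix):
    """Hypothetical pass criterion: outputs share at least min_prefix characters."""
    return common_prefix_len(reference, candidate) >= min_prefix


if __name__ == "__main__":
    # Hypothetical paths; substitute a known-good build and the build under test.
    prompt = "'''Seersucker''' or '''railroad stripe''' is a thin, puckered fabric."
    ref = run_build(build_command("./result-cpu/bin/llama", "model.gguf", prompt))
    new = run_build(build_command("./result-cuda/bin/llama", "model.gguf", prompt))
    # The prompt is echoed in the output, so require agreement beyond it.
    print("PASS" if outputs_agree(ref, new, len(prompt) + 40) else "FAIL")
```

The prefix comparison is deliberately crude: it would flag the kind of breakage reported here, where the output diverges into nonsense almost immediately, but it says nothing about subtle quality changes.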

@ggerganov
Member

Check if #7452 fixes the issue

@ChryGigio

Looks good on my end after cherry-picking #7452 into master:

./llama -m /mnt/ssd/ai/txtgen/models/gguf/neopolita_meta-llama-3-8b-instruct_q6_k.gguf -ngl 100 -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64

----

<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather. Seersucker was originally an Indian cotton fabric, brought to the United States by [[colonial American]]s and [[Native Americans]]. It is typically white or light-colored with a contrasting stripe or check pattern.
Seersucker fabric is known for its unique texture, which is created by interlocking loops of yarn, which
llama_print_timings:        load time =    1162.67 ms
llama_print_timings:      sample time =       6.00 ms /    64 runs   (    0.09 ms per token, 10663.11 tokens per second)
llama_print_timings: prompt eval time =     125.71 ms /    48 tokens (    2.62 ms per token,   381.82 tokens per second)
llama_print_timings:        eval time =    1552.75 ms /    63 runs   (   24.65 ms per token,    40.57 tokens per second)
llama_print_timings:       total time =    1728.66 ms /   111 tokens

@eamonnmag

I believe this issue is still present in the latest releases.
It is more prevalent when multiple people connect to the same instance.

For now I've gone back to ecab1c7, which works as before; I really only need the new /health endpoint.
