Merge upstream (experimental) #2

Merged · pi6am merged 207 commits into pi6am:concedo_experimental from LostRuins:concedo_experimental on Jul 9, 2024
Conversation
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <slarengh@gmail.com>
Signed-off-by: thxCode <thxcode0824@gmail.com>
On hosts that are not set up to run CUDA code, it is still possible to compile llama.cpp with CUDA support by installing only the development packages. However, the runtime libraries such as /usr/lib64/libcuda.so* are missing, so the link step currently fails. The development environment is prepared for such situations: stub libraries for all the CUDA libraries are available in the $(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end of the library search path changes nothing for environments that already work, but enables compiling llama.cpp even when the runtime libraries are not available.
* Only use FIM middle if it exists
* Only use FIM middle if it exists
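The guard behind these two commits is simple in spirit; here is a minimal sketch of the idea, with a hypothetical stand-in for the real llama.cpp vocab lookup (the function name and sentinel are assumptions, not the actual API):

```cpp
#include <vector>

// Hypothetical stand-in for the real vocab lookup: returns the FIM middle
// token id, or -1 when the model's vocabulary does not define one.
static int fim_middle_token() { return -1; }

// Append the FIM middle token to an infill prompt only when it exists,
// instead of unconditionally pushing an undefined token id.
static std::vector<int> finish_infill_prompt(std::vector<int> tokens) {
    const int mid = fim_middle_token();
    if (mid >= 0) {
        tokens.push_back(mid); // only use FIM middle if it exists
    }
    return tokens;
}
```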
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
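As a rough illustration of the flag plumbing described above (the struct and names below are hypothetical simplifications, not the actual llama.cpp types): moving the add_bos/add_eos flags into the vocab structure means every tokenizer built from that vocab behaves consistently, instead of each call site deciding on its own.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the refactor: the vocab owns the special-token
// flags (read once from the model metadata), with defaults as fallback.
struct vocab_t {
    int32_t special_bos_id  = 1;
    int32_t special_eos_id  = 2;
    bool    special_add_bos = true;   // default value, may be overridden
    bool    special_add_eos = false;  // default value, may be overridden
};

std::vector<int32_t> tokenize(const vocab_t & vocab, const std::string & text) {
    std::vector<int32_t> out;
    if (vocab.special_add_bos) {
        out.push_back(vocab.special_bos_id); // prepend BOS once, per vocab flag
    }
    // ... BPE merges over `text` would go here ...
    if (vocab.special_add_eos) {
        out.push_back(vocab.special_eos_id);
    }
    return out;
}
```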
* separate lower precision GEMM from the main files
* fix hardcoded workgroup size
* un-ignore `build-info.cmake` and `build-info.sh`

  I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent for the purpose of publishing, even if they're committed. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

  For the same reason as the previous two files.

* Reorganize `.gitignore`
* Add exceptions for files mentioned by @slaren

  I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for readability
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed
* Remove `.clang-tidy` from `.gitignore`

  Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether the types of the first few sources are BF16 and returns false if so.
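A hedged sketch of the described guard (not the actual Metal backend source, just the shape of the check against real ggml types):

```cpp
#include "ggml.h"

// Reject an op whose sources are BF16, since the Metal backend cannot
// execute it and would otherwise crash at dispatch time.
static bool supports_op_sketch(const struct ggml_tensor * op) {
    for (int i = 0; i < GGML_MAX_SRC && op->src[i] != NULL; ++i) {
        if (op->src[i]->type == GGML_TYPE_BF16) {
            return false; // unsupported on Metal
        }
    }
    return true; // the real function performs many more per-op checks
}
```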
* CUDA: stream-k decomposition for MMQ
* fix undefined memory reads for small matrices
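For context, stream-k is a scheduling strategy rather than a new math kernel. The toy sketch below (illustrative host-side C++, not the actual CUDA kernel; all numbers are made up) shows the decomposition idea: total work is tiles times K-iterations per tile, split into equal contiguous slices so that small matrices no longer leave blocks idle.

```cpp
#include <cstdio>

int main() {
    // Toy sizes: 7 output tiles, 10 K-iterations each, 4 blocks.
    const int n_tiles = 7, iters_per_tile = 10, n_blocks = 4;
    const int total = n_tiles * iters_per_tile;

    for (int b = 0; b < n_blocks; ++b) {
        // Each block takes an equal contiguous slice of the total work.
        const int begin = (total *  b)      / n_blocks;
        const int end   = (total * (b + 1)) / n_blocks;
        // A slice may start or end mid-tile; partial tile results are
        // combined in a fix-up pass afterwards.
        printf("block %d: iterations [%d, %d), tiles %d..%d\n",
               b, begin, end,
               begin / iters_per_tile, (end - 1) / iters_per_tile);
    }
    return 0;
}
```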
* add sycl preset
* fix debug link error; fix windows crash
* update README
* common: fix warning
* Update common/common.cpp

  Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
* create append_pooling operation; allow specifying attention_type; add last-token pooling; update examples
* find result_norm/result_embd tensors properly; update output allocation logic
* only use embd output for pooling_type NONE
* get rid of old causal_attn accessor
* take out attention_type; add in llama_set_embeddings
* bypass logits when doing non-NONE pooling
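Last-token pooling itself is a small operation; a minimal sketch under assumed names and a row-major layout (not the actual llama.cpp tensors):

```cpp
#include <cstddef>
#include <vector>

// With per-token embeddings stored row-major as [n_tokens][n_embd],
// last-token pooling returns the final row as the sequence embedding.
static std::vector<float> pool_last(const std::vector<float> & embd,
                                    int n_tokens, int n_embd) {
    const float * last = embd.data() + (std::size_t)(n_tokens - 1) * n_embd;
    return std::vector<float>(last, last + n_embd);
}
```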
* initial iq4_xs
* fix ci
* iq4_nl
* iq1_m
* iq1_s
* iq2_xxs
* iq3_xxs
* iq2_s
* iq2_xs
* iq3_s before sllv
* iq3_s
* iq3_s small fix
* iq3_s sllv can be safely replaced with sse multiply
…ml-org#8022)

* vulkan: detect multiple devices by deviceUUID instead of deviceID
* vulkan: remove unneeded variables
* vulkan: fix id query
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/server.yml
#	.gitignore
#	CMakeLists.txt
#	Makefile
#	README-sycl.md
#	README.md
#	llama.cpp
#	requirements/requirements-convert-hf-to-gguf-update.txt
#	requirements/requirements-convert-hf-to-gguf.txt
#	requirements/requirements-convert-legacy-llama.txt
#	scripts/sync-ggml.last
#	tests/test-tokenizer-random.py
* Adding a simple bare-bones end-to-end integration test for JSON validation against auto-generated JSON-schema grammars.
* Adding additional examples as documented in ggml-org#7789. Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.
* Merging improved schema test methods added by @ochafik in ggml-org#7797
* Adding a #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.
* Fixing nits from ochafik: removing escape slashes, adding additional failing cases, fixing some other strings.
* Fixing grammar indentation to be consistent throughout the file.
…alues (ggml-org#8058)

Uses the values computed by @JohannesGaessler in PR ggml-org#7413
* py : switch to snake_case

  ggml-ci

* cont

  ggml-ci

* cont

  ggml-ci

* cont : fix link
* gguf-py : use snake_case in scripts entrypoint export
* py : rename requirements for convert_legacy_llama.py

  Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
* fix group_norm ut
* split softmax
* fix softmax
* add concat support condition
* revert debug code
* move QK_WARP_SIZE to presets.hpp
* passkey : add short intro to README.md [no-ci]

  This commit adds a short introduction to the README.md file in the examples/passkey directory.

  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update examples/passkey/README.md

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The change to the launch_bounds was causing a small performance drop of about 25 t/s when computing perplexity.
* llama : minor indentation during tensor loading

  ggml-ci

* llama : use int for layer iterators [no ci]
* re-organize docs
* add link among docs
* add link to build docs
* fix style
* de-duplicate sections
# Conflicts:
#	.github/ISSUE_TEMPLATE/config.yml
#	.gitignore
#	CMakeLists.txt
#	CONTRIBUTING.md
#	Makefile
#	README.md
#	ci/run.sh
#	common/common.h
#	examples/main-cmake-pkg/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	models/ggml-vocab-bert-bge.gguf.inp
#	models/ggml-vocab-bert-bge.gguf.out
#	models/ggml-vocab-deepseek-coder.gguf.inp
#	models/ggml-vocab-deepseek-coder.gguf.out
#	models/ggml-vocab-deepseek-llm.gguf.inp
#	models/ggml-vocab-deepseek-llm.gguf.out
#	models/ggml-vocab-falcon.gguf.inp
#	models/ggml-vocab-falcon.gguf.out
#	models/ggml-vocab-gpt-2.gguf.inp
#	models/ggml-vocab-gpt-2.gguf.out
#	models/ggml-vocab-llama-bpe.gguf.inp
#	models/ggml-vocab-llama-bpe.gguf.out
#	models/ggml-vocab-llama-spm.gguf.inp
#	models/ggml-vocab-llama-spm.gguf.out
#	models/ggml-vocab-mpt.gguf.inp
#	models/ggml-vocab-mpt.gguf.out
#	models/ggml-vocab-phi-3.gguf.inp
#	models/ggml-vocab-phi-3.gguf.out
#	models/ggml-vocab-starcoder.gguf.inp
#	models/ggml-vocab-starcoder.gguf.out
#	requirements.txt
#	requirements/requirements-convert_legacy_llama.txt
#	scripts/check-requirements.sh
#	scripts/pod-llama.sh
#	src/CMakeLists.txt
#	src/llama.cpp
#	tests/test-rope.cpp
(cherry picked from commit 572aba8)
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()
* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Unexpected vocab type as test fail instead of error
    - Useful when automating tests: if you don't know the vocab type in advance, this differentiates it from other loading errors
  - Skip unicode surrogates and undefined codepoints
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint
* Update bruteforce random tests:
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocab files
  - Detokenize special tokens
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check
* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regex splits undefined unicode codepoints
* 'viking' detokenizer clean spaces
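The bruteforce random tests boil down to a round-trip property; here is a minimal executable sketch of the idea, with trivial byte-level stand-ins for the real tokenizer (the actual tests drive llama.cpp's tokenizer over random and edge-case inputs):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Trivial byte-level stand-ins, just to make the property executable.
static std::vector<int> tokenize(const std::string & s) {
    return std::vector<int>(s.begin(), s.end());
}
static std::string detokenize(const std::vector<int> & t) {
    return std::string(t.begin(), t.end());
}

// Round-trip property the tests check: detokenizing the tokens of a
// string should reproduce it (with invalid sequences mapped to U+FFFD,
// and known problematic codepoints skipped upstream).
int main() {
    const std::string s = "hello \xF0\x9F\xA6\x99 world"; // llama emoji
    assert(detokenize(tokenize(s)) == s);
    return 0;
}
```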
* llama : add early return for empty range

  This commit adds an early return to the llama_kv_cache_seq_add and llama_kv_cache_seq_div functions. The motivation for adding this is to avoid looping over the cache when the range is empty. I ran into this when using the self-extend feature in main.cpp.

  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama : add static_cast to fix CI warning/error

  This commit attempts to fix the following warning/error:

  ```console
  src/llama.cpp:7271:31: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
   7271 |                 if (i < hparams.n_layer_dense_lead) {
        |                     ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```

  This can be reproduced locally by setting -Wsign-compare in the Makefile.

  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! llama : add early return for empty range

  Remove the setting of cache.head to 0 when the range is empty.

  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
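A minimal reproduction of the sign-compare fix described above; the struct and function here are illustrative stand-ins, and only the `n_layer_dense_lead` field comes from the commit:

```cpp
#include <cstdint>

struct hparams_t {
    uint32_t n_layer_dense_lead; // unsigned, as in the warning above
};

// Comparing a signed loop index with the unsigned hparam triggers
// -Wsign-compare; casting the index first (as the commit does) fixes it.
int count_dense_layers(const hparams_t & hparams, int n_layer) {
    int n = 0;
    for (int i = 0; i < n_layer; ++i) {
        if (static_cast<uint32_t>(i) < hparams.n_layer_dense_lead) {
            n++; // dense FFN layers before the MoE layers begin
        }
    }
    return n;
}
```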
# Conflicts:
#	.gitignore
#	README.md
#	docs/backend/BLIS.md
#	docs/backend/SYCL.md
#	docs/development/llama-star/idea-arch.key
#	docs/development/llama-star/idea-arch.pdf
#	docs/development/token_generation_performance_tips.md
#	src/llama.cpp
#	tests/test-tokenizer-0.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-spm.cpp
#	tests/test-tokenizer-random.py