2024-09-30:

- added support of Llama 3.2 text models
- added support of phi3 models
- added support of Qwen2 models
- added support of gte_qwen2 embedding models
- added the "embeddings" endpoint
- Whisper: added automatic language detection
- Whisper: the language parameter can now be passed as form data and is optional
- added support of batched queries for the "completions" and "logprob" endpoints
- added CPU offloading support

2024-08-03:

- added support of Llama 3.1
- removed dependency on libjpeg

2024-06-05:

- added the 'streaming_output' configuration parameter

2024-05-21:

- added support of Llama 3
- single libnc_cuda library for cuda 11 and 12
- added the ncdump utility

2024-01-20:

- added Whisper-based speech-to-text transcription
- added Python API examples in 'api_examples'
- added a more generic 'hf_model_convert.py' conversion script
- moved the tokenizer file into the model file
- added Mixtral model support

2023-10-21:

- added support of speculative sampling
- added BNF grammar-based and JSON schema-based sampling
- added Windows cuda GPU support
- added 3-bit quantization (cuda GPU only)
- the default value of the '-T' option is now the number of physical cores
- ts_server: added 'kv_cache_size' and 'kv_cache_max_count' parameters
- ts_server: added 'bind_addr' parameter
- automatically fill the model list in the GUI
- ts_chat: added Windows support
- added Mistral model support
- reduced memory usage and improved speed with large contexts (> 4K)
- added negative_prompt and image input support to Stable Diffusion
- fixed the sd_convert.py conversion script to accept more Stable
  Diffusion models

2023-08-07:

- added the ts_zip utility
- ts_chat: added handling of Ctrl-C
- added an HTML chat GUI

2023-07-21:

- added support of the Llama 2 and MPT models
- reduced memory usage for long contexts
- increased speed
- ts_chat: better terminal support

2023-06-10:

- added the ts_chat utility to easily use chat models
- new supported models: Falcon and RedPajama-INCITE

2023-05-24:

- cuda: fixed handling of large T5 contexts

2023-05-22:

- added T5 encoder and decoder context size parameters
- added cuda version 12.x support

2023-05-04:

- faster T5 and BLOOM model support

2023-03-26:

- added NLLB200 and UL2 model support
- added an HTML GUI

2023-03-15:

- initial release
