Releases: microsoft/onnxruntime-genai
v0.12.2
- Update examples after 0.12.0 release
- Add missing Quark 0.11 weight patterns for ChatGLM3 output layer
- Support Qwen2.5-VL pre-quantized models in qwen.py
- Fix incorrect batch responses when using multiple prompts
- Harden CUDA error checking across the codebase
- Allow pruned models for prefill
- Add small changes after pruning prefill
v0.12.1
v0.12.0
What's Changed
- Update versions after making 0.11.0 branch by @kunal-vaishnavi in #1867
- Fix guidance usage in continuous decoding by @kunal-vaishnavi in #1870
- Fix HelloPhi C# example by @kunal-vaishnavi in #1871
- Fix regex by @apsonawane in #1875
- Update extensions commit by @apsonawane in #1874
- Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
- Fix condition for NPU by @apsonawane in #1880
- Model builder refactoring by @tianleiwu in #1862
- Add lintrunner to format code by @tianleiwu in #1884
- Remove empty submodule leftover. by @xkszltl in #1883
- Fix build for lack of RTLD_DI_ORIGIN support by @jaeyoonjung in #1888
- Enable graph capture for webgpu by @qjia7 in #1848
- Generic shared emb_tokens/lm_head implementation by @jixiongdeng in #1885
- Fix bug in Squeeze for getting the value of total_seq_len by @Honry in #1886
- Extra_options disable_qkv_fusion to untie qkv_projs from upstream choice by @jixiongdeng in #1893
- Fix mac pipeline by @apsonawane in #1904
- whisper: Support a variant of the whisper pipeline where encoder / decoder are stateful. by @RyanMetcalfeInt8 in #1857
- Add model builder for Qwen2_5_VLTextModel by @tianleiwu in #1882
- Integrate FARA-7B model by @apsonawane in #1902
- Fix gpt-oss model export by @apsonawane in #1861
- OpenVINO: Add support for model caching via 'cache_dir' provider option by @RyanMetcalfeInt8 in #1900
- WinML - Remove the inclusive Microsoft.WindowsAppSDK.ML range check by @chrisdMSFT in #1907
- Run the model in text mode by @apsonawane in #1908
- Update extensions commit by @apsonawane in #1914
- Fix gpt-oss export by @apsonawane in #1915
- Support Olive new uint8 quantization format by @xiaoyu-work in #1916
- Disable CUDA graph for Phi LongRoPE models with IF nodes on TRT-RTX by @anujj in #1921
- Add support for CUDA and CPU arch for Qwen-2.5-VL and Fara-7B by @apsonawane in #1919
- Add Gemma-3 vision tutorial to ONNX Runtime GenAI by @kunal-vaishnavi in #1793
- Quark GPT-OSS support by @thpereir in #1903
- Fix sliding window alignment regression in QNN models by @apsonawane in #1938
- AMD RyzenAI EP Support by @akholodnamdcom in #1935
- Update README by @natke in #1934
- [RyzenAI] Non-pruned models backward compatibility by @akholodnamdcom in #1942
- [VitisAI] EP loader by @akholodnamdcom in #1918
- Set default top_k and top_p if it is None by @xiaoyu-work in #1944
- Ensure dlls are signed in the c and nuget packages. by @baijumeswani in #1947
- Bump torch from 2.7.1 to 2.7.1+cpu in /test/python/directml/torch by @dependabot[bot] in #1868
- Add linker flags for 16 KB page size on Android by @sheetalarkadam in #1860
- Only manually load DLLs if onnxruntime.dll is not already loaded. by @chemwolf6922 in #1800
- Add a doc showing how to run GPT OSS 20B with WebGPU by @natke in #1945
- Add C#, Java, and Objective-C APIs for Config by @kunal-vaishnavi in #1946
- Fix GatherBlockQuantized node to support symmetric quantized LM_HEAD by @sushraja-msft in #1951
- Fix QMoE blockwise quantization support for TRT-RTX execution provider by @anujj in #1926
- Revert "Add a doc showing how to run GPT OSS 20B with WebGPU" by @kunal-vaishnavi in #1950
- Add custom model path support for unit tests by @mpasumarthi-git in #1917
- fix: patch llguidance to remove reference to ring crate by @sanaa-hamel-microsoft in #1948
- Implement graph models for EPs by @qjia7 in #1895
- Update handling EOS token id detection by @kunal-vaishnavi in #1925
- Remove onnxruntime-genai-cuda from the foundry package by @baijumeswani in #1954
- Include linux builds in the foundry ort-genai package by @baijumeswani in #1955
- Support pre-registered plug-in NvTensorRtRtx execution provider library by @anujj in #1889
- [RyzenAI] Linux compatibility fixes by @akholodnamdcom in #1959
- Use cuda 12.8 to build ort-genai by @baijumeswani in #1960
- Bump protobuf from 5.29.5 to 6.33.5 in /test/python by @dependabot[bot] in #1961
- Add RAII wrappers for ORT Model Editor API types by @qjia7 in #1953
- Rewrite all examples using standardization by @kunal-vaishnavi in #1939
- Add versioning to the onnxruntime-genai-cuda.dll by @baijumeswani in #1965
- [Build][Packaging] macOS packaging to skip building x86_64 by @baijumeswani in #1966
- Sync packaging changes with ONNX Runtime by @baijumeswani in #1967
- Release 0.12.0 cherry-pick PR by @baijumeswani in #1978
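The defaulting behavior added in #1944 ("Set default top_k and top_p if it is None") can be illustrated with a minimal pure-Python sketch. The helper name and the default values below are hypothetical, chosen only to show the pattern; they are not the library's actual code or defaults.

```python
# Illustrative defaults only; not the values used by onnxruntime-genai.
DEFAULT_TOP_K = 50
DEFAULT_TOP_P = 1.0

def resolve_sampling_options(top_k=None, top_p=None):
    """Fall back to defaults when the caller leaves a sampling option unset.

    Using `is None` (rather than truthiness) matters: top_k=0 or top_p=0.0
    are legitimate explicit values and must not be silently replaced.
    """
    top_k = DEFAULT_TOP_K if top_k is None else top_k
    top_p = DEFAULT_TOP_P if top_p is None else top_p
    return top_k, top_p
```

In the real library these options are passed through the generator's search options; the sketch only captures the None-handling fix itself.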
New Contributors
- @xkszltl made their first contribution in #1883
- @jaeyoonjung made their first contribution in #1888
- @jixiongdeng made their first contribution in #1885
- @Honry made their first contribution in #1886
- @thpereir made their first contribution in #1903
- @akholodnamdcom made their first contribution in #1935
- @sheetalarkadam made their first contribution in #1860
- @sanaa-hamel-microsoft made their first contribution in #1948
Full Changelog: v0.11.4...v0.12.0
v0.11.4
What's Changed
- WinML - Remove the inclusive Microsoft.WindowsAppSDK.ML range check by @chrisdMSFT in #1907
- Run the model in text mode by @apsonawane in #1908
Full Changelog: v0.11.3...v0.11.4
v0.11.3
What's Changed
- Model builder refactoring by @tianleiwu in #1862
- Add lintrunner to format code by @tianleiwu in #1884
- Remove empty submodule leftover. by @xkszltl in #1883
- Fix build for lack of RTLD_DI_ORIGIN support by @jaeyoonjung in #1888
- Enable graph capture for webgpu by @qjia7 in #1848
- Generic shared emb_tokens/lm_head implementation by @jixiongdeng in #1885
- Fix bug in Squeeze for getting the value of total_seq_len by @Honry in #1886
- Extra_options disable_qkv_fusion to untie qkv_projs from upstream choice by @jixiongdeng in #1893
- Fix mac pipeline by @apsonawane in #1904
- whisper: Support a variant of the whisper pipeline where encoder / decoder are stateful. by @RyanMetcalfeInt8 in #1857
- Add model builder for Qwen2_5_VLTextModel by @tianleiwu in #1882
- Integrate FARA-7B model by @apsonawane in #1902
- Set version as 0.11.3 by @kunal-vaishnavi in #1905
New Contributors
- @xkszltl made their first contribution in #1883
- @jaeyoonjung made their first contribution in #1888
- @jixiongdeng made their first contribution in #1885
- @Honry made their first contribution in #1886
Full Changelog: v0.11.2...v0.11.3
v0.11.2
What's Changed
- Revert removal of eps_without_if_support by @xiaofeihan1 in #1878
- Fix condition for NPU by @apsonawane in #1880
- Set version as 0.11.2 by @kunal-vaishnavi in #1881
Full Changelog: v0.11.1...v0.11.2
v0.11.1
What's Changed
- Cherry pick guidance fix into 0.11.1 release by @kunal-vaishnavi in #1872
- Set version as 0.11.1 by @kunal-vaishnavi in #1873
- Fix regex by @apsonawane in #1876
Full Changelog: v0.11.0...v0.11.1
v0.11.0
What's Changed
- ADO - Update WinML build pipeline by @chrisdMSFT in #1768
- Fix CMakeLists.txt auto-detection of library directory by @anujj in #1774
- Fix new/delete override and Enable cuda kernel test in Windows by @tianleiwu in #1772
- Use abbreviation for TensorRT RTX EP by @kunal-vaishnavi in #1763
- Add trust remote code option to model builder by @kunal-vaishnavi in #1766
- Support block-wise quant in qmoe op by @apsonawane in #1746
- Change the status for TRT-RTX EP by @gaugarg-nv in #1780
- Cherry-Pick changes from rel 0.10.0 back to main. by @chrisdMSFT in #1782
- Fix /CETCOMPAT Usage for Cross-Compiling by @sayanshaw24 in #1779
- Provide distributed version of improved TopK kernel by @hariharans29 in #1710
- [TRT-RTX] Disable KV cache re-computation for Phi models by @gaugarg-nv in #1787
- [CUDA] Add high-performance Top-K kernels and online benchmarking by @tianleiwu in #1748
- Change shared indices array type from float to int by @hariharans29 in #1789
- Enable bfloat16 multi-modal models by @kunal-vaishnavi in #1786
- Disable lmhead while prompt processing by @qti-ashimaj in #1762
- Introduce support for dynamic batching by @baijumeswani in #1662
- Generate pyd type info by @chemwolf6922 in #1742
- Add trt-rtx c packages in c example by @anujj in #1794
- [CUDA] Fix build with CUDA >= 12.9 by @tianleiwu in #1802
- [CUDA] topk kernels v2 by @tianleiwu in #1798
- Add prefill Chunking Support for NvTensorRtRtx and Cuda Providers by @anujj in #1765
- Add TRT-RTX EP support, keep NvTensorRtRtx as user facing name, and force QDQ by @anujj in #1791
- [CUDA] Add static assert to suppress windows build warnings by @tianleiwu in #1804
- Revert "Generate pyd type info" by @baijumeswani in #1805
- [QNN] Support continuous decoding by @baijumeswani in #1808
- ADO Pipeline - nuget_winml_package_reference_version is configured at build time. by @chrisdMSFT in #1811
- Update version to 0.11.0-dev by @baijumeswani in #1815
- Add Support For Tokenizer Options by @sayanshaw24 in #1785
- Fix exit call in README example by @justinchuby in #1823
- Add tokenizer APIs for accessing important ids by @kunal-vaishnavi in #1822
- Use correct classes for config-only usage in model builder by @kunal-vaishnavi in #1828
- Fix packaging pipeline by @baijumeswani in #1829
- Add missing tokenizer methods in java by @baijumeswani in #1833
- Add run options to ONNX Runtime GenAI by @kunal-vaishnavi in #1795
- Avoid Processing EOS Token During Continuous Decoding by @baijumeswani in #1814
- Fix nuget packaging pipeline for dev builds by @baijumeswani in #1837
- Add tool normalization for tool calling by @kunal-vaishnavi in #1838
- Refactor past_present_share_buffer logic into reusable function by @anujj in #1839
- Fix nuget packaging pipeline by @baijumeswani in #1841
- Add enable_webgpu_graph in extra_options by @qjia7 in #1788
- Update tool normalization in ORT GenAI by @kunal-vaishnavi in #1842
- Support RotaryEmbedding in GQA for webgpu ep by @xiaofeihan1 in #1847
- Enable guidance ff tokens for faster inference by @JC1DA in #1803
- Support pre-registered plug-in cuda execution provider library by @baijumeswani in #1850
- ADO: Update pipeline to publish onnxruntime-genai. for relwithdebinfo builds. by @chrisdMSFT in #1855
- Layer-wise KV Cache Allocation for Models with Alternating Attention Patterns by @anujj in #1832
- Mpasumarthi/nvtrt test suite by @mpasumarthi-git in #1756
- bugfix: fix a memory issue in Whisper by @fs-eire in #1859
- Add disable cuda graph when num_beams > 1 and fix set_provider_option bug by @anujj in #1846
- Mixed precision export support for gptq quantized model by @rM-planet in #1853
- Enable If Node Support for TRT-RTX in Phi-3.5/Phi-4 LongRoPE Models by @anujj in #1851
- Fix handling EOS token id detection by @kunal-vaishnavi in #1849
- Ensure Consistent Tool Calling JSON Serialization and Deserialization by @sayanshaw24 in #1863
- Add C# binding for GetNextTokens by @kunal-vaishnavi in #1865
- Set version as 0.11.0 by @kunal-vaishnavi in #1866
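The EOS handling work above (#1849, later updated in #1925) deals with the fact that a model's eos_token_id may be configured as a single integer or as a list of integers. A minimal pure-Python sketch of that normalization idea follows; the function name is hypothetical and this is not the library's actual implementation.

```python
def normalize_eos_token_ids(eos_token_id):
    """Normalize an eos_token_id config value to a set of token ids.

    Models commonly declare eos_token_id either as one int or as a list
    of ints; downstream generation loops then just test membership.
    """
    if eos_token_id is None:
        return set()
    if isinstance(eos_token_id, int):
        return {eos_token_id}
    return set(eos_token_id)

def is_eos(token, eos_token_id):
    # A generation loop would stop (or mask the token) on a hit.
    return token in normalize_eos_token_ids(eos_token_id)
```

The membership-set form keeps the per-token check O(1) regardless of how many EOS ids the model declares.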
New Contributors
- @hariharans29 made their first contribution in #1710
- @qti-ashimaj made their first contribution in #1762
- @chemwolf6922 made their first contribution in #1742
- @qjia7 made their first contribution in #1788
- @xiaofeihan1 made their first contribution in #1847
- @JC1DA made their first contribution in #1803
- @mpasumarthi-git made their first contribution in #1756
- @rM-planet made their first contribution in #1853
Full Changelog: v0.10.0...v0.11.0
v0.10.0
What's Changed
- Enable continuous decoding for NvTensorRtRtx EP by @anujj in #1697
- Use updated Decoder API with skip_special_tokens by @sayanshaw24 in #1722
- Update extensions to include memleak fix by @baijumeswani in #1724
- Support batch processing for whisper example by @jiafatom in #1723
- Update onnxruntime_extensions dependency version by @baijumeswani in #1725
- Include C++ header in native nuget and fix compiler warnings by @baijumeswani in #1727
- Update Microsoft.Extensions.AI to 9.8.0 by @rogerbarreto in #1689
- Update Extensions commit for Qwen 2.5 Chat Template Tools Fix by @sayanshaw24 in #1730
- Whisper Truncation Extensions Commit Update by @sayanshaw24 in #1735
- Enable Cuda Graph for TensorRtRtx by default by @anujj in #1734
- Update sampling benchmark by @tianleiwu in #1729
- Add Windows WinML x64 build workflow by @chrisdMSFT in #1740
- Fix CUDA synchronization issue between ORT-GenAI and TRT-RTX inference by @anujj in #1733
- Hello WindowsML by @chrisdMSFT in #1711
- [CUDA] sampling kernel improvements by @tianleiwu in #1732
- Update GitHub Actions to latest versions by @snnn in #1749
- Update WinML version to 1.8.2091 by @nieubank in #1750
- Address macos packaging pipeline issues by @baijumeswani in #1747
- ProviderOptions level device filtering and APIs to configure model level device filtering by @vortex-captain in #1744
- Fix string indexing bug with Phi-4 mm tokenization by @kunal-vaishnavi in #1751
- Fix TRT-RTX EP regression by @gaugarg-nv in #1754
- Fix typo in C API header by @kunal-vaishnavi in #1753
- Enable WinML by default in ADO pipelines by @chrisdMSFT in #1755
- Change default build configuration to 'relwithdebinfo' by @baijumeswani in #1757
- Pin cmake and vcpkg versions in macOS workflows by @snnn in #1760
- Add TRT_RTX support for onnxruntime-genai-trt-rtx wheel by @anujj in #1736
- rel-0.10.0 by @chrisdMSFT in #1767
- Microsoft.ML.OnnxRuntimeGenAI.WinML.props by @chrisdMSFT in #1776
- Warning fix - ort_genai.h by @chrisdMSFT in #1778
- Microsoft.ML.OnnxRuntimeGenAI.targets by @chrisdMSFT in #1781
Full Changelog: v0.9.2...v0.10.0
v0.9.2
This release fixes a pre-processing bug with Phi-4 multimodal.
Full Changelog: v0.9.1...v0.9.2