Release MSCCL++ v0.9.0 · microsoft/mscclpp

What's Changed

Fix lint.sh by @chhwang in #652
Update the port channel tutorial doc by @chhwang in #653
Fix test script by @Binyang2014 in #655
Update EndpointConfig interfaces by @chhwang in #651
New allreduce algo for small message size by @Binyang2014 in #647
Fix docs by @chhwang in #656
Fix docs version by @chhwang in #659
Exclude irrelevant files from workflow triggers by @chhwang in #663
Improving DSL documentation by @caiomcbr in #650
Add exclude paths under pipeline triggers by @chhwang in #664
Test peer accessibility after deployment by @chhwang in #661
Rename nvls* files by @chhwang in #660
Fix #651 by @chhwang in #662
Add token pool for cuCreate API by @Binyang2014 in #628
Auto-detect CUDA arch in CMake GPU check by @chhwang in #666
FP8 support for Allreduce by @seagater in #646
Resolve IBVerbs Loading Issues by @caiomcbr in #648
Fixes for no-IB systems by @chhwang in #667
Integrate MSCCL++ DSL to torch workload by @Binyang2014 in #620
Add a new logger by @chhwang in #668
Upgrade CodeQL to v3 by @Binyang2014 in #676
IB stack enhancements & bug fixes by @chhwang in #673
Support Synchronous Initialization for Proxy Service by @caiomcbr in #679
Supporting New Packet Kernel Operation at Executor by @caiomcbr in #677
connect() APIs changed to return an instance instead of a shared_ptr by @chhwang in #680
Fix Minor Issue Proxy Python Interface by @caiomcbr in #685
Revise the mscclpp datatype by @seagater in #671
Fix Error in Non IB Env at Executor by @caiomcbr in #686
No IB Env CI Test by @caiomcbr in #687
Fix Python bindings and tests by @chhwang in #690
DSL Quick Start by @caiomcbr in #689
Add CudaDeviceGuard by @chhwang in #691
Optimized logger by @chhwang in #693
Build fixes by @chhwang in #696
Add an IB multi-node tutorial by @chhwang in #702
Creating Documentation Section for MSCCL++ DSL by @caiomcbr in #706
Make IB more configurable by @chhwang in #703
Improve DSL Documentation by @caiomcbr in #707
Add handle cache for AMD platform by @Binyang2014 in #698
Add copilot-instructions.md by @chhwang in #602
Use uncached memory on ROCm platform to avoid hang by @qishilu in #711
Replace __HIP_PLATFORM_AMD__ with an internal macro by @Binyang2014 in #712
Rename P2P log subsys into GPU by @chhwang in #716
Minor fixes by @chhwang in #715
Remove UB std:: declarations by @chhwang in #709
Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 by @seagater in #694
Update container images for pipeline by @Binyang2014 in #717
Add CUDA 13.0 Docker images by @chhwang in #720
Bypassing SSCA alerts by @chhwang in #721
Add GpuIpcMemHandle by @chhwang in #704
Reduce CI build time by @chhwang in #723
Use GpuIpcMem for NVLS connections by @chhwang in #719
Fix ci issue by @Binyang2014 in #727
Fix ci pipeline failure by @Binyang2014 in #729
Torch integration by @Binyang2014 in #692
FP8 NVLS support (E5M2 and E4M3) by @mahdiehghazim in #730
Support versioning for mscclpp document by @seagater in #724
Revert "Support versioning for mscclpp document (#724)" by @seagater in #734
Use native GPU architecture when an NVIDIA GPU is detected; otherwise fall back to multi-arch build by @mahdiehghazim in #732
Update document versioning for PR #724 by @seagater in #735
Fix the relative path extraction on github page by @seagater in #739
Support multi-node in MemoryChannel tutorial by @chhwang in #726
Address comments for PR #692 by @Binyang2014 in #733
Refactor reduce kernel by @Binyang2014 in #738
Fix cpplint error in main branch by @seagater in #740
Update copilot-instructions.md by @chhwang in #722
Create CI pipeline for ROCm by @Binyang2014 in #718
Add a new IB stack impl that doesn't use RDMA atomics by @chhwang in #728
Support Fusion for ReadPutPacket Operation at DSL by @caiomcbr in #742
Refactor algo selection logic and introduce symmetric_memory env by @Binyang2014 in #741
Support uint8 data type for Allreduce by @seagater in #736
Add new CI pipeline for RCCL test by @Binyang2014 in #746
Update dtype name by @Binyang2014 in #748
Address flagBuffer ownership issue by @Binyang2014 in #749
Removing MPI Dependency by @caiomcbr in #743
Address installation issue in some env by @Binyang2014 in #750
Mahdieh/switchchannel test clean by @mahdiehghazim in #751
Disabling Nanobind Memory Leak Warnings in Release Builds by @caiomcbr in #745
Adjusting Communicator in Python API by @caiomcbr in #752
Add CI pipeline for no-IB environment testing by @Binyang2014 in #755
Add new algos for GB200 by @Binyang2014 in #747
Add doc for perf tuning by @Binyang2014 in #756
Adding Support to Setting Message Size Range in Native Algorithm API by @caiomcbr in #758
Do threadInit/cudaSetDevice before other cuda calls by @wuxb45 in #757
Fix NCCL fallback comm destroy and use latest NCCL release in CI by @Binyang2014 in #760
Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms by @Binyang2014 in #759
Fix use-after-free for fabric allocation handle in GpuIpcMemHandle by @Binyang2014 in #764
Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines by @Copilot in #744
Install default plans under MSCCLPP_CACHE_DIR/default by @ekwhoa in #769
Use PTX red for D2D semaphore signal by @Binyang2014 in #768
Add unit testing framework readme by @chhwang in #766
Fix run-remote.sh to support multi-command scripts by @Binyang2014 in #770
Fix CI/CD pipeline issues by @Binyang2014 in #773
Support E4M3B15 datatype by @Binyang2014 in #765

New Contributors

@qishilu made their first contribution in #711
@mahdiehghazim made their first contribution in #730
@wuxb45 made their first contribution in #757
@ekwhoa made their first contribution in #769

Full Changelog: v0.8.0...v0.9.0
