What's Changed
- Fix lint.sh by @chhwang in #652
- Update the port channel tutorial doc by @chhwang in #653
- Fix test script by @Binyang2014 in #655
- Update `EndpointConfig` interfaces by @chhwang in #651
- New allreduce algo for small message size by @Binyang2014 in #647
- Fix docs by @chhwang in #656
- Fix docs version by @chhwang in #659
- Exclude irrelevant files from workflow triggers by @chhwang in #663
- Improving DSL documentation by @caiomcbr in #650
- Add exclude paths under pipeline triggers by @chhwang in #664
- Test peer accessibility after deployment by @chhwang in #661
- Rename nvls* files by @chhwang in #660
- Fix #651 by @chhwang in #662
- Add token pool for cuCreate API by @Binyang2014 in #628
- Auto-detect CUDA arch in CMake GPU check by @chhwang in #666
- FP8 support for Allreduce by @seagater in #646
- Resolve IBVerbs Loading Issues by @caiomcbr in #648
- Fixes for no-IB systems by @chhwang in #667
- Integrate MSCCL++ DSL to torch workload by @Binyang2014 in #620
- Add a new logger by @chhwang in #668
- upgrade codeql to v3 by @Binyang2014 in #676
- IB stack enhancements & bug fixes by @chhwang in #673
- Support Synchronous Initialization for Proxy Service by @caiomcbr in #679
- Supporting New Packet Kernel Operation at Executor by @caiomcbr in #677
- `connect()` APIs changed to return an instance instead of a shared_ptr by @chhwang in #680
- Fix Minor Issue Proxy Python Interface by @caiomcbr in #685
- Revise the mscclpp datatype by @seagater in #671
- Fix Error in Non IB Env at Executor by @caiomcbr in #686
- No IB Env CI Test by @caiomcbr in #687
- Fix Python bindings and tests by @chhwang in #690
- DSL Quick Start by @caiomcbr in #689
- Add `CudaDeviceGuard` by @chhwang in #691
- Optimized logger by @chhwang in #693
- Build fixes by @chhwang in #696
- Add an IB multi-node tutorial by @chhwang in #702
- Creating Documentation Section for MSCCL++ DSL by @caiomcbr in #706
- Make IB more configurable by @chhwang in #703
- Improve DSL Documentation by @caiomcbr in #707
- Add handle cache for AMD platform by @Binyang2014 in #698
- Add copilot-instructions.md by @chhwang in #602
- Use uncached memory on Rocm platform to avoid hang by @qishilu in #711
- Replace `__HIP_PLATFORM_AMD__` to use internal macro by @Binyang2014 in #712
- Rename `P2P` log subsys into `GPU` by @chhwang in #716
- Minor fixes by @chhwang in #715
- Remove UB `std::` declarations by @chhwang in #709
- Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 by @seagater in #694
- Update container images for pipeline by @Binyang2014 in #717
- Add CUDA 13.0 Docker images by @chhwang in #720
- Bypassing SSCA alerts by @chhwang in #721
- Add `GpuIpcMemHandle` by @chhwang in #704
- Reduce CI build time by @chhwang in #723
- Use `GpuIpcMem` for NVLS connections by @chhwang in #719
- Fix ci issue by @Binyang2014 in #727
- Fix ci pipeline failure by @Binyang2014 in #729
- Torch integration by @Binyang2014 in #692
- fp8 nvls support (e5m2 and e4m3) by @mahdiehghazim in #730
- Support versioning for mscclpp document by @seagater in #724
- Revert "Support versioning for mscclpp document (#724)" by @seagater in #734
- Use native GPU architecture when NVIDIA GPU is detected; otherwise fall back to multi-arch build. by @mahdiehghazim in #732
- Update document versioning for PR #724 by @seagater in #735
- Fix the relative path extraction on github page by @seagater in #739
- Support multi-node in `MemoryChannel` tutorial by @chhwang in #726
- Address comments for PR #692 by @Binyang2014 in #733
- Refactor reduce kernel by @Binyang2014 in #738
- Fix cpplint error in main branch by @seagater in #740
- Update `copilot-instructions.md` by @chhwang in #722
- create CI pipeline for rocm by @Binyang2014 in #718
- Add a new IB stack impl that doesn't use RDMA atomics by @chhwang in #728
- Support Fusion for ReadPutPacket Operation at DSL by @caiomcbr in #742
- Refactor algo selection logic and introduce symmetric_memory env by @Binyang2014 in #741
- Support uint8 data type for Allreduce by @seagater in #736
- Add new CI pipeline for RCCL test by @Binyang2014 in #746
- Update dtype name by @Binyang2014 in #748
- address flagBuffer ownership issue by @Binyang2014 in #749
- Removing MPI Dependency by @caiomcbr in #743
- Address installation issue in some env by @Binyang2014 in #750
- Mahdieh/switchchannel test clean by @mahdiehghazim in #751
- Disabling Nanobind Memory Leak Warnings in Release Builds by @caiomcbr in #745
- Adjusting Communicator in Python API by @caiomcbr in #752
- Add CI pipeline for no-IB environment testing by @Binyang2014 in #755
- Add new algos for GB200 by @Binyang2014 in #747
- Add doc for perf tuning by @Binyang2014 in #756
- Adding Support to Setting Message Size Range in Native Algorithm API by @caiomcbr in #758
- Do threadInit/cudaSetDevice before other cuda calls by @wuxb45 in #757
- Fix NCCL fallback comm destroy and use latest NCCL release in CI by @Binyang2014 in #760
- Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms by @Binyang2014 in #759
- Fix use-after-free for fabric allocation handle in GpuIpcMemHandle by @Binyang2014 in #764
- Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines by @Copilot in #744
- Install default plans under MSCCLPP_CACHE_DIR/default by @ekwhoa in #769
- Use PTX red for D2D semaphore signal by @Binyang2014 in #768
- Add unit testing framework readme by @chhwang in #766
- Fix run-remote.sh to support multi-command scripts by @Binyang2014 in #770
- Fix CI/CD pipeline issues by @Binyang2014 in #773
- Support E4M3B15 datatype by @Binyang2014 in #765
New Contributors
- @qishilu made their first contribution in #711
- @mahdiehghazim made their first contribution in #730
- @wuxb45 made their first contribution in #757
- @ekwhoa made their first contribution in #769
Full Changelog: v0.8.0...v0.9.0