Create GlooSharp NuGet Package for Native Gloo C++ Integration #461

@ooples

User Story

As a distributed training developer using AiDotNet on high-performance compute clusters, I want native Gloo library integration through a GlooSharp NuGet package, so that I can leverage optimized collective operations for CPU and InfiniBand hardware without falling back to TCP-only implementations.


Problem Statement

Current State:

The GlooCommunicationBackend<T> class (src/DistributedTraining/GlooCommunicationBackend.cs) contains detection logic for a "GlooSharp" package that does not exist on NuGet.org:

// Line 104-124: Dead code - GlooSharp package does not exist
var glooType = Type.GetType("Gloo.Context, GlooSharp");
if (glooType != null)
{
    // This code path is never reached
    throw new NotImplementedException(
        "GlooCommunicationBackend with Gloo library support is not yet fully implemented...");
}

Problems with Current Approach:

  1. Non-Existent Dependency: The code references a "GlooSharp" package that doesn't exist on NuGet.org
  2. Always Falls Back to TCP: Detection always fails, forcing TCP mode even if the user wants native Gloo
  3. Misleading Documentation: Code comments suggest Gloo integration exists when it doesn't
  4. No InfiniBand Support: The TCP fallback doesn't support high-performance RDMA networks
  5. Performance Gap: The TCP implementation is production-ready but is expected to be significantly slower than native Gloo on supported hardware

Impact:

  • Users on InfiniBand clusters cannot use native RDMA for collective operations
  • High-performance computing (HPC) environments are limited to TCP performance
  • There is no way to leverage Gloo's hardware-specific optimizations (even if the user installs native Gloo)

Proposed Solution

Create GlooSharp - a .NET wrapper NuGet package providing P/Invoke bindings to the native Gloo C++ library.
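A rough sketch of what one such binding could look like follows. P/Invoke binds to C-style exports, while Gloo's public API is C++, so this assumes a thin extern "C" shim library sits between the two. The library name (gloosharp_native), the export (gloosharp_allreduce_sum_f32), and the GlooNative wrapper class are all hypothetical placeholders for illustration; none of them exist in Gloo or AiDotNet today.

```csharp
using System;
using System.Runtime.InteropServices;

public static class GlooNative
{
    // Hypothetical name of a C shim library built around Gloo (assumption).
    private const string Lib = "gloosharp_native";

    // Hypothetical C export: sums `count` floats in place across all ranks
    // and returns 0 on success. Not a real Gloo function.
    [DllImport(Lib, CallingConvention = CallingConvention.Cdecl)]
    private static extern int gloosharp_allreduce_sum_f32(
        IntPtr context, float[] buffer, UIntPtr count);

    // Returns false instead of throwing when the native library or the
    // export is missing, so callers can fall back to the TCP path.
    public static bool TryAllReduceSum(IntPtr context, float[] buffer)
    {
        try
        {
            var count = new UIntPtr((uint)buffer.Length);
            return gloosharp_allreduce_sum_f32(context, buffer, count) == 0;
        }
        catch (DllNotFoundException)
        {
            return false; // native shim not installed on this machine
        }
        catch (EntryPointNotFoundException)
        {
            return false; // shim present but export missing (version skew)
        }
    }
}
```

On a machine without the shim, TryAllReduceSum returns false rather than crashing, which is the behavior the TCP fallback would rely on.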

Design Philosophy

  1. Optional Dependency: GlooSharp is an optional package users install when they need native Gloo performance
  2. Platform-Specific Binaries: Include native Gloo libraries for Windows, Linux, and macOS
  3. Graceful Fallback: If GlooSharp isn't installed, GlooCommunicationBackend continues using TCP mode
  4. Zero Breaking Changes: Existing code continues working without GlooSharp
  5. Production Ready: Only ship when P/Invoke bindings are stable and tested
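The optional-dependency and graceful-fallback points above could be sketched roughly as follows. The type name "Gloo.Context, GlooSharp" is taken from the existing detection code; BackendKind and GlooDetector are hypothetical names used only for this illustration and are not part of AiDotNet.

```csharp
using System;
using System.IO;

// Hypothetical enum for illustration only.
public enum BackendKind { Tcp, NativeGloo }

// Hypothetical helper for illustration only.
public static class GlooDetector
{
    // Assembly-qualified type name used by the existing detection logic.
    public const string GlooContextTypeName = "Gloo.Context, GlooSharp";

    // Returns NativeGloo only when the named type can be loaded;
    // in every other case it selects Tcp without throwing.
    public static BackendKind Detect(string typeName = GlooContextTypeName)
    {
        try
        {
            // throwOnError: false yields null when the assembly is absent.
            var glooType = Type.GetType(typeName, throwOnError: false);
            return glooType != null ? BackendKind.NativeGloo : BackendKind.Tcp;
        }
        catch (Exception ex) when (ex is FileLoadException || ex is BadImageFormatException)
        {
            // The assembly was found but could not be loaded (for example,
            // an architecture mismatch): treat it as "not installed".
            return BackendKind.Tcp;
        }
    }
}
```

With no GlooSharp assembly present, GlooDetector.Detect() selects BackendKind.Tcp, so existing code paths stay unchanged. The typeName parameter exists only so the lookup can be exercised in tests with a type that is known to load.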

Definition of Done

  • Gloo C++ library built for Windows, Linux, macOS
  • Native binaries packaged in runtimes structure
  • GlooSharp project created with P/Invoke bindings
  • Core collective operations implemented (AllReduce, Broadcast, AllGather, Barrier)
  • GlooCommunicationBackend updated to detect and use GlooSharp
  • TCP fallback still works when GlooSharp not installed
  • All unit tests pass
  • GlooSharp.nuspec created and package published to NuGet.org as preview
  • Documentation complete with examples
  • No breaking changes to existing AiDotNet API

Open Questions

  1. Gloo Version: Which version of Gloo should we target? (Recommend: latest stable)
  2. InfiniBand Support: Should v0.1.0 include ibverbs, or defer to v0.2.0?
  3. CUDA Support: Is GPU-direct Gloo support in scope? (Recommend: future enhancement)
  4. Licensing: Gloo is BSD licensed (not MIT) - confirm compatibility with the AiDotNet license
  5. Maintenance: Who maintains native binary builds when Gloo updates? (CI automation?)

Related Issues

  • Code cleanup: Remove non-existent GlooSharp references from GlooCommunicationBackend.cs
  • Future: Add NCCL-style GPU collectives via separate package

Estimated Effort: 65 story points (significant native library integration work)

Priority: Medium - the TCP implementation is production-ready; native Gloo is a performance optimization

Notes:

  • This is a significant undertaking requiring C++ build expertise
  • Alternative: Partner with or sponsor existing Gloo .NET binding projects if they exist
  • Consider creating as separate GitHub repository (GlooSharp) to avoid bloating main AiDotNet repo

Labels: enhancement (New feature or request)
