Table of contents

    Runtime requests

    SIMD

    Alignment

    Handle SIMD objects in structures better, so they get properly aligned by default, otherwise we suffer some performance problems, such as use of unaligned loads, and confusing 10x perf differences caused by random alignment.

    ABI

    Would it be possible to pass SIMD arguments in the SIMD registers on Intel?

    => This is possible, but largely useless until we rewrite the register allocator.

    Ports

    Would like to have LLVM/ARM support SIMD (NEON), as this would help us in MonoTouch. Perhaps we need the same on MonoJIT/ARM for other platforms like Android.

    It would be good to support PowerPC SIMD (AltiVec) for platforms like PS3.

    Newer x86 processors (Intel Sandy Bridge & AMD Bulldozer) introduce another SIMD extension called AVX, and it would be good to support that too.

    Per-arch Method Implementations

    Since many SIMD instructions on exist in one specific instruction set or instruction set extension, it would be useful to have a way to have different versions of a method for different architectures, for example one for SSE1, another for SSE2, and another for NEON. Perhaps this could be done with an attribute and an encoded method name suffix, for example

     [MonoMethodImpl(MethodImplOptions.ArchSpecific)] void Foo ()
    

    then if the processor supports SSE2 and the method

     void Foo_Sse2 ()
    

    exists, it would be used instead.

    Struct as SIMD Wrappers

    When producing nice APIs, it’s often useful to wrap the SIMD intrinsic types structs with a cleaner and more specific API for a specific use, for example a Quaternion that wraps a Vector4f field. Unfortunately, the JIT currently generates horrible code for such cases, as it does not deal well with the indirection, especially when combined with the SSE intrinsics.

    For examples of the code generated, see https://bugzilla.novell.com/show_bug.cgi?id=662127

    Ref overloads in Mono.Simd

    Provide ref overloads for all the methods in Mono.Simd, since when it falls back to non-intrinsic implementations, passing large structs by ref is usually much faster than passing them by value.

    SSE Floating Point on x86

    We should use SSE for floating-point math on x86, like we do on x86-64, instead of using the x87 FPU as we do now.

    Optimization Hinting

    ABC disabling

    Add an attribute (maybe MonoMethodImpl) to disable array bounds checking in specific methods. This would allow it to be disabled in audited library code while still keeping it in user code. Obviously this would only be permitted for unsafe methods.

    Branch hinting

    Add JIT intrinsics for branch hinting, for tuning code such as that which uses Mono.Simd.

    Data Prefetch

    JIT intrinsics for data prefetch instructions. Useful combined with Mono.Simd.

    Byref attributes

    Add an attribute to be applied to struct method parameters to indicate that the JIT should pass them by reference, while retaining a by-value API in the CIL. This would allow byref args for operator overloads, and would remove the need to create byval and byref overloads for perf.

    Optimization Level

    Allow using MonoMethodImpl attribute to hint that a method is important and should be optimized more heavily, maybe even using LLVM.

    Inliner

    Force Inline Attribute

    Perhaps we can steal one of the attributes in the MethodImpl to force inlining for certain methods. If not, perhaps we could add a new MonoMethodImpl attribute for our own JIT optimization control.

    Not sure if Mono can inline any method, or if there are limitations on what we can inline, even when forced to inline. Apparently .NET 3.5 can’t inline methods with struct parameters, which hurts perf of math vector APIs really badly - maybe this is somewhere we could do much better?

    Intrinsics

    Would it be possible to inline certain common code patterns like List<T>.this [int idx]?

    Is the inliner still limited in cases where there is a compare and branch code? Could this limitation be removed?

    This would really improve inlining of common patterns such as properties with argument checking, since in many cases the JIT (or LLVM) could do dead code elimination of the argument checks. For example, in the case

    a = new Foo ();
    b.Property = a;
    

    where b.Property is:

    Foo Property { 
       set { 
          if (value == null) throw new ArgumentNullException ();
          x = value;
       }
    }
    

    then LLVM can do dead-code elimination on the “if value==null”

    P/Invoke Inlining

    Would it be possible to have an attribute to P/Invoke that would flag “this is a simple method that should be treated as an internalcall, do not setup any expensive wrappers”, like for methods that just call into C and are known to not throw exceptions and have a finite execution time (so we do not need to handle Thread.Interrupt there).

    This would help us improve the speed of calling P/Invoke methods.

    What is determined “safe” I am not sure, would love to figure out what we can do about this.

    The concern is not as much the size of the generated wrappers, but the need to execute those wrappers.

    Why the simple solution is not possible

    icall wrappers are needed to be able to do stack walks too. Plus for handling async exceptions. i.e. if a thread gets a signal while it executes an icall, the icall wrapper will throw the ThreadAbortException or such when the icall returns.

    To be able to do stack walks, we need to save some state before calling native code, to be able to handle async exceptions, we need to do a check after each native call.

    Some of this could be inlined at the call site, but calling a native method will never be equivalent to ‘call sin’, it will always have some overhead.

    What can be done

    We need either the wrappers or the functionality they contain.

    We could inline some parts of it, its tricky but doable, that would save the call+parameter passing overhead.

    Its not worth it for common icalls like allocators, since they blow up the size of call sites, but might be worth for icalls which are called from 1 place.

    ICALL performance

    Here is a benchmark to test icall performance:

    using System;
    using System.Diagnostics;
     
    public class Tests
    {
        public static void Main (String[] args) {
            for (int i = 0; i < 10; ++i) {
                int k = (i + 1) / (i + 1);
            }
            var s = Stopwatch.StartNew ();
            int niter = 10000000;
            for (int i = 0; i < niter; ++i) {
                int k = (i + 1) / (i + 1);
            }
            TimeSpan t = s.Elapsed;
            Console.WriteLine (t.Milliseconds);
            Console.WriteLine ("" + (niter / t.Milliseconds) + " iterations per ms");
        }
    }
    

    On an NVIDIA TEGRA, this runs in:

    • normal case: 824ms.
    • normal case, calling mono_get_lmf_addr instead of an inline TLS get sequence: 873ms.
    • normal case, save_lmf=FALSE: 490ms.
    • normal case, create the LMF structure, but don’t push/pop it: 590ms.
    • without any wrappers at all: 372ms.

    Here is the assembly for the wrapper:

       -> save registers and allocate stack frame
       0:   e1a0c00d        mov     ip, sp
       4:   e92d5ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
       8:   e24dd030        sub     sp, sp, #48     ; 0x30
       -> save arguments to stack/registers
       c:   e58d0000        str     r0, [sp]
      10:   e1a07001        mov     r7, r1
       -> load lmf_addr TLS variable
      14:   ebfff322        bl      0xffffcca4
      18:   e590000c        ldr     r0, [r0, #12]
      -> create LMF structure on the stack
      1c:   e28d100c        add     r1, sp, #12
      20:   e5810004        str     r0, [r1, #4]
      24:   e5902000        ldr     r2, [r0]
      28:   e5812000        str     r2, [r1]
      2c:   e5801000        str     r1, [r0]
      30:   e581d00c        str     sp, [r1, #12]
      34:   e1a0200f        mov     r2, pc
      38:   e5812010        str     r2, [r1, #16]
      -> load arguments and make the call
      3c:   e59d0000        ldr     r0, [sp]
      40:   e1a01007        mov     r1, r7
      44:   ebfff31c        bl      0xffffccbc
      48:   e1a01000        mov     r1, r0
      -> load interruption flag
      4c:   e30000a8        movw    r0, #168        ; 0xa8
      50:   e340003c        movt    r0, #60 ; 0x3c
      54:   e5900000        ldr     r0, [r0]
      58:   e1a07001        mov     r7, r1
      -> check it, and branch to interruption code if needed
      5c:   e3500000        cmp     r0, #0
      60:   1a000006        bne     0x80
      -> load return value
      64:   e1a00007        mov     r0, r7
      -> restore LMF
      68:   e28d200c        add     r2, sp, #12
      6c:   e592c000        ldr     ip, [r2]
      70:   e592e004        ldr     lr, [r2, #4]
      74:   e58ec000        str     ip, [lr]
      -> pop stack frame and return
      78:   e282d030        add     sp, r2, #48     ; 0x30
      7c:   e8bd9f80        pop     {r7, r8, r9, sl, fp, ip, pc}
      -> interruption code
      80:   ebfffeac        bl      0xfffffb38
      84:   eafffff6        b       0x64
    

    Support for NSString

    Wondering if we could add built-in knowledge to turn a Mono String into an Objective-C NSString in the same way that we do instrinsics for SIMD. The idea would be to just create a simple shell structure for NSString that is initialized to point to the UTF16 data in Mono’s string.

    GCC compiled a regular NSString constant as something like:

    void *classptr;
    intxx flags;
    void *ucs2data
    

    The idea would be to have a mechanism that allocates this in the stack, and allows us to pass a pointer to this blob to the Objective-C code. In C# this would look like this:

    string a = "hello";
    Some_PInvoke_to_objective_C (Mono.Rutnime.AsNsString (a));
    

    The above would do something like a stackalloc, fill in the class pointer, the flags and the dta pointer to point to Mono’s string.

    I don’t see why this would need stackalloc or special JIT support. You should be able to just define a C# struct and fill the ucs2data field with a fixed expression. All this would happen in an autogenerated wrapper of the P/Invoke call. There would be no speed or memory advantage at doing this specially in the JIT.