Want to show your appreciation?
Please a cup of tea.

Sunday, May 24, 2009

Interlocked vs. Monitor Performance

In .Net, when I need thread safe access to a field, I can either use Monitor or use Interlocked class. I have been in believe that the Interlocked should work faster, otherwise I don't see the value of using it except for inter-process atomic. Today when I was porting the Java's AtomicInteger to .Net, I was wondering how should I implement this. I again faced the choice between Interlocked and Monitor. So I decided to do a test.

The test goes against three different implementations of below interface:

    internal interface IAtomic
    {
        int Value { get; set; }
        int CompareExchange(int newValue, int expected);
        int Exchange(int newValue);
    }

One implementation uses method in the Interlocked static class to operate on an volatile field to provide thread safe access to the integer.

    internal class InterlockAtomic : IAtomic
    {
        private volatile int _value = 50;

        public int Value
        {
            get { return _value; }
            set { _value = value; }
        }

        public virtual int CompareExchange(int newValue, int expected)
        {
            return Interlocked.CompareExchange(ref _value, newValue, expected);
        }

        public virtual int Exchange(int newValue)
        {
            return Interlocked.Exchange(ref _value, newValue);
        }
    }

The 2nd implementation uses Monitor with C#'s lock keyword for all the access.

    internal class MonitorAtomic : IAtomic
    {

        private int _value = 50;

        public int Value
        {
            get { lock (this) return _value; }
            set { lock (this) _value = value; }
        }

        public virtual int CompareExchange(int newValue, int expected)
        {
            lock (this)
            {
                int orig = _value;
                if (expected == orig) _value = newValue;
                return orig;
            }
        }

        public virtual int Exchange(int newValue)
        {
            lock (this)
            {
                int orig = _value;
                _value = newValue;
                return orig;
            }
        }
    }

The 3rd implementation uses  Monitor for all access except the read access thread safety is provided by the volatile.

    internal class MonitorVolatileAtomic : IAtomic
    {

        private volatile int _value = 50;

        public int Value
        {
            get { return _value; }
            set { lock (this) _value = value; }
        }

        public virtual int CompareExchange(int newValue, int expected)
        {
            lock (this)
            {
                int orig = _value;
                if (expected == orig) _value = newValue;
                return orig;
            }
        }

        public virtual int Exchange(int newValue)
        {
            lock (this)
            {
                int orig = _value;
                _value = newValue;
                return orig;
            }
        }
    }

Then run the below methods for each implementation with loop set to one million.

        private void RunCompareExchange()
        {
            int result1, result2, result3;
            for (int i = loop - 1; i >= 0; i--)
            {
                _atomic.CompareExchange(50, 100);
                result1 = _atomic.Value;
                _atomic.CompareExchange(100, 50);
                result2 = _atomic.Value;
                _atomic.CompareExchange(50, 100);
                result3 = _atomic.Value;
            }
        }

        private void RunExchange()
        {
            int result1, result2, result3;
            for (int i = loop - 1; i >= 0; i--)
            {
                _atomic.Exchange(30);
                result1 = _atomic.Value;
                _atomic.Exchange(50);
                result2 = _atomic.Value;
                _atomic.Exchange(100);
                result3 = _atomic.Value;
            }
        }

Finally, we run each method above in multiple threads in parallel. Amount of thread can be passed as command line parameter. Below is the main method, for the detail of how things get done, the full source code is available here.

        static void Main(string[] args)
        {
            int threadCount = 1;
            if (args.Length > 0)
            {
                try
                {
                    threadCount = int.Parse(args[0]);
                }
                catch (Exception e)
                {
                    Console.Error.WriteLine(e.Message);
                }
            }
            Console.WriteLine("Using {0} threads:", threadCount);

            var a = new Program { _atomic = new InterlockAtomic() };
            var b = new Program { _atomic = new MonitorAtomic() };
            var c = new Program { _atomic = new MonitorVolatileAtomic() };
            RunAll(a.RunCompareExchange, threadCount);
            RunAll(b.RunCompareExchange, threadCount);
            RunAll(c.RunCompareExchange, threadCount);
            RunAll(a.RunExchange, threadCount);
            RunAll(b.RunExchange, threadCount);
            RunAll(c.RunExchange, threadCount);
        }

I made the Release build and tested on three different machines

  • Thinkpad laptop running Windows XP Professional SP2 with Core 2 Due T7300 2GHz CPU and 4GB dual channel DDR2. I have the systeminfo for 2 CPU box.
  • HP server (quite old) running Windows Server 2003 SE SP1 with 4 Xeon (Not sure but mostly 2x2) 2.8GHz CPUs and 3.5GB (don't know what memory). Here is the systeminfo for 4 CPU box.
  • HP server (very new) running Windows Server 2003 SE x64 SP2 with 16 Xeon (4x4) 3.4GHz CPUs and 16GB of dual channel DDR2 ECC. Here is the systeminfo for the 16 CPU box.

The test result is very interesting and I cannot explain the reason for this.

Here is the result of 2 CPU test.  In this test the Interlocked methods is clearly faster in all situations.

D:\>InterlockVsMonitor.exe 1
Using 1 threads:
          InterlockAtomic.RunCompareExchange   (ns):     81 Average,     81 Minimal,     81 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):    428 Average,    428 Minimal,    428 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    254 Average,    254 Minimal,    254 Maxmial
          InterlockAtomic.RunExchange          (ns):     81 Average,     81 Minimal,     81 Maxmial
            MonitorAtomic.RunExchange          (ns):    544 Average,    544 Minimal,    544 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    277 Average,    277 Minimal,    277 Maxmial

D:\>InterlockVsMonitor.exe 2
Using 2 threads:
          InterlockAtomic.RunCompareExchange   (ns):    180 Average,    176 Minimal,    184 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):    917 Average,    795 Minimal,   1039 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    507 Average,    472 Minimal,    543 Maxmial
          InterlockAtomic.RunExchange          (ns):    227 Average,    222 Minimal,    232 Maxmial
            MonitorAtomic.RunExchange          (ns):   1007 Average,    973 Minimal,   1041 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    485 Average,    446 Minimal,    524 Maxmial

D:\>InterlockVsMonitor.exe 4
Using 4 threads:
          InterlockAtomic.RunCompareExchange   (ns):    345 Average,    305 Minimal,    370 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):   1901 Average,   1711 Minimal,   2064 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):   1048 Average,    925 Minimal,   1101 Maxmial
          InterlockAtomic.RunExchange          (ns):    395 Average,    322 Minimal,    456 Maxmial
            MonitorAtomic.RunExchange          (ns):   1797 Average,   1488 Minimal,   2030 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    816 Average,    561 Minimal,   1151 Maxmial

D:\>InterlockVsMonitor.exe 16
Using 16 threads:
          InterlockAtomic.RunCompareExchange   (ns):    998 Average,    736 Minimal,   1424 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):   8315 Average,   3623 Minimal,   9941 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):   4480 Average,   3345 Minimal,   5104 Maxmial
          InterlockAtomic.RunExchange          (ns):   2051 Average,   1522 Minimal,   2448 Maxmial
            MonitorAtomic.RunExchange          (ns):   9353 Average,   5795 Minimal,  11104 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):   3419 Average,   1509 Minimal,   4582 Maxmial

In the 4 CPU box, Interlocked start to lose the race when number of parallel thread increases.

D:\>InterlockVsMonitor.exe 1
Using 1 threads:
          InterlockAtomic.RunCompareExchange   (ns):    181 Average,    181 Minimal,    181 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):    853 Average,    853 Minimal,    853 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    461 Average,    461 Minimal,    461 Maxmial
          InterlockAtomic.RunExchange          (ns):    179 Average,    179 Minimal,    179 Maxmial
            MonitorAtomic.RunExchange          (ns):    796 Average,    796 Minimal,    796 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    441 Average,    441 Minimal,    441 Maxmial

D:\>InterlockVsMonitor.exe 2
Using 2 threads:
          InterlockAtomic.RunCompareExchange   (ns):    864 Average,    863 Minimal,    865 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):   1984 Average,   1965 Minimal,   2003 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    948 Average,    870 Minimal,   1026 Maxmial
          InterlockAtomic.RunExchange          (ns):   1155 Average,   1155 Minimal,   1156 Maxmial
            MonitorAtomic.RunExchange          (ns):   1852 Average,   1768 Minimal,   1936 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    925 Average,    852 Minimal,    999 Maxmial

D:\>InterlockVsMonitor.exe 4
Using 4 threads:
          InterlockAtomic.RunCompareExchange   (ns):   2558 Average,   2539 Minimal,   2575 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):   4553 Average,   4241 Minimal,   5198 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):   2502 Average,   2438 Minimal,   2543 Maxmial
          InterlockAtomic.RunExchange          (ns):   4809 Average,   4748 Minimal,   4870 Maxmial
            MonitorAtomic.RunExchange          (ns):   4659 Average,   4504 Minimal,   4780 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):   2455 Average,   2378 Minimal,   2509 Maxmial

D:\>InterlockVsMonitor.exe 16
Using 16 threads:
          InterlockAtomic.RunCompareExchange   (ns):   9238 Average,   7494 Minimal,  10111 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):  17039 Average,  12189 Minimal,  18937 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):   9110 Average,   6562 Minimal,  10364 Maxmial
          InterlockAtomic.RunExchange          (ns):  12504 Average,   5275 Minimal,  18905 Maxmial
            MonitorAtomic.RunExchange          (ns):  17205 Average,  11394 Minimal,  19518 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):   8934 Average,   7105 Minimal,  10300 Maxmial

In the 16 CPU box, Interlocked can only win the game in single thread test. Interlocked is nearly 4 times slow in the 2 and 4 thread test. And somehow get close in the 16 thread test.

D:\>InterlockVsMonitor.exe 1
Using 1 threads:
          InterlockAtomic.RunCompareExchange   (ns):    138 Average,    138 Minimal,    138 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):    463 Average,    463 Minimal,    463 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    311 Average,    311 Minimal,    311 Maxmial
          InterlockAtomic.RunExchange          (ns):    133 Average,    133 Minimal,    133 Maxmial
            MonitorAtomic.RunExchange          (ns):    457 Average,    457 Minimal,    457 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    257 Average,    257 Minimal,    257 Maxmial

D:\>InterlockVsMonitor.exe 2
Using 2 threads:
          InterlockAtomic.RunCompareExchange   (ns):   1855 Average,   1855 Minimal,   1855 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):    876 Average,    873 Minimal,    879 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    482 Average,    448 Minimal,    517 Maxmial
          InterlockAtomic.RunExchange          (ns):   1821 Average,   1821 Minimal,   1822 Maxmial
            MonitorAtomic.RunExchange          (ns):    825 Average,    760 Minimal,    891 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    501 Average,    498 Minimal,    505 Maxmial

D:\>InterlockVsMonitor.exe 4
Using 4 threads:
          InterlockAtomic.RunCompareExchange   (ns):   4158 Average,   4158 Minimal,   4160 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):   1763 Average,   1731 Minimal,   1815 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):    955 Average,    929 Minimal,    998 Maxmial
          InterlockAtomic.RunExchange          (ns):   4192 Average,   4172 Minimal,   4199 Maxmial
            MonitorAtomic.RunExchange          (ns):   1766 Average,   1628 Minimal,   1824 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):    948 Average,    786 Minimal,   1016 Maxmial

D:\>InterlockVsMonitor.exe 16
Using 16 threads:
          InterlockAtomic.RunCompareExchange   (ns):   8399 Average,   8347 Minimal,   8435 Maxmial
            MonitorAtomic.RunCompareExchange   (ns):  11881 Average,  11595 Minimal,  12082 Maxmial
    MonitorVolatileAtomic.RunCompareExchange   (ns):   7296 Average,   6994 Minimal,   7411 Maxmial
          InterlockAtomic.RunExchange          (ns):   8214 Average,   8180 Minimal,   8257 Maxmial
            MonitorAtomic.RunExchange          (ns):  11984 Average,  11556 Minimal,  12197 Maxmial
    MonitorVolatileAtomic.RunExchange          (ns):   7086 Average,   6707 Minimal,   7327 Maxmial

I don't have good explanation to the test result. I don't understand why the 2 CPU box is the fastest for most of tests. And Monitor performs better then Interlocked in multi-thread test on the box has more CPUs.

Update 5/25/2009: One thing is clear that Interlocked is always the fastest in single thread tests. Given another thought, it indicates to me that I should use Interlocked in the cases where access collision is rarely to occur although it does occasionally. For example, preventing the occasional write from slowing down main thread's excessive write access!?

3 comments:

Zach Saw said...

That's easy to reason why your test results are the way they are.

Have a look at their respective CPU architectures.

Core2Duos have shared L2. Interlocked ops don't have to lock the FSB, snoop the bus, and wait for reply from the CPU with the most recent updated value. The only penalty the Core2Duos pay for going out to L2 is the longer latency vs its own L1.

Your server boxes though are way less efficient. In the case of your tests, the CPU with the most updated value will have to post a value in reply to a snoop request from the CPU trying to update the value. Guess what, bottleneck right there. Your FSB is always going to be a lot slower than your on-silicon L2 cache.

As we go forward, FSB would become a thing of the past. This would somewhat alleviate the bottleneck but in no way it would be as fast as your true multicore CPUs.

Kenneth Xu said...

@Zach, thanks for the comment and what you described makes a lot of sense in explaining the differences between Core2Dues and my server boxes. Do you have any theory on the diffs between two server boxes?

Kenneth Xu said...

Looking it back again. Also, that doesn't explain why Monitor is actually faster. Monitor also need to go through FSB. The difference is that Interlocked is doing busy wait and retry which becomes a real issue with high performance counting. The fact the people are looking at another alternative well explained it. http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166edocs/jsr166e/LongAdder.html

Post a Comment