Monkey2 Threads
Threads are finally in!
I eventually gave up on my fancy ‘Tasks’ idea from the previous worklog update. It’s not a bad idea, but communicating between tasks would have been pretty cumbersome as it effectively involves serialization and that’s never particularly fun.
I did actually take it quite far. I had a proof of concept thing going where each thread had it’s own garbage collector and all global vars were ‘thread local’, and it more or less worked. The only drawback was that passing stuff between threads was gonna be a real headache.
So I went back to the drawing board and decided to have another hack at a more traditional approach. My main problem with this is the increased cost of the GC write barrier – the small snippet of code that gets executed everytime a GC object is assigned to a (non-local) variable. There are many approaches to implementing write barriers depending on the GC algorithm used, but the monkey2 approach is simply to enqueue the object being assigned if it is unmarked, so that it can be incrementally marked later. The enqueue code looks a bit like this:
1
2
3
4
5
6
7
8
9
10
11
|
'executed whenever you assign a GC obj (or array) to heap variable (ie: field, global)
'
'State is a hidden internal field of all Objects.
'
Function Enqueue( obj:Object )
If obj And obj.State=Unmarked
obj.State=Enqueued
RemoveFromUnmarkedList( obj )
InsertIntoEnqueuedList( obj )
Endif
End
|
The only way I could think of to make this work nicely in a multithreaded environment was to use a global or per object mutex and lock it around the entire function, which felt like it’d be pretty expensive – and it was (although not actually as expensive as I thought it’d be, I think it’d be usable in practice). But there was also the potential to do some ‘lockless’ coding here, something I knew a little about but had never actually used. Lockless programming basically involves solving concurrency problems without the use of mutexes or other locking mechanisms. Initially there just seemed to be to much going on here to be able to make it lockless, but I think I’ve sussed it out and the end result is something like:
1
2
3
4
5
6
7
8
|
Function Enqueue( obj:Object )
If obj And obj.State=Unmarked
If AtomicSwap( obj.State,Enqueued )<>Enqueued 'Atomic swap returns previous value of obj.State
InsertIntoEnqueuedList( obj )
Endif
Endif
End
|
The magic bit here is the ‘AtomicSwap’ operation. This is a special CPU instruction that swaps a var and a value atomically, which means the swap will not be interfered with by another thread. Consider a ‘normal’ swap operation:
1
2
3
4
|
tmp=value1
value1=value2
value2=tmp
|
If a thread switch occurs between any of these operations, the result could be incorrect if the new thread also tries to swap the same 2 values. AtomicSwap effectively guarantees this cannot happen, and that the 3 instructions effectively happen as 1 instruction. The other interesting lockless instruction I ended up using is ‘CompareAndSwap’, which is bit like AtomicSwap except the swap only occurs if value1 is equal to some other value. This can be used to implement InsertIntoEnqueuedlist above in a lockless way, although it appears to be significantly slower than plain AtomicSwap. Still, it only gets done once per ‘sweep’ due to the Enqueued check, and it’s still faster than locking a mutex. And I’m yet to get my hands really dirty with some ASM.
The RemoveFromUnmarkedList step is now performed by the collector not the write barrier (I had to add another ‘next’ field to object for this though) which is not a win speed wise because it still gets done one way or the other, but it does simplify the write barrier to the point where it can be implemented locklessly.
Note that the lockless version is still slower than the single threaded version (by my measurements it’s about 50% slower) but it only happens when objects are assigned to heap vars (better than a read barrier) and I am confident I can speed it up over time. It should even be possible to get the collector running in it’s own thread for an overall speed win. The collector is still mostly incremental, meaning it collects little chunks each time ‘new’ is called, but since new can now be called on any thread, new’s access to the collector now needs a mutex – which effectively serializes access to the collector so the collector *is* sort of running in it’s own thread already (if you squint a bit)…
Finally, one other thing I couldn’t avoid was having to ‘stop the world’. This means suspending ALL threads (except the collector thread!) when a ‘sweep’ happens and resuming them afterwards. This is kinda ugly stuff but I have managed to get it working on all targets (except emscripten) at reasonable speeds (ie: microsecs not millsecs – although ios performance when debugging via xcode isn’t the greatest). Note that this is ‘free’ if there’s only one thread running as there are no other threads to suspend of course. I am 97% sure that both java and c# have to do the same thing for their GCs…I have definitely never read any GC papers that have even hinted at a way around this so I think it’s just one of those bite-the-bullet things with concurrent GC of any kind.
Using Threads
Threads are currently enabled by default, but if there are any problems you can still do a non-threads build by setting MX2_THREADS=0 in your bin/env_blah.txt file and rebuilding all modules. Thread builds are currently a separate config inside the .buildv* folders so you can turn threads on and off without having to rebuild everything once you’ve done an initial build. To start with I have provided the following simple classes:
Thread
Mutex
Condvar
Semaphore
The are all in the std.thread namespace but you will need to #Import “<thread>” into your app to use them. Long term this stuff will likely be moved to the <std> module but for now this setup allows me to build new thread aware mx2cc with old non-thread aware mx2cc. There is also a tests dir in the thread module dir with a few simple tests/demos in it. I was pleased to be able to get Image.Load and PlayMusic both going on a thread. In fact, I don’t think there’s any point in doing an Image.LoadAsync as you can just do this:
1
2
3
4
5
6
7
|
New Thread( Lambda()
_image1=Image.Load(...)
_image2=Image.Load(...)
...etc...
_done=true
End ).Detach()
|
The Future class is still only for use with fibers. It would be very to nice to be able to use this with threads too though, so that’s also on the todo list.
Some sort of mechanism will need to be added to be able to notify (at least) the main thread of events, so things like OnFinished() can be implemented. There is already a c++ module in there (std.async) that provides support for this for use by extern code, so it’s likely that this will end up just converted to plain monkey2.
And of course, it would be nice to have AtomicSwap and AtomicCAS in there too, and perhaps even an ‘Atomic’ type modifier in the language?
Limitations
- Debugger support for threads is currently very crude.
- You should only render graphics on the main thread. Image.Load will work on a thread because it doesn’t actually do any rendering (or even create texures until the first time you DrawImage). I plan to address this eventually, but have no idea what will be involved just yet.
- Lots of internal module globals vars will need to be mutex protected to be safely usable in a multithreaded environment. So far, I have concentrated ONLY on getting the core monkey module thread friendly. More needs to be done here.
- Probably lots of other stuff that hasn’t occured to me yet…
***** Monkey-v2018.07 *****
Added thread support and std.thread module!
Updated android SDK to: Android Studio 3.1.3 ; Android SDK 27 (Oreo 8.1) ; NDK 17.1.4828580 ; Android SDK Tools 26.1.1
Added simple FXAAEffect post effect for simple antialiasing – see modules/mojo3d/tests/effects.monkey2
Updated Visual Studio Community Edition 2017 to 15.7.4
Added simple win32 module and some tests.
Fixed SDL2 issue where SDL_WaitEvent always took a multiple of 10ms to execute. This would have affected code that used purely timer driven update rates without any calls to RequestRender inside OnRender.
Added SUPER experimental SwapAsync flag to Window. Causes window to execute swaps in a separate thread. This allows you to use much finer grained timers etc. Note: it may not be a good idea to do any off-screen rendering with this enabled. It may not work everywhere, but Qt4.8 apparantley recommends it so give it a go! Confirmed to work on windows and macos, but not linux.
Added extern alias symbols to mx2cc, eg: you can now go Alias DWORD:UInt="unsigned long"
(inside an Extern section) to get proper win32 dwords (and LPDWORDs that don’t break c++).