5. Full Concurrent Collection
As described in Section 3, root objects of traversal are marked out dynamically by reference counting during the execution of application program, and because application code is all gc-safe, the tracing collector can start up at any place without delay. Since in our approach, we don't need to suspend an application thread to determine references from the root-set of the thread, we can use more aggressive barrier with an incremental collector to achieve full concurrent collection. Herein, more aggressive barrier means we can use something like multi-threading synchronization mechanism instead of traditional read/write barrier, and no fear of causing deadlock. For an oversimplified example, if an application thread (mutator) locks a resource that collector needs, and the mutator was suspended by garbage collection, then the collector cannot get what it need and cause a deadlock. Although the collector can resume mutators and try to perform collection later, but in practice, it is difficult to implement such type of incremental collector and also existing some other issues. Fortunately, our approach never suspends any mutators, and the collector can merely contend with mutators as normal multi-threading programmings.
Unlike some prior designs using "GetWriteWatch" to monitor changes to managed object, we coalesce read or write barrier into the operations of reference count maintenance. It broadens the potential platform base, because the underlying platform need not to provide special services, such as SuspendThread, GetWriteWatch, and even virtual memory service is not a prerequisite. By adding some code to collaborate with collector in the reference counting maintenance, we can simulate some write barrier or read barrier functionality for specific incremental collection algorithm. Maintenance of reference counters, including lock-counter and extra-counter, involve counter decrement and increment operations. In the decrement operations, we can simulate write barrier for the "snapshot-at-beginning" collector as follows. The system records down all referents that have lost reference during the collection, and re-traversing these objects, thus no object is lost from the view of the snapshot-at-beginning collection. In the increment operations, we can simulate the write barrier for the "forward incremental update" collector. By recording down all new references to object and re-traversing these objects, no reference is missing. The barrier or recording operation should occur close to reference counting operation, either before or after decrement/ increment reference counting operations. The barrier or recording operation must not be conducted too late that other related reference counting operation has been conducted, i.e. recording down the referent object must be conducted ahead of the last reference (known to collector) to the object dropped. In one of our implementations, recording is conducted before the return of increment operation to the application program, which may execute statements that drop the reference to the object.
The reference counting operations on lock-counter should be trapped, and every newly-referenced objects during the tracing collection should be recorded down, or a global flag is marked informing the collector to rescan objects for active (locked) objects. Unlike some prior designs only traps changes in the managed heap, we trap the changes in root-set area also. We can imagine the whole outside world of managed heap, the extended root-set, as a large rapid-changed virtual object. References from the virtual object are grouped by reference counting, so that multiple references to an managed object are treated as one effective reference to mark the object with positive lock-counter value. Based on this concept, we can easy understand why we need to trap reference from extended root-set, just the same as trapping references changes between managed objects. Also, more optimization can apply to filter out redundant operations, such as filtering out write barriers where the object's lock-counter remains a non-zero state unchanged.
5.1 An Implementation
Following, we will give an implementation of our full concurrent collector. We choose "forward incremental update" tracing collection algorithm and write barriers in the increment operations of reference counting. Application threads (mutators) and thread running tracing collection (collector) are full concurrent threads. Mutators run like foreground threads and the collector runs like a background thread. When the collector is performing tracing collection, some operations pertaining to memory management are trapped, including creating a new object, creating a new reference to an existing object, etc. These operations are synchronized with the running collector, the worst case of contention delay is predictable and is at micro-second level on stock hardware.
The tracing collection is conducted as follows. The collector scans all objects (excluding new objects created during the collection process), and converts "white" active (marked-out) objects to "gray". Any scanning orderings or methods are permitted, providing that all existing objects are scanned thoroughly. Another actions, e.g. object traversal, can be interleaved with object scanning. If some objects become "marked-off" or become inactive during the scanning, it does not matter. Because the reference assignment operations are trapped by the system, no matter the changed objects are treated as active or inactive, the results are both correct.
The system can calls traverse routine once there is a gray object, no need to wait for all root objects be determined. Various traversal algorithms can be applied, such as depth first, breadth first or any exhaustive traversal algorithms. When all the referents of an object become "gray", the object can be converted to "black".
During the scanning and/or traversal, concurrent application threads keep changing the relationship graph. Assignment operations that create new references to objects are trapped and cause the "white" referents be converted into "gray"; newly created objects are treated directly as color "black". All "gray" objects must be traversed, and finally be converted to color "black". The collector should keep converting "gray" object to "black" until there is no "gray" object waiting and all previous-existing objects are scanned. This approach guarantees that the garbage collection process will eventually catch up with the reference changes by application threads. Because, as garbage collection proceeds, "gray" is a transition state and eventually will become "black"; write barrier on new reference creation and object scanning for marked-out will cause "white" objects to be converted to "gray" and finally "black"; and, once garbage collection starts, newly created objects are treated as "black". Thus, the number of "white"objects will not increase. At a certain time, the scanning and traversal will catch up with changes, all "gray" objects become "black", leaving only "black" and "white" objects.
After scanning and determining root objects, once there are no "gray" objects waiting for traversal, there will subsequently be no more "gray" object in current collection (This statement can be easily proved by contradiction). Once there are no gray objects, it means the collector catch up the change of relationship and means the end of mark phase. After the mark phase, white objects are collected as garbage.
5.2 Detailed Depiction
Figure 4 is block diagram of the mark phase of the collector;
Figure 5 are flowcharts of the mark phase of collector, traps flowcharts of assignment that creates new references, and flowcharts of creating new objects.
In Figure 4 there are three function codes executing. They are tracing collector code (COLLECTOR) in mark phase, mutator trap code (ASSIGN) of creating new reference to an object, and mutator trap code (CREATE) of creating new object. Exclusive lock L1 and flag F1 protect the shared data structure among mutators and collector. The shared data structures are those internal data structures of the garbage collection system, and can be accessed by mutator trap code ASSIGN, CREATE. The collection SA and SG, which can be lists, enable ASSIGN code to work concurrently with tracing collector. These two collections contain references to white objects that should be converted to gray. ASSIGN should only access SA, and COLLECTOR can access SG and may switch SG, SA at proper time. New objects created during the mark phase are put into collection SB. Tracing collector scans white object list LW for marked-out root objects and converts them from "white" to "gray" into gray object list LG, then the traversal of gray objects makes gray objects moved into black object list LB.
By using these separated lists or collections, the contention of shared data structure is reduced considerably. Mutators should first acquire lock L1 before accessing SA and SB, and the collector should acquire lock L1 when it wants to switch SA and SG.
From the flowchart in Figure 5, we can notice that in these steps there is no operation of suspending an application thread, and the durations of holding exclusive lock L1 are very short. The worst cases latency is the maximum duration of those race points. The race points in the mark phase, we can see, are simple and quick with constant cost complexity. They only occur at the point that two or more threads are contending for the same resource. Further optimizations can be applied to, such as using a queue for each application threads and using some tricks, we can use atomic fetch and store instructions instead of racing for an exclusive lock among threads.
During the tracing collection, if an object's reference count drops to zero, we can reclaim the object immediately and provide strict ordering of destructor execution. Certainly, a zero-referenced object does not mean the garbage collector is not using it. So, before reclaiming an object, all references inside garbage collector to the object should be removed carefully.
(This description of HnxGC concurrent collector implementation was finished several months ago, so it does not reflect the improvements we did during this period, such as lock-free and global memory fence.)