joaomlneto / unstickymem

Library for Dynamic Placement in NUMA Nodes

C++ 87.20% C 9.24% CMake 2.05% Shell 1.51%

numa library dynamic placement


unstickymem's Issues

Cleanup PagePlacement into two logical components (uniform and weighted) or make uniform the special/default configuration of the weighted

Add Documentation to Wiki

Currently we only have a README.md in the root directory.

Not only is this limiting, but right now it is also completely outdated.

Before making things public (after the publication), we should document it properly.

Sections:

How to install
How to compile
How to run (LD_PRELOAD or linking in the source code)
Library modes
Library options (configuration file, environment variables)

Interleave bug

There is a bug that does not uniformly distribute the memory among the nodes.

Improve the decision-making algorithm

The decision-making algorithm does the following:

Initialization steps

Interleave all memory objects (must use numactl --interleave=all for now)
Wait a few seconds before starting the tuning process.

Tuning process

Take N measurements of the memory stall rate, T microseconds apart.
Discard k measurements (the k/2 smallest and k/2 greatest).
Compute the average of the remaining N - k measurements.
If the average is lower than the previous one, keep going (shift a few more pages into the local node). Otherwise, stop.

unstickymem/src/unstickymem/unstickymem.cpp

Lines 113 to 122 in 2d6cdef

    // slowly achieve awesomeness
    while(local_ratio < 1.00) {
      LWARNF("GOING TO CHECK A LOCAL RATIO OF %lf", local_ratio);
      place_all_pages(local_ratio);
      stall_rate = get_average_stall_rate(NUM_POLLS, POLL_SLEEP, NUM_POLL_OUTLIERS);
      LINFOF("RATIO = %lf STALL RATE = %lf", local_ratio, prev_stall_rate);
      if (stall_rate > prev_stall_rate) break;
      prev_stall_rate = stall_rate;
      local_ratio += 0.01;
    }
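
The measurement step in the tuning process (take N samples, discard the k/2 smallest and k/2 greatest, average the rest) can be sketched as follows. This is only an illustration: `trimmed_mean` is a hypothetical helper, not the library's `get_average_stall_rate`, and it assumes the samples have already been collected and that k is even.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of unstickymem): average of `samples` after
// dropping the k/2 smallest and the k/2 largest values.
double trimmed_mean(std::vector<double> samples, std::size_t k) {
  // Assumption: k is even and at least one sample survives the trimming.
  assert(k % 2 == 0 && k < samples.size());
  std::sort(samples.begin(), samples.end());
  const std::size_t drop = k / 2;
  const auto first = samples.begin() + static_cast<std::ptrdiff_t>(drop);
  const auto last = samples.end() - static_cast<std::ptrdiff_t>(drop);
  return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
}
```

Sorting a copy keeps the caller's sample buffer untouched, which matters if the raw measurements are also logged.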

`place_pages` is not idempotent

Calling place_pages with the same arguments results in extra pages being moved into the local node.

We wanted to improve it anyways -- rewrite!!

Extend `place_pages` to have independent weights per node

Currently we assume there is 1 local/worker node.

The function takes one parameter, the local percentage: that share of the segment is placed in the local node, and the remainder is interleaved across the three other nodes.

Provide another version that can have different weights for each node.
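
One possible shape for the per-node version is to turn a weight vector into a page count per node. The sketch below is an assumption about how that could look, not the existing `place_pages` code: `pages_per_node` is a hypothetical name, weights are normalized by their sum, and pages lost to rounding are handed out to the first nodes in order.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: split `total_pages` across nodes in proportion to
// `weights` (one entry per node). Weights need not sum to 1.
std::vector<std::size_t> pages_per_node(std::size_t total_pages,
                                        const std::vector<double>& weights) {
  const double sum = std::accumulate(weights.begin(), weights.end(), 0.0);
  assert(!weights.empty() && sum > 0.0);
  std::vector<std::size_t> counts(weights.size(), 0);
  std::size_t assigned = 0;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    // Round down so we never hand out more pages than exist.
    counts[i] = static_cast<std::size_t>(
        std::floor(static_cast<double>(total_pages) * (weights[i] / sum)));
    assigned += counts[i];
  }
  // Distribute the pages lost to rounding, one per node, in node order.
  std::size_t leftover = total_pages > assigned ? total_pages - assigned : 0;
  for (std::size_t i = 0; leftover > 0; i = (i + 1) % counts.size(), --leftover) {
    ++counts[i];
  }
  return counts;
}
```

With this shape, the current single-parameter behaviour on a four-node machine would correspond to weights such as `{local, r, r, r}` with `r = (1 - local) / 3`.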

Add fine-grained adaptive placement algorithm

Algorithm should treat all the contiguous mapped memory as blocks of N pages.

Should place x% of each block in the local node and interleave the remaining (100-x)% across the other nodes.
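
A minimal sketch of the block-wise mapping, under assumptions not stated in the issue: the first part of every block goes to the local node, the rest is assigned round-robin to the remote nodes, and node IDs are contiguous from 0. `node_for_page` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: decide the target node for page number `page` when
// memory is treated as consecutive blocks of `block_pages` pages.
// `local_ratio` is the fraction (0.0 to 1.0) of each block kept local.
int node_for_page(std::size_t page, std::size_t block_pages, double local_ratio,
                  int local_node, int num_nodes) {
  assert(block_pages > 0 && num_nodes >= 2);
  assert(local_ratio >= 0.0 && local_ratio <= 1.0);
  assert(local_node >= 0 && local_node < num_nodes);
  const std::size_t offset = page % block_pages;  // position inside the block
  const auto local_pages = static_cast<std::size_t>(
      std::llround(local_ratio * static_cast<double>(block_pages)));
  if (offset < local_pages) return local_node;
  // Round-robin over the remote nodes, skipping the local node's ID.
  const int remote = static_cast<int>((offset - local_pages) %
                                      static_cast<std::size_t>(num_nodes - 1));
  return remote < local_node ? remote : remote + 1;
}
```

Consecutive pages with the same target node can then be merged into one range before calling `mbind`, which keeps the number of system calls per block small.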

Evaluate Gureya's Benchmark

Measured stall rate and decisions taken by the algorithm

Measurements at 2018-10-10 on intel14cores-v2

Local Ratio	0	0-1	0-2	0-3	0-4	0-5	0-6
25%	0,8652997439	0,8523260392	0,8057950605	0,7694768156	0,7407080060	0,7151234701	0,6881931666
30%	0,8634031441	0,8474194553	0,7984094285	0,7614674645	0,7408262380	0,7148938137	0,6872798011
35%	0,8590710471	0,8382672294	0,7844607808	0,7465582615	0,7412492110	0,7140231972	0,6868939998
40%	0,8534129592	0,8265852018	0,7677881273	0,7427290218	0,7413086173	0,7148290723	0,6826531546
45%	0,8473018202	0,8161373897	0,7703117436	0,7470264965	0,7344344647		0,6940479186
50%	0,8417095835	0,8201595776			0,7372542016
55%	0,8369322683
60%	0,8334777535
65%	0,8316501186
70%	0,8305620997
75%	0,8302131814
80%	0,8301691608
85%	0,8301030693
90%	0,8299428306
95%	0,8299722645
100%	0,8300216876
Minimum Stall Observed	0,8299428306	0,8161373897	0,7677881273	0,7427290218	0,7344344647	0,7140231972	0,6826531546
Optimal Local Ratio	90%	45%	40%	40%	45%	35%	40%
Decision (local ratio)	100%	50%	45%	45%	50%	40%	45%
Per-Node Throughput	3 954MB/s	3 827MB/s	3 526MB/s	3 424MB/s	3 364MB/s	3 174MB/s	2 960MB/s

Improve how library is configured by the user

Dynamic Memory Mapping

Currently we only support checking the memory map once and acting upon it.

If the memory mapping changes and we act on it again, we run into the fact that place_pages is not idempotent (see #9 for explanation and example).

Evaluate Streamcluster

@gureya which version of Streamcluster are you using?

I recall comments about multiple versions existing during the last Skype call.

Support single-node machines

Currently we will just crash if we're running on a machine with a single NUMA node.

We should simply show a warning and do nothing.
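
A guard along these lines could run at startup. The sketch assumes Linux, where each NUMA node appears as a `nodeN` directory under `/sys/devices/system/node`; `count_numa_nodes` and `should_enable_placement` are hypothetical names, and the real library might query libnuma instead.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: count "node<N>" entries in the sysfs NUMA directory.
// Returns 0 if the directory is missing or unreadable.
int count_numa_nodes(
    const std::filesystem::path& sysfs_dir = "/sys/devices/system/node") {
  namespace fs = std::filesystem;
  std::error_code ec;
  int count = 0;
  for (fs::directory_iterator it(sysfs_dir, ec), end; !ec && it != end;
       it.increment(ec)) {
    const std::string name = it->path().filename().string();
    const bool is_node =
        name.size() > 4 && name.compare(0, 4, "node") == 0 &&
        std::all_of(name.begin() + 4, name.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    if (is_node) ++count;
  }
  return count;
}

// Placement only makes sense with at least two nodes; otherwise the caller
// should log a warning and leave memory untouched.
bool should_enable_placement(int num_nodes) { return num_nodes >= 2; }
```

Treating an unreadable directory as zero nodes means the library would also stay inactive on systems without NUMA sysfs entries, rather than crashing.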

Add very-coarse adaptive placement algorithm

Algorithm should treat all the regions as one, place x% in the local node, and interleave the remaining (100-x)% across the other nodes.

Sections in INI file

Basic documentation

Write some basic documentation for users and future contributors

Enable a code linter (I suggest codefactor.io)

Adaptive noise-avoiding stop-condition not working

The mechanism to avoid stopping due to noise is simply not working.

unstickymem/src/unstickymem/mode/WeightedAdaptiveMode.cpp

Lines 94 to 98 in ca4533c

    if (get_average_stall_rate(_num_polls * 2, _poll_sleep,
                               _num_poll_outliers * 2)) {
      LINFO("I guess so!");
      break;
    }

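If we get a higher stall rate we trigger this confirmation mechanism. However, the return value of get_average_stall_rate is used directly as the condition instead of being compared with the previous stall rate, so any non-zero result stops the adaptive placement, even when the confirmation measurement is actually lower.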

This needs to be fixed in all adaptive mechanisms (weighted and uniform)
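
A sketch of what the intended check might look like: stop only if a second, longer measurement is also above the previous stall rate. `should_stop` and the `remeasure` callback are hypothetical; in the library the callback would wrap the `get_average_stall_rate` call with doubled poll counts.

```cpp
#include <functional>

// Hypothetical helper: decide whether the adaptive loop should stop.
// `remeasure` is only invoked when the first measurement looks worse.
bool should_stop(double stall_rate, double prev_stall_rate,
                 const std::function<double()>& remeasure) {
  if (stall_rate <= prev_stall_rate) return false;  // no regression: continue
  const double confirmed = remeasure();             // second opinion
  return confirmed > prev_stall_rate;               // stop only if confirmed
}
```

Keeping the comparison inside one function would let the weighted and uniform modes share the same stop condition.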

Gather list of data segments

We need to find a way of checking what data segments exist, in order to pass them to mbind.
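
One way to enumerate segments on Linux is to parse `/proc/self/maps`, where each line starts with `start-end perms offset dev inode` followed by an optional name. The sketch below is illustrative only; `Segment` and `parse_maps_line` are hypothetical names and the code assumes a 64-bit address space.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record for one line of /proc/self/maps.
struct Segment {
  std::uintptr_t start = 0;
  std::uintptr_t end = 0;  // exclusive
  std::string perms;       // e.g. "rw-p"
  std::string name;        // e.g. "[heap]"; empty for anonymous mappings
};

// Parse one maps line into `out`; returns false if the line is malformed.
bool parse_maps_line(const std::string& line, Segment& out) {
  std::istringstream in(line);
  std::string range, offset, dev, inode;
  if (!(in >> range >> out.perms >> offset >> dev >> inode)) return false;
  const auto dash = range.find('-');
  if (dash == std::string::npos) return false;
  try {
    out.start = std::stoull(range.substr(0, dash), nullptr, 16);
    out.end = std::stoull(range.substr(dash + 1), nullptr, 16);
  } catch (const std::exception&) {
    return false;  // addresses were not valid hexadecimal numbers
  }
  out.name.clear();
  in >> std::ws;               // skip padding before the optional name
  std::getline(in, out.name);  // leaves name empty if nothing follows
  return out.start < out.end;
}
```

A caller would read the file line by line with `std::ifstream` and keep, for example, only writable private mappings (`perms` starting with `rw` and ending in `p`) as candidates for `mbind`.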

How are process memory segments stored in the kernel?

How does the performance of mmap/malloc change depending on the number of mappings?
What are the limitations?
What is the data structure where these mappings are stored? Can we use a better one?

If we have 2 segments, what happens if we mbind both of them on the same mbind call? Do they merge into one segment?

Get rid of hardcoded absolute paths

Some Unit Tests?

We should try to write some functional tests for parts of the library.

Not an easy task, given the nature of the library.

Properly initialize counters inside application

Old bug, but ideally we want to avoid running setup-counters-$(arch) all the time.

Enable Travis publishing GitHub releases

Probably a good time to learn about packaging and stuff :-)

Make `rdpmc` portable across different CPUs

Currently it will work on Broadwell/Haswell (which is the case of intel14cores).

Intel Performance Counters:

1 (countreg 0x40000001) - Haswell - Core Clock Cycles
111 (event 0xa2, mask 0x01) - Haswell - Any Resource Stall

AMD Performance Counters:

??????????

Changes are being tracked on the perf branch

What happens if we `mbind` part of the heap? Does it split into multiple segments?

Function to force memory interleaving

As memory goes from an interleaved state to a local state, it would be nice to offer functionality to make it go the opposite way.

Basic Tutorial

A new user with a brand new NUMA machine with default e.g. Ubuntu -- how should I install/run your application? Should cover package dependencies and a comprehensive list of commands.

Handling many small allocations

If we're dealing with an application that makes many small allocations, the algorithm in WeightedAdaptiveMode won't work.

@gureya has a fix ready in his branch :-P

I'm just writing it here not to forget we had a problem, and to (try and) make a unit test later.

The fix is basically:

Set the default memory policy to MPOL_INTERLEAVE on all nodes in the library constructor
Ignore any new segments sent from MemoryMap until the WeightedAdaptive mode starts
Upon WeightedAdaptive mode starting, apply initial weights
Proceed normally

The side effect is that the segments will be uniformly interleaved until the mode starts, although we don't expect this to be significant.

Wrap Mapping Functions

We want to be able to know the set of all existing segments to apply mbind on.

To this end, we'll intercept standard C/linux functions that deal with memory mappings/allocations:

mmap
munmap
sbrk

This mechanism has significant advantages over reading /proc/self/maps (#11):

Deal with user semantics rather than the kernel -- we won't observe a segment getting fragmented as when reading the /proc/self/maps file.
Up-to-date information -- we are able to have an always up-to-date set of segments
Thread-safety -- we won't be plagued by read-free-mbind crashes, as we are able to stop the process from making changes during vulnerability periods.
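
The wrappers need somewhere to record what they observe. Below is a minimal sketch of such a bookkeeping structure; `SegmentRegistry` is a hypothetical name, and it deliberately ignores partial unmaps and `sbrk` shrinking. The interception itself is not shown: it would typically be done through LD_PRELOAD, with each wrapper forwarding to the real function looked up via `dlsym(RTLD_NEXT, ...)`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical thread-safe record of live mappings, keyed by start address.
class SegmentRegistry {
 public:
  // Called by an mmap wrapper after the real mmap succeeds.
  void add(std::uintptr_t start, std::size_t length) {
    std::lock_guard<std::mutex> lock(mutex_);
    segments_[start] = length;
  }

  // Called by a munmap wrapper. Simplification: only whole-segment unmaps
  // are handled; returns false if `start` is not a known segment.
  bool remove(std::uintptr_t start) {
    std::lock_guard<std::mutex> lock(mutex_);
    return segments_.erase(start) > 0;
  }

  // Copy of the current state, so the placement code can iterate and call
  // mbind without holding the lock.
  std::map<std::uintptr_t, std::size_t> snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return segments_;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::uintptr_t, std::size_t> segments_;
};
```

Working on a snapshot trades freshness for simplicity. To rule out the read-free-mbind race described above, the placement code would instead hold the lock while it calls `mbind`, so that wrappers block until placement finishes.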

Licensing?

What license to include in the repository?
What copyright notice to include in the source files?

Get rid of /config/weights*.txt files

Actually, they can stay -- but the application should be able to effortlessly generate them.

`mbind` done transparently/asynchronously

There should be a background thread that does things automatically (sleep, measure, decide new weights, apply mbind).

The thread should be bound to the same core in the same NUMA node.
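
A rough sketch of such a background worker, using only the C++ standard library. `BackgroundTuner` is a hypothetical name; the `step` callback stands in for one measure, decide and apply round, and pinning the thread to a core (for example with `pthread_setaffinity_np` on Linux) is left out.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical background worker: repeatedly runs `step`, sleeping `period`
// between rounds, until stop() is called or the object is destroyed.
class BackgroundTuner {
 public:
  BackgroundTuner(std::function<void()> step, std::chrono::milliseconds period)
      : step_(std::move(step)), period_(period), worker_([this] { run(); }) {}

  ~BackgroundTuner() { stop(); }

  BackgroundTuner(const BackgroundTuner&) = delete;
  BackgroundTuner& operator=(const BackgroundTuner&) = delete;

  // Ask the worker to finish its current round and wait for it to exit.
  void stop() {
    running_ = false;
    if (worker_.joinable()) worker_.join();
  }

 private:
  void run() {
    while (running_) {
      step_();  // measure, decide new weights, apply mbind
      std::this_thread::sleep_for(period_);
    }
  }

  std::function<void()> step_;
  std::chrono::milliseconds period_;
  std::atomic<bool> running_{true};
  std::thread worker_;  // declared last so it starts after the other members
};
```

Because the worker sleeps for a full period between rounds, `stop()` can block for up to one period; a condition variable would make shutdown immediate if that matters.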
