Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide pcre2_match_data_create_with_frames() to avoid start stack frames. #134

Closed
wants to merge 1 commit into from

Conversation

ylavic
Copy link

@ylavic ylavic commented Jul 10, 2022

pcre2_match() uses START_FRAMES_SIZE (20KiB) of stack space for the start
(initial) frames, which may be too large for systems with small per-thread
stacks (e.g. 128KiB by default on Alpine Linux).

Provide pcre2_match_data_create_with_frames() to allow for the user to
create match_data containing context/heap allocated start frames, which
will be used by pcre2_match() if available instead of the stack frames.

  • src/pcre2.h.generic, src/pcre2.h.in:
    Declare pcre2_match_data_create_with_frames().

  • src/pcre2_intmodedep.h:
    Add a pointer to start_frames to struct pcre2_real_match_data, will be
    NULL if stack frames are to be used.

  • src/pcre2_match_data.c:
    Provide a common match_data_create() helper to be called by all the
    pcre2_match_data_create*() functions without duplicating code.
    When called from pcre2_match_data_create_with_frames(), match_data_create()
    will allocate the start frames in (at the end of) the ovector and point
    match_data->start_frames to that.

  • src/pcre2_match.c:
    Split pcre2_match() in two match_start() and match_start_on_stack()
    helpers, the former doing the usual work given some start frames, the
    latter calling the former with start frames reserved on the stack.
    pcre2_match can then call the one or the other depending on whether
    match_data->start_frames exist or not, playing some noinline/indirection
    games to do the stack allocation from match_start_on_stack() only.

…ames.

pcre2_match() uses START_FRAMES_SIZE (20KiB) of stack space for the start
(initial) frames, which may be too large for systems with small per-thread
stacks (e.g. 128KiB by default on Alpine Linux).

Provide pcre2_match_data_create_with_frames() to allow for the user to
create match_data containing context/heap allocated start frames, which
will be used by pcre2_match() if available instead of the stack frames.

* src/pcre2.h.generic, src/pcre2.h.in:
  Declare pcre2_match_data_create_with_frames().

* src/pcre2_intmodedep.h:
  Add a pointer to start_frames to struct pcre2_real_match_data, will be
  NULL if stack frames are to be used.

* src/pcre2_match_data.c:
  Provide a common match_data_create() helper to be called by all the
  pcre2_match_data_create*() functions without duplicating code.
  When called from pcre2_match_data_create_with_frames(), match_data_create()
  will allocate the start frames in (at the end of) the ovector and point
  match_data->start_frames to that.

* src/pcre2_match.c:
  Split pcre2_match() in two match_start() and match_start_on_stack()
  helpers, the former doing the usual work given some start frames, the
  latter calling the former with start frames reserved on the stack.
  pcre2_match can then call the one or the other depending on whether
  match_data->start_frames exist or not, playing some noinline/indirection
  games to do the stack allocation from match_start_on_stack() only.
@ylavic
Copy link
Author

ylavic commented Jul 10, 2022

Supposedly match_data can be reused for multiple pcre2_match() calls (e.g. some thread_local pointer), so using pcre2_match_data_create_with_frames() thus allocating >20KiB of data does not necessarily happen before each pcre2_match().

@chemodax
Copy link

I would suggest another solution: make START_FRAMES_SIZE configurable at compile time. In this case it can be tweaked on platforms with small stack size.

@ylavic
Copy link
Author

ylavic commented Jul 12, 2022

That would make it a distro setting for most users/projects (and distros are likely to keep the current START_FRAMES_SIZE value anyway), so pcre2_match() is still going to consume 20K of stack for most users.
The proposed changes allow projects using pcre2 and that don't need/want this stack space to opt-out, with any libpcre2 (future?) version.

@PhilipHazel
Copy link
Collaborator

I too had thought about making START_FRAMES_SIZE configurable. It would be a much simpler patch, which I could easily do. Would than meet your need? Oh, I see you have just commented as I write. Of course, BOTH strategies could be implemented; they are not in conflict.

@ylavic
Copy link
Author

ylavic commented Jul 12, 2022

Note that the changes to src/pcre2test.c to make it use pcre2_match_data_create_with_frames() are possibly not needed/suitable, I used this for testing with make check but it might or not be the suitable default for the pcre2test binary..

@ylavic
Copy link
Author

ylavic commented Jul 12, 2022

Oh, I see you have just commented as I write. Of course, BOTH strategies could be implemented; they are not in conflict.

Thanks for considering it. Yeah the main issue with compile time setting is that I (as an OOS developper) can do little about it. For instance the Apache httpd server is going to be built and run against the distro libs most of the time, reusing match_data there with no stack overhead/overflow would be great.

@PhilipHazel
Copy link
Collaborator

The idea of using an initial stack-frame stack on the system stack was to avoid the need for a memory alloc/free in simple cases where not many frames are needed, and in these days of gigabyte memories, 20K didn't seem very much. By adding the frames onto the match_data block your code avoids adding another alloc/free, at the expense of complicating the calling arrangements with multiple call levels - though modern compilers should be able to optimize tail calls. It has just occurred to me that it might be possible to tweak your code so that allocation on the stack is avoided when JIT is going to be called, though I'm not sure how easy that will be. I'm on another project at present, but will look at all of this in detail in due course.

@PhilipHazel
Copy link
Collaborator

PhilipHazel commented Jul 15, 2022

Please see below: this comment is now obsolete, but I'm leaving it for the record.

I've started to look at your code. I don't (yet?) understand the fancy games with the inlining options. Can it not be done more simply like this:

  • If pcre2_match() sees that the startframe pointer in the match data is NULL, it calls pcre2_start_use_stack(). This function allocates the frames on the stack, then calls pcre2_match_start(), passing the stack frames.

  • If pcre2_match() finds the startframe pointer not NULL, it calls pcre2_match_start() itself, passing the frames that are on the heap.

The test could be done after testing for JIT, thereby avoiding stack usage when JIT is called.

Am I missing something?

As for pcre2test, there will have to be some way of testing this code, but I haven't looked in detail yet. No doubt some new option will be needed.

@PhilipHazel
Copy link
Collaborator

OK, I now understand about the inline stuff. I have been pondering how to solve your problem in a simpler manner. One idea is to abandon the use of the on-stack initial frames and always use the heap. This would mean an extra malloc/free cost for every call of pcre2_match(). I come from an era where allocating and freeing memory was expensive, but I've just done some Googling, and it seems that in modern operating systems this is no longer the case. I will do some timing tests to see if there is a detectable effect for pcre2_match.

A slightly more complicated solution is not to modify the match_data block, but instead to add an option to pcre2_match(). This would add the additional malloc/free only to the times the option is set, but involves all the complication of indirect function calls, though it saves having to change match_data_create() with its associated pcre2test complications.

@chemodax
Copy link

@PhilipHazel On modern OS heap alloc is really fast if malloc and free performed from the same thread: heap allocate have some kind of internal per-thread or per-processor free memblock list.

@PhilipHazel
Copy link
Collaborator

However, a quick test on my Arch Linux system, using pcre2test's crude timing feature, shows a consistent 20% increase in CPU time when every call to pcre2_match() does a malloc/free. This was using the testdata/testinput1 sequence of tests, after a quick hack to disable the use of the stack-based backtracking frames (there's already code to do this if the stack vector is too small for at least 10 frames - I just made it unconditional). More thought/experiment needed (but not today).

@chemodax
Copy link

@PhilipHazel Could you please share patch you are testing -- I'm interested to test it on Windows.

@zherczeg
Copy link
Collaborator

This topic is strange for me. Modern OS-es use page based allocation, and only allocate memory pages on memory write. If you allocate 1M data with malloc, but never writes anything to it, no memory is allocated by the OS for that memory region. Same with stack. If your max stack is 100 MByte, but your application uses 2K only, your OS allocates a single 4K page for it. I don'tget the 128K limit at all, maybe it is a misconfiguration?

@PhilipHazel
Copy link
Collaborator

I realized that my simple patch was in fact too simple; I will try a different one later today and report back.

@ltrzesniewski
Copy link
Contributor

ltrzesniewski commented Jul 17, 2022

I've noticed a quite large perf improvement on Windows after I removed a malloc/free pair I was doing for each match. I don't have the numbers anymore though, but I remember the difference was significant.

@PhilipHazel
Copy link
Collaborator

PhilipHazel commented Jul 17, 2022

Here is the patch I have just tested: (For anyone who read an earlier version of this comment, I finally figured out how to quote code without GitHub mis-formatting it.)

--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -6827,14 +6827,21 @@ large. Ensure that there are at least 10 available frames by getting an initial
 vector on the heap if necessary, except when the heap limit prevents this. Get
 fewer if possible. (The heap limit is in kibibytes.) */
 
+#ifdef CUTOUT
 if (frame_size <= START_FRAMES_SIZE/10)
   {
   mb->match_frames = mb->stack_frames;   /* Initial frame vector on the stack */
   mb->frame_vector_size = ((START_FRAMES_SIZE/frame_size) * frame_size);
   }
 else
+#endif
+
   {
   mb->frame_vector_size = frame_size * 10;
+  
+  if (mb->frame_vector_size < START_FRAMES_SIZE)
+    mb->frame_vector_size = ((START_FRAMES_SIZE/frame_size) * frame_size); 
+ 
   if ((mb->frame_vector_size / 1024) > mb->heap_limit)
     {
     if (frame_size > mb->heap_limit * 1024) return PCRE2_ERROR_HEAPLIMIT;

Running pcre2test on testdata/testinput1 with this patch applied does 3787 allocs/frees (according to valgrind) whereas without it there are 1346. However, using pcre2test's relatively crude timing options showed a slowdown of only around 2%. This is on Arch Linux. On some of the shorter test files the difference was much greater, but this might be rounding type errors. I'm not sure quite what to make of this -- I await the information from Windows tests with interest.

@chemodax
Copy link

chemodax commented Jul 18, 2022

On Windows:
Baseline:

pcre2test.exe -8 -TM testinput1
Total match time 7.3721 milliseconds

With patch:

pcre2test.exe -8 -TM testinput1
Total match time 7.5713 milliseconds

I.e. with patch it takes 2% more time.

@PhilipHazel
Copy link
Collaborator

Either I'm not thinking straight, or your figures are a bit off: Surely a difference of 0.2 in 7.37 is around 2%, not 20%? (2.7%, I think). This is similar to what I see on Linux. What do you see for tests 4 and 5?

@ylavic
Copy link
Author

ylavic commented Jul 18, 2022

Replacing the current large on-stack initial buffer with a full malloc/free implementation is always going to be costly IMO, even if a microbenchmark shows ~2% loss only it might be worse for a real application with more concurrency/pressure on the system allocator.
In Apache httpd for instance, a custom pcre2_general_context (with custom malloc/free functions) is used to redirect to a per-thread memory pool/cache that efficiently reuses memory chunks between the calls to pcre2_match(). So malloc() is going to be called a few times (for the first/largest regexes run by the thread), and free() almost never (unless the total exceeds some configurable limit per cache/pool).

With this kind of caching mechanism there is negligible overhead in switching to heap memory allocation, but since it's not something that most PCRE users have (supposedly), keeping the current stack reserve by default sounds wise.
That's why I proposed an opt-in with pcre2_match_data_create_with_frames(), but a new option to pcre2_match() could work too (still this makes sense only if stack_frames_vector is reserved in match_start_on_stack() only, when the option is not set).

Regarding JIT, it seems that pcre2_jit_match() has its own jit_stack mechanism, using some context's jit_callback_data instead of the stack_frames_vector, so the stack_frames_vector is not needed/used there (nor would be the match_data->start_frames from this PR then), right?
If that's the case then a new pcre2_match() option looks more flexible indeed, and match_start could skip the allocation until after (and if not) use_jit.

@zherczeg
Copy link
Collaborator

Hm, allocating a stack space by both the interpreter and JIT could be a waste, that is a good point. Currently JIT allocates a 32 KB stack space when the custom JIT stack data is not provided (using a non-inlined match with stack function call). However, if the interpreter already allocates that space, JIT could also use it.

It could be interesting to use the JIT stack by the interpreter as well. It allocates virtual memory, but it requires OS support, so it does not work on some exotic environments, Probably not worth it.

@PhilipHazel
Copy link
Collaborator

I have thought of a new Cunning Plan which I will try out, possibly later today. It is as follows: Add a pointer and a length to the match_data structure, but initialize the pointer as NULL. Then pcre2_match(), when it sees NULL, calls malloc. The length is needed because pcre2_match() may extend the block if it needs to. But do NOT free. Then, if the caller uses the same match_data block for multiple pcre2_match() calls, which I suspect many programs do, the block is re-used. It is eventually freed when pcre2_match_data_free() is called. The malloc() will happen only when JIT is not being called. This has the advantage of being simple, needing no action on the part of callers, and no need to explain these internals in the documentation.

@ylavic
Copy link
Author

ylavic commented Jul 19, 2022

This sounds good to me, possibly you would add a pcre2_match_data_block_size() function too that returns the current size of the match_data block, such that one reusing the same match_data can free it at some point should/when a "pathological" regex allocate too much memory to keep? Though one can track this already with a custom memctl.malloc() function.

Ideally a new pcre2_match_data_create_with_options() would let users specify whether they want initial stack memory (today's behaviour), or an initial heap block of START_FRAMES_SIZE with the lifetime of the match_data (if pcre2_match() allocates more memory it would be freed after the match, keeping the initial block only), or an adaptative block size preserved accross the pcre2_match() calls as you propose above. But I understand why you want to keep this internal and go with the latter only.

@ltrzesniewski
Copy link
Contributor

ltrzesniewski commented Jul 19, 2022

if the caller uses the same match_data block for multiple pcre2_match() calls, which I suspect many programs do

While that may be true for C programs, I suppose some libraries or bindings to other languages can't afford that luxury, since they don't know what their users do.

For instance, I create a match data block per match in my C# binding library (in its idiomatic API, I expose a more optimized API as well, but it imposes additional constraints on the user). That's because I need to ensure correct behavior whatever the user decides to do.

no need to explain these internals in the documentation

Due to the above, I think that documenting this behavior could be worthwhile for libraries built on top of PCRE2.

@PhilipHazel
Copy link
Collaborator

I have done the refactoring as I described above; the results are in a branch called "heapframe". Somewhat to my astonishment, the testinput1 test now runs 40% FASTER. But when I thought about it, I realized that running pcre2test on that file isn't really a representative test. Some earlier tests probably extend the frame vector to a large size, meaning that later tests, which would in the past have had to call malloc() won't need to do so.

If you a stake in this issue, please test the heapframe branch. There are a couple of wrinkles: (1) When setting up or extending the heap frame vector, the memory functions that were used for getting the match_data block are used. In the past, when the stack vector was too small, the regex's memory functions were used. I don't think this is a big issue. (2) The limit on heap usage, set by function or (*LIMIT_HEAP) in a pattern, is applied only when a call to malloc() happens. Thus, for example, if pcre2_match() has already been called, and therefore set up a vector, using the same match_data with a heap limit of 0 will not automatically fail. Again, I don't think this is a showstopper.

It would be easy to create a function called pcre2_match_data_free_frames() if that is thought to be helpful. Also one to get the current size of the frames vector. Indeed, I guess it would be possible to have a function to add a frames vector to a newly-created match_data block. This would allow an initial vector smaller than the current 20KiB default size.

I have not yet done any documenting, because I await feedback.

@ylavic
Copy link
Author

ylavic commented Jul 20, 2022

Looks good (not tested yet though, will do later today or tomorrow).

One improvement possibly, to avoid reducing match_data->heapframes_size in place when it's reused for a smaller frame_size, something like the below patch?

diff --git a/src/pcre2_match.c b/src/pcre2_match.c
index cb876ec..f0bf97c 100644
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -598,6 +598,7 @@ heapframe *P = NULL;
 heapframe *frames_top;  /* End of frames vector */
 heapframe *assert_accept_frame = NULL;  /* For passing back a frame with captures */
 PCRE2_SIZE frame_copy_size;   /* Amount to copy when creating a new frame */
+PCRE2_SIZE heapframes_size;
 
 /* Local variables that do not need to be preserved over calls to RRMATCH(). */
 
@@ -634,10 +635,16 @@ copied when a new frame is created. */
 
 frame_copy_size = frame_size - offsetof(heapframe, eptr);
 
+/* Use a heapframes_size rounded down to a multiple of the current frame size,
+as match_data->heapframes_size may well have been allocated for a pattern with
+a different frame size. */
+
+heapframes_size = (match_data->heapframes_size/frame_size) * frame_size;
+
 /* Set up the first frame and the end of the frames vector. */
 
 F = match_data->heapframes;
-frames_top = (heapframe *)((char *)F + match_data->heapframes_size);
+frames_top = (heapframe *)((char *)F + heapframes_size);
 
 Frdepth = 0;                        /* "Recursion" depth */
 Fcapture_last = 0;                  /* Number of most recent capture */
@@ -663,6 +670,7 @@ if (N >= frames_top)
   heapframe *new;
   PCRE2_SIZE newsize = match_data->heapframes_size * 2;
 
+  if (newsize <= match_data->heapframes_size) return PCRE2_ERROR_HEAPLIMIT;
   if ((newsize / 1024) > mb->heap_limit)
     {
     PCRE2_SIZE maxsize = ((mb->heap_limit * 1024)/frame_size) * frame_size;
@@ -672,7 +680,7 @@ if (N >= frames_top)
 
   new = match_data->memctl.malloc(newsize, match_data->memctl.memory_data);
   if (new == NULL) return PCRE2_ERROR_NOMEMORY;
-  memcpy(new, match_data->heapframes, match_data->heapframes_size);
+  memcpy(new, match_data->heapframes, heapframes_size);
 
   F = (heapframe *)((char *)new + ((char *)F - (char *)match_data->heapframes));
   N = (heapframe *)((char *)F + frame_size);
@@ -680,7 +688,9 @@ if (N >= frames_top)
   match_data->memctl.free(match_data->heapframes, match_data->memctl.memory_data);
   match_data->heapframes = new;
   match_data->heapframes_size = newsize;
-  frames_top = (heapframe *)((char *)new + newsize);
+
+  heapframes_size = (newsize/frame_size) * frame_size;
+  frames_top = (heapframe *)((char *)new + heapframes_size);
   }
 
 #ifdef DEBUG_SHOW_RMATCH
@@ -6831,26 +6841,25 @@ if (heapframes_size < START_FRAMES_SIZE)
 
 if ((heapframes_size / 1024) > mb->heap_limit)
   {
-  if (frame_size > mb->heap_limit * 1024) return PCRE2_ERROR_HEAPLIMIT;
-  heapframes_size = ((mb->heap_limit * 1024)/frame_size) * frame_size;
+  PCRE2_SIZE maxsize = ((mb->heap_limit * 1024)/frame_size) * frame_size;
+  if (frame_size > maxsize) return PCRE2_ERROR_HEAPLIMIT;
+  heapframes_size = maxsize;
   }
 
 /* If an existing frame vector in the match_data block is large enough, use it,
-but round its size down to a multiple of the current frame size, as it may well
-have been allocated for a pattern with a different frame size. */
-
-if (match_data->heapframes_size >= heapframes_size)
-  match_data->heapframes_size = (match_data->heapframes_size/frame_size) * frame_size;
+otherwise free any pre-existing vector and get a new one. */
 
-/* Otherwise, free any pre-existing vector and get a new one. */
-
-else
+if (match_data->heapframes_size < heapframes_size)
   {
   match_data->memctl.free(match_data->heapframes,
     match_data->memctl.memory_data);
   match_data->heapframes = match_data->memctl.malloc(heapframes_size,
     match_data->memctl.memory_data);
-  if (match_data->heapframes == NULL) return PCRE2_ERROR_NOMEMORY;
+  if (match_data->heapframes == NULL)
+    {
+    match_data->heapframes_size = 0;
+    return PCRE2_ERROR_NOMEMORY;
+    }
   match_data->heapframes_size = heapframes_size;
   }

(EDIT1: last hunk that resets match_data->heapframes_size in case of allocation failure)
(EDIT2: check for PCRE2_SIZE overflow when doubling heapframes_size => PCRE2_ERROR_HEAPLIMIT)
(EDIT3: remove spurious change)

@PhilipHazel
Copy link
Collaborator

I had thought about the issue of malloc size versus usable size and concluded that another field in the match data is needed. Perhaps the value that is currently in the frames_top variable.

@ylavic
Copy link
Author

ylavic commented Jul 20, 2022

It seems that the only place where heapframes_size should be a multiple of frame_size is in the match() function, so a local heapframes_size variable there can do (like in the proposed patch)?

@PhilipHazel
Copy link
Collaborator

Yes, I realized that too. I'm working on it.

@PhilipHazel
Copy link
Collaborator

PhilipHazel commented Jul 20, 2022

I've pushed a patch to the heapframe branch, which also does a minor code tidy to the heap limit - do the *1024 once only.

Do you think that functions pcre2_match_data_{add,free}_frames() would be useful?

EDIT: I've had to do two more commits/pushes as I overlooked a couple of things. I blame the hot weather we have here.

@chemodax
Copy link

Either I'm not thinking straight, or your figures are a bit off: Surely a difference of 0.2 in 7.37 is around 2%, not 20%? (2.7%, I think). This is similar to what I see on Linux. What do you see for tests 4 and 5?

Yes, you are right: it's 2%, not 20%.

@ylavic
Copy link
Author

ylavic commented Jul 23, 2022

(sorry for the delay)

I've pushed a patch to the heapframe branch, which also does a minor code tidy to the heap limit - do the *1024 once only.

My testing/debugging worked perfectly, +1 from me (FWIW).

Do you think that functions pcre2_match_data_{add,free}_frames() would be useful?

I think that it would be a good API (possibly with pcre2_match_data_frames_size()) for maintaining the match_data with the default memctl (system's malloc/free).
Honestly I won't use them for now because a custom memctl better fits my needs so far.

EDIT: If pcre2_match_data_add_frames() is used (e.g. before calling pcre2_match()), could it avoid the current minimum of START_FRAMES_SIZE (~50 * frame_size)? The code could still enforce the minimum of 10 * frame_size (e.g. ~3K for 10 captures on 64bit system), but if add_frames() made enough room for that already maybe the match could start from there..

Anyway thanks a lot for the changes and your time, great improvements IMO.

@PhilipHazel
Copy link
Collaborator

Thank you for testing. I have realized that work is needed in pcre2test to get the code that finds a minimum heap limit working with the new arrangements. Once I have done that, and updated the documentation, I will merge this branch.

I have also realized that the pcre2_dfa_match() function also uses the same stack trick and could be updated in the same way. However, I do not propose to spend time doing that work unless somebody asks for it.

@PhilipHazel
Copy link
Collaborator

The heapframe branch is now merged into master, so I am going to close this pull request.

@ylavic ylavic deleted the match_data_with_frames branch July 29, 2022 10:56
@chemodax
Copy link

chemodax commented Aug 3, 2022

@PhilipHazel As far I understand stack_frames optimization is completely removed. Is it correct? I think this will introduce performance regression for platforms with enough stack, because of extensive malloc()/free() usage for match data.

Is it possible to bring back stack_frames optimization with configurable size and smaller default? I think it will be safe for most users, but still allow use match_data without malloc() in most cases.

What do you think?

@PhilipHazel
Copy link
Collaborator

Yes, the stack frames usage has gone, but there is a new optimization: the heap block is remembered in the match_data block. Note that this change will not affect JIT usage in any way, except that an unused block is no longer allocated on the stack. I would hope that applications that really, really care about performance would use JIT.

For non-JIT uses of pcre2_match() there will be winners and losers. If a caller uses the same match_data block for multiple matches, there will either be no noticeable difference (one extra malloc/free) if the block never has to be extended, or the code will run faster if lots of the matches need more than the minimal amount, because the extended block is remembered. The pcre2test program, when run on the testinput1 file, runs substantially faster because of this. The losers are programs that use a new match_data block for each match. Of course, the importance depends on how much matching such programs do and how much performance matters to them. There is always the possibility of custom memory allocation.

I do not like the idea of re-introducing the stack usage. As was pointed out before, many applications want to use whatever the system installs, so making it a build-time option doesn't help them, and making it a run-time option is messy and involves compiler-specific hacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants