Fix word break when the first character of token is multibyte #753

and3md · 2025-04-11T08:06:59Z

Hello, This is my first PR for RmlUi so I wanted to say thanks for a great library :)

I use RmlUi to display "HTML-like" documents and I noticed instability in some documents with multi byte chars. In release mode there was an infinite loop and in debug a message appeared:

The error occurs when word break is enabled and the first character in the processed token is a multi-byte character, e.g. Polish characters such as "ś" or "ć".
In this case, the character returned for token_begin and token_begin + i by StringUtilities::SeekBackwardUTF8() may be the same char. I fixed that by stopping iteration when partial_string_end == token_begin. I don't know this is the best solution but it works (no inifinite loop/assertion fail) and text is appropriately divided into lines.

Example document that shows the bug

It is quite easy to cause the issue by creating a table filled with the characters ść :

Screen from rmlui fixed by this PR:

Simplified rml doc:

<!DOCTYPE html>
<rml lang="pl">
	<head>
		<meta charset="UTF-8"/>
		<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
		<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
		<title>Instalacje sanitarne i sieci zewnętrzne</title>
		<style>

.txt-centered {
		text-align: center;
		vertical-align: middle;
}

body {
    display: block;
    font-family: Roboto;
    font-size: 14dp;
    text-align: left;
    color:rgb(246, 252, 255);
    background:rgb(46, 48, 49);
    position: relative;
    top: 0dp;
    left: 0dp;
    overflow: scroll;
    height: 100vh;
}

div {
    display: block;
    position: relative;
}

/*
    Default styles for all the basic elements.
*/

em
{
    font-style: italic;
}

p {
    display: block;
    padding: 2dp;
{

strong
{
    font-weight: bold;
}

select
{
    text-align: left;
}

tabset tabs
{
    display: block;
}

table
{
    box-sizing: border-box;
    display: table;
    border: 0px rgb(113, 121, 121);
    border-collapse: collapse;
    border-width: 0px 0px 1dp 1dp;
}
tr
{
    box-sizing: border-box;
    display: table-row;
}
td
{
    box-sizing: border-box;
    display: table-cell;
    border: 0px rgb(113, 121, 121);
    border-width: 1dp 1dp 0 0;
    padding: 2dp;
    min-width: 10dp;
    word-break: break-word;
}
col
{
    box-sizing: border-box;
    display: table-column;
    min-width: 10dp;
}
colgroup
{
    display: table-column-group;
}
thead, tbody, tfoot
{
    display: table-row-group;
}


</style>
	</head>
	<body>
		<div id="content">
			<h5 id="IT_10_05">Tablica 0001</h5>
			<table>
				<tr>
					<td class="txt-centered">Lp.</td>
					<td class="txt-centered">title1</td>
					<td colspan="15" class="txt-centered">title2</td>
				</tr>
				<tr>
					<td class="txt-centered">01</td>
					<td class="txt-centered">02</td>
					<td class="txt-centered">03</td>
					<td class="txt-centered">04</td>
					<td class="txt-centered">05</td>
					<td class="txt-centered">06</td>
					<td class="txt-centered">07</td>
					<td class="txt-centered">08</td>
					<td class="txt-centered">09</td>
					<td class="txt-centered">10</td>
					<td class="txt-centered">11</td>
					<td class="txt-centered">12</td>
					<td class="txt-centered">13</td>
					<td class="txt-centered">14</td>
					<td class="txt-centered">15</td>
					<td class="txt-centered">16</td>
					<td class="txt-centered">17</td>
				</tr>
				<tr>
					<td class="txt-centered">1</td>
					<td>śśśśćśśśśśććść</td>
					<td class="txt-centered">28</td>
					<td class="txt-centered">32</td>
					<td class="txt-centered">36</td>
					<td class="txt-centered">41</td>
					<td class="txt-centered">48</td>
					<td class="txt-centered">55</td>
					<td class="txt-centered">64</td>
					<td class="txt-centered">78</td>
					<td class="txt-centered">91</td>
					<td class="txt-centered">111</td>
					<td class="txt-centered">129</td>
					<td class="txt-centered">160</td>
					<td class="txt-centered">198</td>
					<td class="txt-centered">242</td>
					<td class="txt-centered">295</td>
				</tr>
			</table>
			<p/>
		</div>
	</body>
</rml>

mikke89 · 2025-04-12T22:31:37Z

Hey, first of all, thanks a lot for taking the time to make a pull request. And thank you for the kind words :)

It was easy for me to reproduce the issue you are describing. I think you are definitely onto something here with the suggested fix. I want to tinker a bit with it first, the +2 in particular seems a bit suspicious to me. Edit: I realize this follows the existing logic further down, I think this can be simplified quite a bit though.

mikke89 · 2025-04-12T23:56:36Z

I looked a bit more into this one. I figured we could make some improvements to simplify that loop. So I pushed some changes directly to this PR here. The loop had some issues with character boundaries (originally from before your changes), and it should be a bit faster now too with multi-byte characters.

What do you think, does it look correct to you? It would be great if you could test it with your documents.

and3md · 2025-04-14T08:38:43Z

What do you think, does it look correct to you? It would be great if you could test it with your documents.

I'd love to check it out, but I need some time as I won't be able to do it until Wednesday.

and3md · 2025-04-16T12:04:09Z

I've tested your changes on multiple (about 200) documents and they work great. Thank you!

mikke89 · 2025-04-16T14:46:29Z

Great, thanks for following up and testing this. I merged it just now, thanks again for the pull request!

Fix word break when the first character of token is multibyte

097c063

mikke89 added the bug Something isn't working label Apr 12, 2025

Rewrite word break loop, always seek to the next character

e20fdc4

mikke89 merged commit ba9775e into mikke89:master Apr 16, 2025
32 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix word break when the first character of token is multibyte #753

Fix word break when the first character of token is multibyte #753

and3md commented Apr 11, 2025

mikke89 commented Apr 12, 2025 •

edited

Loading

mikke89 commented Apr 12, 2025

and3md commented Apr 14, 2025

and3md commented Apr 16, 2025

mikke89 commented Apr 16, 2025

Fix word break when the first character of token is multibyte #753

Fix word break when the first character of token is multibyte #753

Conversation

and3md commented Apr 11, 2025

Example document that shows the bug

mikke89 commented Apr 12, 2025 • edited Loading

mikke89 commented Apr 12, 2025

and3md commented Apr 14, 2025

and3md commented Apr 16, 2025

mikke89 commented Apr 16, 2025

mikke89 commented Apr 12, 2025 •

edited

Loading