When recognized entities partially overlap, Presidio passes an incorrect text argument to the Operator.operate(text, params) function.
In EngineBase._operate(), the text gets replaced dynamically as it is being anonymized. The text_to_operate_on that is passed to Operator.operate(text=text_to_operate_on) is taken from this partially anonymized text instead of the original one.
This can cause a problem, because in certain cases the text no longer corresponds to the originally found entity. The problem is not noticeable when operators don't use the original text (e.g. replacing by entity type) or when the found entities are far enough from each other.
Example
Let's say we have a custom operator that replaces an entity with <ENTITY_TYPE: original_text>.
There are two entities, credit card number and URL, that partially overlap.
text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."
expected_result = "Fake card number <CREDIT_CARD: 4151 3217 6243 3448><URL: 3448.com> that overlaps with nonexisting URL."
actual_result = "Fake card number <CREDIT_CARD: 4151 3217 6243 <URL><URL: 3448.com> that overlaps with nonexisting URL."
The anonymization is done backwards, so the URL is anonymized correctly. But when anonymizing the credit card, instead of the original text 4151 3217 6243 3448, the operator gets the partially replaced text 4151 3217 6243 <URL.
While this is just an artificial example, it could cause real issues, for example when hashing the original values: the hashes for the same entities would be different.
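The behavior described above can be reproduced without Presidio. The following is a minimal sketch, not Presidio's actual code: `anonymize_reading_from_output` and `tag_operator` are hypothetical stand-ins for `EngineBase._operate()` and the custom operator, and the entity offsets are computed by hand instead of coming from the analyzer. It assumes only what the issue states: entities are processed from the end of the text backwards, and each operator input is sliced from the partially anonymized text.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int]  # (entity_type, start, end), offsets into the ORIGINAL text

text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."
# Hand-computed offsets standing in for analyzer results (assumption).
entities: List[Entity] = [("CREDIT_CARD", 17, 36), ("URL", 32, 40)]

seen_by_operator: List[str] = []  # records what the operator actually receives


def tag_operator(entity_type: str, span_text: str) -> str:
    """Hypothetical custom operator: <ENTITY_TYPE: original_text>."""
    seen_by_operator.append(span_text)
    return f"<{entity_type}: {span_text}>"


def anonymize_reading_from_output(text: str, entities: List[Entity]) -> str:
    """Replace entities back to front, slicing operator input from the evolving output."""
    output = text
    last_replacement_index = len(text)
    for entity_type, start, end in sorted(entities, key=lambda e: e[2], reverse=True):
        # Source of the problem: the slice comes from the partially anonymized text.
        text_to_operate_on = output[start:end]
        replacement = tag_operator(entity_type, text_to_operate_on)
        # Do not overwrite a replacement that was already inserted further right.
        end_index = min(end, last_replacement_index)
        output = output[:start] + replacement + output[end_index:]
        last_replacement_index = start
    return output


result = anonymize_reading_from_output(text, entities)
print(result)
print(seen_by_operator)
```

Running the sketch yields the same string as actual_result, and the second value recorded in `seen_by_operator` is the truncated `4151 3217 6243 <URL` rather than the full card number.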
Proposed solutions
Instead of taking text_to_operate_on from the text_replace_builder, take it from the original text.
The original text could be an attribute of PIIEntity (handled during analysis), and it would be used as text_to_operate_on.
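A rough sketch of the second proposal, under stated assumptions: `SpanWithText` is a hypothetical stand-in for a PIIEntity that carries an `original_text` attribute, and `capture_spans` plays the role of the analysis step that fills it in. Neither name exists in Presidio; the point is only that the operator input is fixed before any replacement happens.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanWithText:
    """Hypothetical entity that remembers its original text (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    original_text: str


def capture_spans(text: str, raw_spans: List[Tuple[str, int, int]]) -> List[SpanWithText]:
    """Slice every span from the untouched text, as the analysis step could do."""
    return [SpanWithText(t, s, e, text[s:e]) for t, s, e in raw_spans]


def anonymize_with_original_text(text: str, spans: List[SpanWithText]) -> str:
    """Replace spans back to front, feeding the operator the stored original text."""
    output = text
    last_replacement_index = len(text)
    for span in sorted(spans, key=lambda s: s.end, reverse=True):
        # The operator input no longer depends on earlier replacements.
        replacement = f"<{span.entity_type}: {span.original_text}>"
        end_index = min(span.end, last_replacement_index)
        output = output[:span.start] + replacement + output[end_index:]
        last_replacement_index = span.start
    return output


text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."
spans = capture_spans(text, [("CREDIT_CARD", 17, 36), ("URL", 32, 40)])
fixed_result = anonymize_with_original_text(text, spans)
print(fixed_result)
```

With the original text stored on each span, the sketch produces the string given as expected_result. The first proposal (slicing from the original text inside the engine) would give the same output; storing the text on the entity simply moves that slice to analysis time.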
Minimal code example
Environment: Python 3.9, presidio_anonymizer 2.2.28, presidio_analyzer 2.2.28