Incorrect text passed to Operator.operate() when entities overlap #888

Closed
tomas666 opened this issue Jun 30, 2022 · 2 comments

Background

When recognized entities partially overlap, Presidio passes an incorrect text argument to the Operator.operate(text, params) function.
In EngineBase._operate() the text is replaced dynamically as it is being anonymized. The text_to_operate_on that is passed to Operator.operate(text=text_to_operate_on) is taken from this partially anonymized text instead of the original one.

text_to_operate_on = text_replace_builder.get_text_in_position(
    operator.start, operator.end
)

This is a problem because the sliced text does not always correspond to the originally found entity. It is not noticeable when operators don't use the original text (e.g. replacing by entity type) or when the found entities are far enough from each other.

Example

Let's say we have a custom operator that replaces each entity with <ENTITY_TYPE: original_text>.
There are two entities, a credit card number and a URL, that partially overlap.

text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."

expected_result = "Fake card number <CREDIT_CARD: 4151 3217 6243 3448><URL: 3448.com> that overlaps with nonexisting URL."
actual_result = "Fake card number <CREDIT_CARD: 4151 3217 6243 <URL><URL: 3448.com> that overlaps with nonexisting URL."

The anonymization is done backwards, so the URL is anonymized correctly. But when anonymizing the credit card, instead of the original text 4151 3217 6243 3448, the operator gets the partially replaced one: 4151 3217 6243 <URL.
While this is just an artificial example, it could cause real issues, for example when hashing the original values: the hashes for the same entities would be different.
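
The effect can be reproduced without Presidio. The following is a minimal sketch, not Presidio's code: the function name simulate_buffer_slicing, the tuple layout of the spans, and the clamping of the tail to the previous replacement's start are all assumptions chosen so that the simulation matches the actual_result reported above. It replaces spans back to front and slices each operand from the working buffer, then compares a SHA-256 hash of what the operator saw against the hash of the real card number.

```python
import hashlib


def simulate_buffer_slicing(text, spans):
    """Replace spans back to front, slicing each operand from the working buffer.

    Hypothetical model of the behaviour described in this issue.
    spans is a list of (start, end, entity_type) tuples.
    """
    output = text
    last_start = len(text)  # start offset of the previously replaced span
    operands = {}
    for start, end, entity_type in sorted(spans, key=lambda s: s[0], reverse=True):
        # The problematic step: the operand comes from the partially
        # replaced buffer, not from the original text.
        operand = output[start:end]
        operands[entity_type] = operand
        replacement = f"<{entity_type}: {operand}>"
        # Assumption: the tail is clamped so an earlier replacement is kept.
        tail_from = min(end, last_start)
        output = output[:start] + replacement + output[tail_from:]
        last_start = start
    return output, operands


text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."
card = "4151 3217 6243 3448"
url = "3448.com"
card_start = text.index(card)
url_start = text.index(url)
spans = [
    (card_start, card_start + len(card), "CREDIT_CARD"),
    (url_start, url_start + len(url), "URL"),
]

result, operands = simulate_buffer_slicing(text, spans)
hash_seen = hashlib.sha256(operands["CREDIT_CARD"].encode()).hexdigest()
hash_expected = hashlib.sha256(card.encode()).hexdigest()

print(result)
print(repr(operands["CREDIT_CARD"]))
print("hashes match:", hash_seen == hash_expected)
```

In this model the URL operand is intact because it is processed first, while the credit card operand ends in the first characters of the URL replacement, so the two hashes differ.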

Proposed solutions

  • Instead of taking text_to_operate_on from the text_replace_builder, take it from the original text.
  • The original text could be an attribute of PIIEntity (handled during analysis), and it would be used as text_to_operate_on.
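
Both proposals can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the fix that was merged: Span, make_span and replace_from_original are hypothetical names, and Span only stands in for a PIIEntity-like object that carries its original text from the analysis step. The replacement loop still runs back to front, but each operand is read from the span's stored original_text instead of the working buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an entity that remembers its original text."""
    start: int
    end: int
    entity_type: str
    original_text: str


def make_span(text, start, end, entity_type):
    # Capture the original text once, before any replacement happens.
    return Span(start, end, entity_type, text[start:end])


def replace_from_original(text, spans):
    """Replace spans back to front, using each span's stored original text."""
    output = text
    last_start = len(text)  # start offset of the previously replaced span
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        replacement = f"<{span.entity_type}: {span.original_text}>"
        # Assumption: clamp the tail so an earlier replacement is kept.
        tail_from = min(span.end, last_start)
        output = output[:span.start] + replacement + output[tail_from:]
        last_start = span.start
    return output


text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."
card = "4151 3217 6243 3448"
url = "3448.com"
card_start = text.index(card)
url_start = text.index(url)
spans = [
    make_span(text, card_start, card_start + len(card), "CREDIT_CARD"),
    make_span(text, url_start, url_start + len(url), "URL"),
]

fixed = replace_from_original(text, spans)
print(fixed)
```

Storing the text on the span keeps the operator input independent of replacement order, at the cost of holding a copy of each matched substring.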

Minimal code example

Python 3.9, presidio_anonymizer 2.2.28, presidio_analyzer 2.2.28

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import SpacyNlpEngine
from presidio_analyzer.predefined_recognizers import CreditCardRecognizer, UrlRecognizer

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from presidio_anonymizer.operators import Operator, OperatorType


class CustomReplaceOperator(Operator):
    def operate(self, text: str, params: dict = None) -> str:
        return f"<{params['entity_type']}: {text}>"

    def validate(self, params: dict = None) -> None:
        pass

    def operator_name(self) -> str:
        return "custom_replace_operator"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize


recognizer_registry = RecognizerRegistry()
recognizer_registry.add_recognizer(CreditCardRecognizer())
recognizer_registry.add_recognizer(UrlRecognizer())

analyzer_engine = AnalyzerEngine(
    registry=recognizer_registry,
    nlp_engine=SpacyNlpEngine(models={"en": "en_core_web_sm"}),
    default_score_threshold=0,
    supported_languages=["en"]
)

anonymizer_engine = AnonymizerEngine()

text = "Fake card number 4151 3217 6243 3448.com that overlaps with nonexisting URL."
recognizer_results = analyzer_engine.analyze(text, language="en")
anonymized_text = anonymizer_engine.anonymize(
    text, recognizer_results, operators={"DEFAULT": OperatorConfig("custom_replace_operator")}).text

print(anonymized_text)

omri374 commented Jul 1, 2022

Thanks, will look into this. Would you be interested in creating a pull request? We'd be happy to review.

shiranr commented Jul 5, 2022

I've pushed a fix for this issue. It is now on main.

shiranr closed this as completed Jul 5, 2022