[DO NOT MERGE] Temporary chrome deployment #4440

vitorguidi · 2024-11-26T11:04:00Z

No description provided.

We are currently logging progression when unpacking, but we're missing the last log. This can be useful to monitor what's going on in CF.

### Motivation

We currently lack metrics for build retrieval and unpacking times. This PR adds them, with granularity by fuzz target and job type.

There are two different implementations for build downloading/unpacking:

- In the Build class, from which RegularBuild, SplitTargetBuild, FuchsiaBuild and SymbolizedBuild inherit the downloading/unpacking behavior
- In the CustomBuild class, which implements its own logic

There are two possible cases for downloading/unpacking: ClusterFuzz either downloads the whole build and unpacks it locally, or unpacks it remotely. This applies to all build types except CustomBuild, which does not perform remote unpacking.

For build retrieval over HTTP, we do not track download time. For all the other cases, it suffices to keep track of start/finish time for download and unpacking.

Finally, a _build_type is added to the constructor of the Build class, from which all others inherit. This is used to track the build type (debug or release), and is only mutated by SymbolizedBuild when attempting to fetch a debug build.

Part of #4271
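As a rough illustration of tracking start/finish time per step, here is a minimal sketch. The `timed_step` helper, the `recorded` list and the label names are hypothetical and are not the ClusterFuzz monitoring API; a real implementation would report to the metrics backend instead of a list.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink standing in for the monitoring backend.
recorded = []


@contextmanager
def timed_step(step, job_type, fuzz_target, clock=time.monotonic):
  """Records the elapsed time of one build retrieval step."""
  start = clock()
  try:
    yield
  finally:
    recorded.append({
        'step': step,
        'job': job_type,
        'fuzz_target': fuzz_target,
        'seconds': clock() - start,
    })


with timed_step('download', 'libfuzzer_asan_qt', 'example_target'):
  pass  # The archive download would happen here.

with timed_step('unpack', 'libfuzzer_asan_qt', 'example_target'):
  pass  # The archive unpacking would happen here.
```

Using a context manager keeps the measurement in one place, so the elapsed time is recorded even if a step raises.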

For non-engine fuzzers, we do not need the full list of fuzz targets, since those builds contain only the APP_NAME. For that reason, when the build is fully unpacked, the fuzz targets are fetched lazily, only when requested by the user of the class. This change will tremendously speed up the unpacking step for some of our fuzzers; see https://docs.google.com/document/d/1OepfVcuG2XNXLxgZIXVwgE-uOhX9RO5RfH0TSg_wmPU/.

Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>
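The lazy lookup described above can be sketched as a cached property. `LazyTargetsBuild` and its `list_targets` callable are hypothetical names for illustration and do not mirror the actual Build class.

```python
class LazyTargetsBuild:
  """Hypothetical build wrapper that lists fuzz targets only on demand."""

  def __init__(self, list_targets):
    # list_targets is an expensive callable, e.g. a walk over the build dir.
    self._list_targets = list_targets
    self._fuzz_targets = None

  @property
  def fuzz_targets(self):
    # Compute once on first access, then reuse the cached result.
    if self._fuzz_targets is None:
      self._fuzz_targets = self._list_targets()
    return self._fuzz_targets
```

Callers that never read `fuzz_targets` never pay for the listing.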

### Motivation

We currently lack awareness of how old builds are during the fuzz task. This PR implements that, on the assumption that the Last Update Time metadata field in GCS is a good proxy for build age. [Documentation reference](https://cloud.google.com/storage/docs/json_api/v1/objects#resource)

### Approach

Symbolized and custom builds do not matter, thus all builds of interest will be fetched from ```build_manager.setup_regular_build```. The logic for collecting all bucket paths and the latest revision was refactored, so that ```setup_regular_build``` can also figure out the latest revision for a given build and conditionally emit the proposed metric.

### Testing strategy

!Todo: test this for fuzz, analyze, progression

Locally ran tasks, with instructions from #4343 and #4345, and verified the _emmit_build_age_metric function gets invoked and produces sane output.

Commands used:

```
fuzz libFuzzer libfuzzer_asan_log4j2
```

![image](https://github.com/user-attachments/assets/66937297-20ec-44cf-925e-0004a905c92e)

```
progression 4992158360403968 libfuzzer_asan_qt
```

![image](https://github.com/user-attachments/assets/0e1f1199-d1d8-4da5-814e-8d8409d1f806)

```
analyze 4992158360403968 libfuzzer_asan_qt
(disclaimer: build revision was overridden mid flight to force a trunk build, since this testcase was already tied to a crash revision)
```

![image](https://github.com/user-attachments/assets/dd3d5a60-36a1-4a9e-a21b-b72177ffdecd)

Part of #4271
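To make the proxy concrete, a minimal sketch of deriving an age from an object's last update time follows. The `build_age_hours` name and the choice of hours as the unit are assumptions for illustration; the PR does not state them here.

```python
import datetime


def build_age_hours(last_updated, now=None):
  """Returns hours elapsed since the build archive was last updated.

  last_updated is assumed to be a timezone-aware datetime, such as the
  `updated` field of a GCS object resource.
  """
  if now is None:
    now = datetime.datetime.now(datetime.timezone.utc)
  return (now - last_updated).total_seconds() / 3600
```

Passing `now` explicitly keeps the function deterministic in tests.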

### Motivation

Small changes for the build retrieval metric, as requested by Chrome folks after initially using the feature:

* Added the time for listing fuzz targets as a step
* Added the total retrieval duration
* Tracking the metric in minutes, for readability

Part of #4271

### Motivation

The Chrome team has no easy visibility into how many manually uploaded test cases flake or successfully reproduce. This PR implements a counter metric to track that. There are three possible outcomes, each represented by a string label: 'reproduces', 'one_timer' and 'does_not_reproduce'.

Part of #4271

…4381)

### Motivation

Once a testcase is generated (or manually uploaded), followup tasks (analyze/progression) are started. This happens by publishing to a pubsub queue, both for the manually uploaded case and for the fuzzer generated case. If for any reason the messages are not processed, the testcase gets stuck.

To get better visibility into these stuck testcases, the UNTRIAGED_TESTCASE_AGE metric is introduced, to pinpoint how old the testcases that have not yet been triaged are (more precisely, those that have not yet gone through analyze/regression/impact/progression tasks).

### Attention points

Testcase.timestamp mutates in analyze task: https://github.com/google/clusterfuzz/blob/6ed80851ad0f6f624c5b232b0460c405f0a018b5/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py#L589

This makes it unreliable as a source of truth for testcase creation time. To circumvent that, a new ```created``` field is added to the Testcase entity, from which we can derive the correct creation time. Since this new field will only apply to testcases created after this PR is merged, Testcase.timestamp will be used instead to calculate the testcase age when the new field is missing.

### Testing strategy

Ran the triage cron locally, and verified the codepath for the metric is hit and produces sane output (reference testcase: 4505741036158976).

![image](https://github.com/user-attachments/assets/6281b44f-768a-417e-8ec1-763f132c8181)

Part of #4271
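The fallback from ```created``` to ```timestamp``` can be sketched as below. The `untriaged_age_hours` helper and the use of hours are hypothetical; only the two field names come from the description above.

```python
import datetime
from types import SimpleNamespace


def untriaged_age_hours(testcase, now):
  """Returns the testcase age, preferring `created` over `timestamp`."""
  # Testcases stored before the new field existed have no `created` value,
  # so fall back to the (possibly mutated) `timestamp`.
  created = getattr(testcase, 'created', None) or testcase.timestamp
  return (now - created).total_seconds() / 3600


now = datetime.datetime(2024, 11, 26, 12, 0)
old_testcase = SimpleNamespace(
    created=None, timestamp=datetime.datetime(2024, 11, 26, 6, 0))
new_testcase = SimpleNamespace(
    created=datetime.datetime(2024, 11, 26, 0, 0),
    timestamp=datetime.datetime(2024, 11, 26, 6, 0))
```

Here `old_testcase` yields an age based on `timestamp`, while `new_testcase` uses `created` even though its `timestamp` was updated later.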

Chrome security shepherds manually upload testcases through appengine, triggering the analyze task and, in case of a legitimate crash, the followup progression tasks:

* Minimize
* Analyze
* Impact
* Regression
* Cleanup cronjob, when updating a bug to inform the user that all above stages were finished

This PR adds instrumentation to track the time elapsed between the user upload and the completion of the above events.

* TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.
* The job name could be retrieved from the JOB_NAME env var within the uworker, however this does not work for the cleanup use case. For this reason, the job name is fetched from datastore instead.
* The ```query_testcase_upload_metadata``` method was moved from analyze_task.py to a helpers file, so it could be reused across tasks and on the cleanup cronjob

Every task mentioned was executed locally, with a valid uploaded testcase. The codepath for the metric emission was hit and produced the desired output, both for the tasks and the cronjob.

Part of #4271

### Motivation

Some cumulative distribution metrics (build age, retrieval, testcase age, testcase triage duration) are misbehaving and capping at 1. This PR intends to aid in debugging that.

This reverts commit 5404212.

### Motivation

Cumulative distribution metrics from the monitoring initiative were incorrectly set to use the fixed width bucketer, and/or width=0.05 and max_buckets=20. This caused percentile metrics to cap at 1, which is incorrect. This PR attempts to fix that by moving them all to the Geometric Bucketer, without the aforementioned limits.

It also reverts #4429, since it apparently broke triage.py in chrome.

Part of #4271
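One way to see the cap: with fixed-width buckets, the largest finite boundary is the width times the number of buckets, so 0.05 × 20 = 1, and any larger sample lands in the overflow bucket. The sketch below uses hypothetical helper names that are not ClusterFuzz's bucketer API, assumes max_buckets counts the finite buckets, and picks the geometric parameters (scale 1, growth factor 2) purely for illustration.

```python
def fixed_width_upper_bound(width, num_finite_buckets):
  """Largest value a fixed-width bucketer can resolve before overflow."""
  return width * num_finite_buckets


def geometric_upper_bound(scale, growth_factor, num_finite_buckets):
  """Largest value a geometric bucketer can resolve before overflow."""
  return scale * growth_factor**num_finite_buckets


fixed_cap = fixed_width_upper_bound(0.05, 20)
geometric_cap = geometric_upper_bound(1, 2, 20)
```

With the same bucket count, the geometric layout covers several orders of magnitude more range, at the cost of coarser resolution for large values.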

…#4414)

### Motivation

Chrome folks need to know how long on average a fuzzer takes to generate a testcase. This PR implements that.

Part of #4271

…g filing (#4415)

### Motivation

As per Chrome request, it is desirable to know how long it takes for an issue to be opened, from the moment a testcase is created.

Part of #4271

https://crbug.com/380707237 is caused by this discrepancy between master and chrome branches.

Cherry picking #4441 onto chrome temp branch Co-authored-by: Ali HIJAZI <ahijazi@google.com>

…ng everything (#4486) Now that we are lazily checking for fuzzing targets, it makes sense to allow remote unpacking even when unpacking the full archive. Furthermore, it seems that remote unpacking performs much better than local unpacking on CF bots, so this might improve the overall performance of the build_manager.

### Motivation

This merges #4489, #4458 and #4483 to the chrome temporary deployment branch.

The purpose is to have task error rate metrics, and to log which old testcases are polluting the testcase upload metrics, so we can figure out if a purge is necessary.

---------

Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>

Deploy #4492 to chrome

This merges #4494 into the temporary chrome deployment

…4498) The metric for untriaged testcase age was not considering bugs that were being filed legitimately, so there was no metric emission at all. Also, removes granularity in the stuck testcase count metric.

Merging #4500 into the chrome branch, doing CI checks first

#4503) Fixes the following error:

```
AttributeError 'ProtoType' object has no attribute 'DESCRIPTOR'
Failed to flush metrics: 'ProtoType' object has no attribute 'DESCRIPTOR'
Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 118, in _flush_metrics
    metric.monitoring_v3_time_series(series, labels, start_time, end_time,
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 350, in monitoring_v3_time_series
    self._set_value(point.value, value)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 404, in _set_value
    point.int64_value = value
    ^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/third_party/proto/message.py", line 935, in __setattr__
    pb_value = marshal.to_proto(pb_type, value)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/third_party/proto/marshal/marshal.py", line 229, in to_proto
    proto_type.DESCRIPTOR.has_options
    ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ProtoType' object has no attribute 'DESCRIPTOR'
```

Running CI checks with a PR prior to deployment

cherry-pick #4411

This adds 9 archives from Fuzzilli to the general test-input archives used by fuzzers on Clusterfuzz. The Fuzzilli-side archives are refreshed every few days. We'll add a freshness metric in a follow up. This was tested locally with butler. See bug: https://crbug.com/362963886

Cherry-pick of #4526

Introduces regex-based filtering directly in the API to improve performance and reduce the number of calls to the ab api server.

Cherry pick: #4297

Co-authored-by: aditya-wazir <108256495+aditya-wazir@users.noreply.github.com>

The device check is updated to use find instead of a literal match, so that sanitized versions of the devices (e.g. cheetah_hwasan) can also be used.

---------

Cherry pick: #4256

Co-authored-by: svasudevprasad <151788366+svasudevprasad@users.noreply.github.com>
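A minimal sketch of the difference between a literal match and a `find`-based check follows. The `device_is_supported` helper and its argument order are assumptions; the actual check in the PR may be structured differently.

```python
def device_is_supported(expected_device, reported_device):
  """Substring check, so 'cheetah_hwasan' still counts as 'cheetah'."""
  # str.find returns -1 when the substring is absent.
  return reported_device.find(expected_device) != -1


literal_match = 'cheetah_hwasan' == 'cheetah'
substring_match = device_is_supported('cheetah', 'cheetah_hwasan')
```

The literal comparison rejects the sanitized device name, while the substring check accepts it.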

… analyze step to testcase triage metric (#4558) This merges #4547 and #4516 to the chrome branch.

Merging #4530 into the Chrome branch

This adds all PRs related to the terraform dashboard.

Co-authored-by: Peter Boström <git@pbos.me>

Co-authored-by: Paul Semel <paulsemel@google.com>

Followup for the analyze/triage incident

Addresses: crbug.com/387828381

This fixes the following error:

```
  File "/usr/local/google/home/metzman/clusterfuzz-311/butler.py", line 421, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/google/home/metzman/clusterfuzz-311/butler.py", line 407, in main
    return command.execute(args)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/metzman/clusterfuzz-311/src/local/butler/deploy.py", line 576, in execute
    package.package(
  File "/usr/local/google/home/metzman/clusterfuzz-311/src/local/butler/package.py", line 87, in package
    py_unittest.execute(args={})
  File "/usr/local/google/home/metzman/clusterfuzz-311/src/local/butler/py_unittest.py", line 299, in execute
    target=args.target,
           ^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'target'
```
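The failure mode in the traceback can be reproduced in isolation: attribute access works on an `argparse.Namespace` but not on a plain dict. The `read_target` function is a hypothetical stand-in for the code at `py_unittest.py` line 299, and passing a namespace is only one possible remedy, not necessarily what this change did.

```python
import argparse


def read_target(args):
  """Hypothetical stand-in for code that expects parsed CLI arguments."""
  return args.target


# Passing a dict, as in `py_unittest.execute(args={})`, raises.
try:
  read_target({})
  message = None
except AttributeError as error:
  message = str(error)

# An argparse.Namespace exposes its fields as attributes.
namespace_target = read_target(argparse.Namespace(target=None))
```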

This merges #4624 and #4616 into chrome, to unblock deployments --------- Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>

Previous logic would overwrite `minimized_keys` if say round 5 of libfuzzer minimization failed, even if round 4 succeeded. This resulted in tasks completing minimization successfully, but not storing `minimized_keys` to reflect that in the database. Instead, we should take the result of the last successful minimization round. This both aligns with the logic around crash results and testcase file names, and prevents another side effect of this bug: we were previously deleting blobs on GCS that had been uploaded during successful minimization rounds :/ Note: it seems a similar bug affects the cleanse step, but I did not change logic there as it is OSS-Fuzz specific and I was not sure whether or not it could be intentional. See [`minimize_task.py` line 1686](https://github.com/google/clusterfuzz/pull/4626/files#diff-e8255271eeeadc2bda46b215fbc7b0bb160ef1116f6e27ff895990d88caafa5fR1686). There are exceedingly few tests for minimize task, so I did not write a test for this either - the effort required seemed too high. Bug: https://crbug.com/389589679
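The intended behavior, keeping the keys from the last successful round, can be sketched as follows. The `last_successful_keys` helper and the `(succeeded, keys)` tuple shape are hypothetical and do not reflect the structure of `minimize_task.py`.

```python
def last_successful_keys(round_results):
  """Returns the keys of the last round that succeeded, or None.

  round_results is a list of (succeeded, keys) tuples in round order.
  """
  minimized_keys = None
  for succeeded, keys in round_results:
    # A failed round must not overwrite an earlier successful result.
    if succeeded:
      minimized_keys = keys
  return minimized_keys


rounds = [(True, 'keys_round_3'), (True, 'keys_round_4'), (False, None)]
```

With `rounds` as above, round 5 fails and the result stays at the round 4 keys.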

This was requested by chrome. I can't really think of a strong justification for this paternalistic behavior, and I can think of some against it, mainly that it causes CF to behave unlike the user expects. Cherry-picked from 01239a0 in master. Fixes: b/382207330 Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>

Merging #4562 to branch 'chrome' Co-authored-by: Vitor Guidi <vitorguidi@gmail.com>

@pbos

Chrome check failure stack traces have changed, update the ignore regexes to match them.

This generalizes previous regexes to cover the new variants:

```
logging::CheckLogMessage::~CheckLogMessage
logging::DCheckLogMessage::~DCheckLogMessage
logging::CheckNoreturnError::~CheckNoreturnError
logging::NotReachedLogMessage::~NotReachedLogMessage
logging::NotReachedNoreturnError::~NotReachedNoreturnError
```

cc @pbos @tsepez for future reference

Bug: https://crbug.com/389589679

Co-authored-by: Titouan Rigoudy <titouan@chromium.org>
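For illustration, a single pattern that covers the five frames listed above could look like the sketch below. `CHECK_FAILURE_FRAME` is a hypothetical name and this is not the regex from the PR; note that it would also match a `DCheckNoreturnError` frame, which is not in the list.

```python
import re

# Hypothetical generalized pattern for check-failure destructor frames.
CHECK_FAILURE_FRAME = re.compile(
    r'^logging::(?:D?Check|NotReached)(?:LogMessage|NoreturnError)::~')

frames = [
    'logging::CheckLogMessage::~CheckLogMessage',
    'logging::DCheckLogMessage::~DCheckLogMessage',
    'logging::CheckNoreturnError::~CheckNoreturnError',
    'logging::NotReachedLogMessage::~NotReachedLogMessage',
    'logging::NotReachedNoreturnError::~NotReachedNoreturnError',
]
matched = [bool(CHECK_FAILURE_FRAME.match(frame)) for frame in frames]
```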

…OOMs. (#4639) Cherry-pick of #4635.

#4649) …read runs Merge to chrome branch.

merged to master at #4646

merge #4653 into Chrome

merge #4580 into chrome

Cherry-pick of #4656 into the `chrome` branch.

Cherry-pick of #4661 into the `chrome` branch.

merging PR #4638 chrome

Cherry-pick of #4669 into the `chrome` branch.

centipede: create workdir in prepare instead of the constructor It turns out that the same class can be used for different fuzzing rounds, and the parent directory of the temp dir is being cleared in between each round. For that reason, we need to re-create a workdir in prepare.

Cherry-pick of #4647 into `chrome` branch.

We are getting this error during butler deploy:

```
| ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
| twine 5.1.1 requires urllib3>=1.26.0, but you have urllib3 1.24.3 which is incompatible.
```

There might have been a regression in the underlying python dependencies. We ran:

```
pipenv lock
```

And this will change the following deps:

[prod]

- aiohappyballs: 2.4.2 > 2.4.6
- aiosignal: 1.3.1 > 1.3.2
- attrs: 24.2 > 25.1
- cachetools: 5.5.0 > 5.5.1
- deprecated: 1.2.14 > 1.2.18
- frozenlist: 1.4.1 > 1.5.0
- google-cloud-appengine-logging: 1.4.5 > 1.5.0
- google-apis-commons-protos: 1.65 > 1.66
- grpc-google-iam-v1: 0.13.1 > 0.14.0
- pbr: 6.1.0 > 6.1.1
- propcache: 0.2.1 (NEW)
- proto-plus: 1.24.0 > 1.26.0
- pyjwt: 2.9.0 > 2.10.1
- setuptools: 75.1.0 > 75.8.0
- six: 1.16.0 > 1.17.0
- yarl: 1.13.1 > 1.18.3

[dev]

- cachecontrol: 0.14.0 > 0.14.1
- cachetools: 5.5.0 > 5.5.1
- click: 8.1.7 > 8.1.8
- google-cloud-firestore: 2.19.0 > 2.20.0
- google-apis-common-protos: 1.65.0 > 1.66.0
- markupsafe: 2.1.5 > 3.0.2
- proto-plus: 1.24.0 > 1.26.0
- pyjwt: 2.9.0 > 2.10.1

### Testing

We successfully deployed from butler with --release=chrome-tests-syncer and --target=zips, which unblocks deployments.

Fix a bug introduced in #4675, which was deleting the manifest file before uploading it to GCP during deployment.

…" (#4686) This reverts commit aab2f92.

paulsemel and others added 30 commits November 8, 2024 09:37

add logs at the end of unpacking (#4380)

6f3dddc

We are currently logging progression when unpacking, but we're missing the last log. This can be useful to monitor what's going on in CF.

Using correct JOB_NAME env var for JOB_BUILD_RETRIEVAL_TIME (#4398)

0d751b9

Add logging for misbehaving distribution metrics (#4429)

5404212

### Motivation

Some cumulative distribution metrics (build age, retrieval, testcase age, testcase triage duration) are misbehaving and capping at 1. This PR intends to aid in debugging that.

Revert "Add logging for misbehaving distribution metrics (#4429)"

387fc03

This reverts commit 5404212.

[Monitoring] Adding a blackbox fuzzer testcase generation time metric (…

8394630

…#4414)

### Motivation

Chrome folks need to know how long on average a fuzzer takes to generate a testcase. This PR implements that.

Part of #4271

[Monitoring] Adding metric to track time from testcase creation to bu…

20237ab

…g filing (#4415)

### Motivation

As per Chrome request, it is desirable to know how long it takes for an issue to be opened, from the moment a testcase is created.

Part of #4271

Fix Datetime not being included in data_types.py (#4439)

ef74782

https://crbug.com/380707237 is caused by this discrepancy between master and chrome branches.

Fix Analyze Task (#4441) (#4442)

2083e89

Cherry picking #4441 onto chrome temp branch Co-authored-by: Ali HIJAZI <ahijazi@google.com>

Add analyze task postprocess tests

382e2e1

Merge #4492 to chrome temporary branch (#4493)

8e6d060

Deploy #4492 to chrome

Merge 4494 into chrome temp branch (#4495)

a5746d6

This merges #4494 into the temporary chrome deployment

[Monitoring] Remove granularity for stuck testcases metric (#4496) (#…

978c311

…4498) The metric for untriaged testcase age was not considering bugs that were being filed legitimately, so there was no metric emission at all. Also, removes granularity in the stuck testcase count metric.

Merge #4500 to the chrome branch (#4501)

fc61aff

Merging #4500 into the chrome branch, doing CI checks first

Merge #4499 and #4481 into chrome branch (#4505)

19fea40

Running CI checks with a PR prior to deployment

Test

80b37ea

Restore logging for android commands (#4480)

61f2558

cherry-pick #4411

Delete test push (#4509)

2946c5d

Enable skipping minimization with an env var (#4527)

7bdd80b

Cherry-pick of #4526

jonathanmetzman and others added 30 commits December 26, 2024 09:09

[Chrome deployment] Split utasks properly into success and error, add…

5655b43

… analyze step to testcase triage metric (#4558) This merges #4547 and #4516 to the chrome branch.

Close old non reproducible bugs (#4559)

3a96f79

Merging #4530 into the Chrome branch

Merge all dashboards PRs into chrome (#4567)

5fdead8

This adds all PRs related to the terraform dashboard.

Cherry pick #4579 into temporary chrome branch. (#4594)

8604198

Co-authored-by: Peter Boström <git@pbos.me>

Cherry pick of 98f7d7e (#4595)

a227d9d

Co-authored-by: Paul Semel <paulsemel@google.com>

Merge #4575 into the chrome branch (#4597)

525d980

Followup for the analyze/triage incident

Cherry pick of 4cca6fc (#4609)

2fb11c9

Cherry pick #4587 (#4605)

90d136e

Addresses: crbug.com/387828381

Merge deploy reverts to chrome (#4628)

4e8d844

This merges #4624 and #4616 into chrome, to unblock deployments --------- Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>

Check for upload permission based on google group (#4625)

13787d5

Merging #4562 to branch 'chrome' Co-authored-by: Vitor Guidi <vitorguidi@gmail.com>

Merge #4635 into chrome branch - Recognize Rust allocation errors as …

a993638

…OOMs. (#4639) Cherry-pick of #4635.

Add some logs to investigate minimization issues and disable multi th… (

602fc84

#4649) …read runs Merge to chrome branch.

Add external testcase config to local_config (#4646) (#4651)

e4771b6

merged to master at #4646

Implement rudimentary rate limiting (#4653) (#4658)

c7f14ff

merge #4653 into Chrome

Update issue type and component for VRP uploaded bugs (#4580) (#4664)

5fdca6b

merge #4580 into chrome

Improve logs of the grouper cron job (#4666)

19df64f

Cherry-pick of #4656 into the `chrome` branch.

Use data bundle bucket name from the database. (#4667)

0a8fdfa

Cherry-pick of #4661 into the `chrome` branch.

Clean up external_testcase_reader for testing round 1 (#4665)

9f48cff

merging PR #4638 chrome

Convert testcase IDs to strings before joining. (#4670)

f0cadf4

Cherry-pick of #4669 into the `chrome` branch.
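The commit title above points at a common pitfall: `str.join` only accepts strings, so integer IDs must be converted first. A minimal sketch, where the separator and variable names are assumptions and the IDs are the ones mentioned elsewhere in this conversation:

```python
testcase_ids = [4992158360403968, 4505741036158976]

# ', '.join(testcase_ids) would raise TypeError, since the items are ints.
joined = ', '.join(str(testcase_id) for testcase_id in testcase_ids)
```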

Enable deploying chrome-tests-syncer independently (#4675)

2c697b3

Cherry-pick of #4647 into `chrome` branch.

Fix package deployment following tests-syncer changes (#4683)

6d50743

Fix a bug introduced in #4675, which was deleting the manifest file before uploading it to GCP during deployment.

Revert "Bump dependencies to fix twine urllib mismatch (#4681) (#4682)…

7a2f0b0

…" (#4686) This reverts commit aab2f92.
