Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Temporary chrome deployment #4440

Open
wants to merge 73 commits into
base: master
Choose a base branch
from
Open

[DO NOT MERGE] Temporary chrome deployment #4440

wants to merge 73 commits into from

Conversation

vitorguidi
Copy link
Collaborator

No description provided.

paulsemel and others added 30 commits November 8, 2024 09:37
We are currently logging progression when unpacking, but we're missing
the last log. This can be useful to monitor what's going on in CF.
### Motivation

We currently lack metrics for build retrieval and unpacking times. This
PR adds that, with granularity by fuzz target and job type.

There are two different implementations for build downloading/unpacking:

- In the Build class, from which RegularBuild, SplitTargetBuild,
FuchsiaBuild and SymbolizedBuild inherit the downloading/unpacking
behavior
- In the CustomBuild class, which implements its own logic

There are two possible cases for downloading/unpacking: clusterfuzz
either downloads the whole build and unpacks it locally, or unpacks it
remotely. This is the case for all build types except CustomBuild, which
does not perform remote unpacking.

For build retrieval over http, we do not track download time. For all
the other cases, it suffices to keep track of start/finish time for
download and unpacking.

Finally, a _build_type is added to the constructor of the Build class,
from which all other inherit. This is used to track the build type
(debug or release), and is only mutated by SymbolizedBuild when
attempting to fetch a debug build.

Part of #4271
For non engine fuzzers, we do not need to have the full list of fuzz
targets, since those builds are only containing the APP_NAME. For that
reason, and when the build is fully unpacked, lazily fetch the fuzzing
targets only when requested by the user of the class.

This change will tremendously speed up the unpacking step for some of
our fuzzers, see
https://docs.google.com/document/d/1OepfVcuG2XNXLxgZIXVwgE-uOhX9RO5RfH0TSg_wmPU/.

Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>
### Motivation

We currently lack awareness on how old builds are during fuzz task. This
PR implements that, by making the assumption that the Last Update Time
metadata field in GCS is a good proxy for build age. [Documentation
reference](https://cloud.google.com/storage/docs/json_api/v1/objects#resource)

### Approach

Symbolized and custom builds do not matter, thus all builds of interest
will be fetched from ```build_manager.setup_regular_build```. Logic for
collecting all bucket paths and the latest revision was refactored, so
that ```setup_regular_build``` can also figure out the latest revision
for a given build and conditionally emit the proposed metric.

### Testing strategy

!Todo: test this for fuzz, analyze, progression

Locally ran tasks, with instructions from #4343 and #4345 , and verified
the _emmit_build_age_metric function gets invoked and produces sane
output.

Commands used:
```
fuzz libFuzzer libfuzzer_asan_log4j2
```

![image](https://github.com/user-attachments/assets/66937297-20ec-44cf-925e-0004a905c92e)

```
progression 4992158360403968 libfuzzer_asan_qt
```

![image](https://github.com/user-attachments/assets/0e1f1199-d1d8-4da5-814e-8d8409d1f806)

```
analyze 4992158360403968 libfuzzer_asan_qt (disclaimer: build revision was overriden mid flight to force a trunk build, since this testcase was already tied to a crash revision)
```

![image](https://github.com/user-attachments/assets/dd3d5a60-36a1-4a9e-a21b-b72177ffdecd)


Part of #4271
### Motivation

Small changes for the build retrieval metric, as requested per Chrome
folks after initially using the feature:

* Adding the time for listing fuzz targets as a step
* Added the total retrieval duration
* Tracking metric as minutes, for readability

Part of #4271
### Motivation

The Chrome team has no easy visibility into how many manually uploaded
test cases flake or successfully reproduce. This PR implements a counter
metric to track that.

There are three possible outcomes, each represented by a string label:
'reproduces', 'one_timer' and 'does_not_reproduce'

Part of #4271
…4381)

### Motivation


Once a testcase is generated (or manually uploaded), followup tasks
(analyze/progression) are started. This happens by publishing to a
pubsub queue, both for the manually uploaded case, and for the fuzzer
generated case.

If for any reason the messages are not processed, the testcase gets
stuck. To get better visibility into these stuck testcases, the
UNTRIAGED_TESTCASE_AGE metric is introduced, to pinpoint how old these
testcases that have not yet been triaged are(more precisely, gone
through analyze/regression/impact/progression tasks).


### Attention points

Testcase.timestamp mutates in analyze task:


https://github.com/google/clusterfuzz/blob/6ed80851ad0f6f624c5b232b0460c405f0a018b5/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py#L589

This makes it unreliable as a source of truth for testcase creation
time. To circumvent that, a new ```created``` field is added to the
Testcase entity, from which we can derive the correct creation time.

Since this new field will only apply for testcases created after this PR
is merged, Testcase.timestamp will be used instead to calculate the
testcase age when the new field is missing.

### Testing strategy

Ran the triage cron locally, and verified the codepath for the metric is
hit and produces sane output (reference testcase: 4505741036158976).

![image](https://github.com/user-attachments/assets/6281b44f-768a-417e-8ec1-763f132c8181)


Part of #4271
Chrome security shepherds manually upload testcases through appengine,
triggering analyze task and, in case of a legitimate crash, the followup
progression tasks:
* Minimize
* Analyze
* Impact
* Regression
* Cleanup cronjob, when updating a bug to inform the user that all above
stages were finished

This PR adds instrumentation to track the time elapsed between the user
upload, and the completion of the above events.

* TestcaseUploadMetadata.timestamp was being mutated on the preprocess
stage for analyze task. This mutation was removed, so that this entity
can be the source of truth for when a testcase was in fact uploaded by
the user.

* The job name could be retrieved from the JOB_NAME env var within the
uworker, however this does not work for the cleanup use case. For this
reason, the job name is fetched from datastore instead.

* The ```query_testcase_upload_metadata``` method was moved from
analyze_task.py to a helpers file, so it could be reused across tasks
and on the cleanup cronjob

Every task mentioned was executed locally, with a valid uploaded
testcase. The codepath for the metric emission was hit and produced the
desired output, both for the tasks and the cronjob.

Part of #4271
### Motivation

Some cumulative distribution metrics (build age, retrieval, testcase
age, testcase triage duration) are misbehaving and capping at 1. This PR
intends to aid in debugging that.
### Motivation

Cumulative distribution metrics from the monitoring initiative were
incorrectly set to use the fixed width bucketer, and/or width=0.05 and
max_buckets=20. This caused percentile metrics to cap at 1, which was
wrong behavior.

This PR attempts to fix that by moving them all to Geometric Bucketer,
without the aforementioned limits. It also reverts #4429 , since it
apparently broke triage.py in chrome.

Part of #4271
…#4414)

### Motivation

Chrome folks need to know how long on average a fuzzer takes to generate
a testcase. This PR implements that.

Part of #4271
…g filing (#4415)

### Motivation

As per Chrome request, it is desirable to know how long it takes for an
issue to be opened, from the moment a testcase is created.

Part of #4271
https://crbug.com/380707237 is caused by this discrepency between master
and chrome branches.
Cherry picking #4441 onto chrome temp branch

Co-authored-by: Ali HIJAZI <ahijazi@google.com>
…ng everything (#4486)

Now that we are lazily checking for fuzzing targets, it makes sense to
allow remote unpacking even when unpacking the full archive.
Furthermore, it seems that remote unpacking performances are much higher
than local unpacking on CF bots, so this might improve overall
performances of the build_manager.
### Motivation

This merges #4489, #4458 and #4483 to the chrome temporary deployment
branch

The purpose is to have task error rate metrics, and log what old
testcases are polluting the testcase upload metrics, so we can figure
out if a purge is necessary

---------

Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>
This merges #4494 into the temporary chrome deployment
…4498)

The metric for untriaged testcae age was not considering bugs that were
being filed legitimately, so there was no metric emission at all.

Also, removes granularity in the stuck testcase count metric.
Merging #4500 into the chrome branch, doing CI checks first
#4503)

Fixes the following error:

```
AttributeError
'ProtoType' object has no attribute 'DESCRIPTOR'
Failed to flush metrics: 'ProtoType' object has no attribute 'DESCRIPTOR' Traceback (most recent call last): File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 118, in _flush_metrics metric.monitoring_v3_time_series(series, labels, start_time, end_time, File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 350, in monitoring_v3_time_series self._set_value(point.value, value) File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/metrics/monitor.py", line 404, in _set_value point.int64_value = value ^^^^^^^^^^^^^^^^^ File "/mnt/scratch0/clusterfuzz/src/third_party/proto/message.py", line 935, in __setattr__ pb_value = marshal.to_proto(pb_type, value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/mnt/scratch0/clusterfuzz/src/third_party/proto/marshal/marshal.py", line 229, in to_proto proto_type.DESCRIPTOR.has_options ^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'ProtoType' object has no attribute 'DESCRIPTOR'
```
Running CI checks with a PR prior to deployment
This adds 9 archives from Fuzzilli to the general test-input archives
used by fuzzers on Clusterfuzz. The Fuzzilli-side archives are refreshed
every few days. We'll add a freshness metric in a follow up.

This was tested locally with butler.

See bug: https://crbug.com/362963886
jonathanmetzman and others added 30 commits December 26, 2024 09:09
Introduces regex-based filtering directly in the API to improve
performance and reduce number of calls to ab api server.

Cherry pick: #4297

Co-authored-by: aditya-wazir <108256495+aditya-wazir@users.noreply.github.com>
The device check is updated to use find instead of a literal match so
that sanitized version of the devices (e.g: cheetah_hwasan) can also be
used

---------

Cherry pick: #4256

Co-authored-by: svasudevprasad <151788366+svasudevprasad@users.noreply.github.com>
… analyze step to testcase triage metric (#4558)

This merges #4547 and #4516 to the chrome branch.
Merging #4530 into the Chrome
branch
This adds all PRs related to the terraform dashboard.
Co-authored-by: Peter Boström <git@pbos.me>
Co-authored-by: Paul Semel <paulsemel@google.com>
Followup for the analyze/triage incident
Addresses: crbug.com/387828381
This fixes the following error:

```
  File "/usr/local/google/home/metzman/clusterfuzz-311/butler.py", line 421, in <module>                                                                                                                           
    sys.exit(main())                                                                                                                                                                                               
             ^^^^^^                                                                                                                                                                                                
  File "/usr/local/google/home/metzman/clusterfuzz-311/butler.py", line 407, in main                                                                                                                               
    return command.execute(args)                                                                                                                                                                                   
           ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                   
  File "/usr/local/google/home/metzman/clusterfuzz-311/src/local/butler/deploy.py", line 576, in execute                                                                                                           
    package.package(                                                                                                                                                                                               
  File "/usr/local/google/home/metzman/clusterfuzz-311/src/local/butler/package.py", line 87, in package
    py_unittest.execute(args={})                                                                                                                                                                                   
  File "/usr/local/google/home/metzman/clusterfuzz-311/src/local/butler/py_unittest.py", line 299, in execute                                 
    target=args.target,                                                                                                                                                                                            
           ^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'target'
```
This merges #4624 and #4616 into chrome, to unblock deployments

---------

Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>
Previous logic would overwrite `minimized_keys` if say round 5 of
libfuzzer minimization failed, even if round 4 succeeded. This resulted
in tasks completing minimization successfully, but not storing
`minimized_keys` to reflect that in the database.

Instead, we should take the result of the last successful minimization
round. This both aligns with the logic around crash results and testcase
file names, and prevents another side effect of this bug: we were
previously deleting blobs on GCS that had been uploaded during
successful minimization rounds :/

Note: it seems a similar bug affects the cleanse step, but I did not
change logic there as it is OSS-Fuzz specific and I was not sure whether
or not it could be intentional. See [`minimize_task.py` line
1686](https://github.com/google/clusterfuzz/pull/4626/files#diff-e8255271eeeadc2bda46b215fbc7b0bb160ef1116f6e27ff895990d88caafa5fR1686).

There are exceedingly few tests for minimize task, so I did not write a
test for this either - the effort required seemed too high.

Bug: https://crbug.com/389589679
This was requested by chrome. I can't really think of a strong
justification for this paternalistic behavior, and I can think of some
against it, mainly that it causes CF to behave unlike the user expects.

Cherry-picked from 01239a0 in master.

Fixes: b/382207330

Co-authored-by: jonathanmetzman <31354670+jonathanmetzman@users.noreply.github.com>
Merging #4562 to branch
'chrome'

Co-authored-by: Vitor Guidi <vitorguidi@gmail.com>
Chrome check failure stack traces have changed, update the ignore
regexes to match them. This generalizes previous regexes to cover the
new variants:

```
logging::CheckLogMessage::~CheckLogMessage
logging::DCheckLogMessage::~DCheckLogMessage
logging::CheckNoreturnError::~CheckNoreturnError
logging::NotReachedLogMessage::~NotReachedLogMessage
logging::NotReachedNoreturnError::~NotReachedNoreturnError
```

cc @pbos @tsepez for future reference

Bug: https://crbug.com/389589679

Co-authored-by: Titouan Rigoudy <titouan@chromium.org>
Cherry-pick of #4656 into the `chrome` branch.
Cherry-pick of #4661 into the `chrome` branch.
Cherry-pick of #4669 into the `chrome` branch.
centipede: create workdir in prepare instead of the constructor

It turns out that the same class can be used for different fuzzing
rounds, and the parent directory of the temp dir is being cleared in
between each round. For that reason, we need to re-create a workdir in
prepare.
Cherry-pick from of #4647 into `chrome` branch.
We are getting this error during butler deploy:

```
| ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
| twine 5.1.1 requires urllib3>=1.26.0, but you have urllib3 1.24.3 which is incompatible.
```

There might have been a regression in the underlying python
dependencies. We ran:

```
pipenv lock
```

And this will change the following deps:

[prod]
aiohappyballs: 2.4.2 > 2.4.6
aiosignal: 1.3.1 > 1.3.2
attrs: 24.2 > 25.1
cachetools: 5.5.0 > 5.5.1
deprecated: 1.2.14 > 1.2.18
frozenlist: 1.4.1 > 1.5.0
google-cloud-appengine-logging: 1.4.5 > 1.5.0
google-apis-commons-protos: 1.65  > 1.66
grpc-google-iam-v1: 0.13.1 > 0.14.0
pbr: 6.1.0 > 6.1.1
propcache: 0.2.1 (NEW)
proto-plus: 1.24.0 > 1.26.0
pyjwt: 2.9.0 > 2.10.1
setuptools: 75.1.0 > 75.8.0
six: 1.16.0 > 1.17.0
yarl: 1.13.1 > 1.18.3

[dev]
cachecontrol: 0.14.0 > 0.14.1
cachetools: 5.5.0 > 5.5.1
click: 8.1.7 > 8.1.8
google-cloud-firestore: 2.19.0 > 2.20.0
google-apis-common-protos: 1.65.0 > 1.66.0
markupsafe: 2.1.5 > 3.0.2
proto-plus: 1.24.0 > 1.26.0
pyjwt: 2.9.0 > 2.10.1

### Testing

We successfully deployed from butler with --release=chrome-tests-syncer
and --target=zips, this unblocks deployments
Fix a bug inserted in #4675, which was deleting the manifest file before
uploading it to the GCP during deployment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants