Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add time-based cleanup for Maven snapshot versions #33420

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dianaStr7
Copy link
Contributor

Pull Request: Implement Cleanup Function for Maven Snapshot Versions

Overview

This pull request introduces a cleanup function for Maven snapshot versions in Gitea, enabling more efficient management of package storage. The new feature allows users to specify how many of the most recent Maven snapshot builds to retain, optimizing storage by automatically removing older files.

Features

  • New Configurable Variable: RETAIN_MAVEN_SNAPSHOT_BUILDS allows setting the number of Maven snapshot builds to keep.
    • Default is -1, which keeps all builds.
  • Automated Cleanup: Tied to the cleanup expired data process, this function targets files within Maven snapshots, not the versions themselves, ensuring that only essential files are kept.

Implementation

The feature extends the existing cleanup job to include files from Maven snapshot versions based on the specified retention policy set in app.ini. It checks against the highest build number from the Maven metadata file to determine which files to retain.

Impact

This enhancement helps manage disk space more effectively by providing control over how many builds of Maven snapshots are retained, potentially reducing storage requirements for projects using Maven within Gitea.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jan 27, 2025
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 27, 2025
@github-actions github-actions bot added modifies/go Pull requests that update Go code docs-update-needed The document needs to be updated synchronously labels Jan 27, 2025
@dianaStr7
Copy link
Contributor Author

Considered Alternative Approach

I considered another way to handle Maven snapshot cleanup that uses the properties field for package files to store build numbers right when files are uploaded. This would make database searches simpler.

This method would need a database migration to work, so it's a bit more complicated. I have the code ready for this alternative if we think it's worth the extra steps later on.

@delvh delvh changed the title Added cleanup method for files in Maven snapshot versions Add time-based cleanup for Maven snapshot versions Jan 31, 2025
@delvh
Copy link
Member

delvh commented Jan 31, 2025

Note to any reviewer: Review this PR carefully.
It seems like an LLM at the very least helped with creating this PR.
That in and of itself is nothing bad, as long as the output is indeed correct.
Given my experience with ChatGPT, I find it rather likely that it missed some edge cases though.

@delvh delvh requested a review from KN4CK3R January 31, 2025 17:30
@dianaStr7
Copy link
Contributor Author

A remark: Yes, it's true that I used ChatGPT for help since I'm not an experienced Go developer, but I handled the core logic and testing myself. I'd appreciate any comments or critiques to help us improve this, as we really need the feature :)

Note to any reviewer: Review this PR carefully. It seems like an LLM at the very least helped with creating this PR. That in and of itself is nothing bad, as long as the output is indeed correct. Given my experience with ChatGPT, I find it rather likely that it missed some edge cases though.

@@ -101,7 +103,7 @@ func loadPackagesFrom(rootCfg ConfigProvider) (err error) {
Packages.LimitSizeRubyGems = mustBytes(sec, "LIMIT_SIZE_RUBYGEMS")
Packages.LimitSizeSwift = mustBytes(sec, "LIMIT_SIZE_SWIFT")
Packages.LimitSizeVagrant = mustBytes(sec, "LIMIT_SIZE_VAGRANT")
Packages.DefaultRPMSignEnabled = sec.Key("DEFAULT_RPM_SIGN_ENABLED").MustBool(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line seems to have been deleted accidentally

var metadata MavenMetadata

dec := xml.NewDecoder(r)
dec.CharsetReader = charset.NewReaderLabel // Assuming charset.NewReaderLabel is a function you've set up to handle character encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erm...
I mean, it should probably work if I'm reading the documentation correctly, but the comment seems really out of place

Suggested change
dec.CharsetReader = charset.NewReaderLabel // Assuming charset.NewReaderLabel is a function you've set up to handle character encoding.
dec.CharsetReader = charset.NewReaderLabel

@@ -109,3 +131,20 @@ func ParsePackageMetaData(r io.Reader) (*Metadata, error) {
Dependencies: dependencies,
}, nil
}

// ParseMavenMetadata parses the Maven metadata XML to extract the build number.
func ParseMavenMetaData(r io.Reader) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thoroughly confused by this method signature and docs: Nothing indicates to me that this method parses the entire XML, and throws the result except for the build number away.

}
}

log.Info("Filtered %d files out of %d total files for version ID %d with maxBuildNumber %d", len(filteredFiles), len(files), versionID, maxBuildNumber)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm… I do not like seeing log statements inside models/.
That should be the job of the services/ layer.
Thus, what about adding an extra return value (skippedFiles) that contains all skipped files?

}
}

log.Info("Filtered %d files out of %d total files for version ID %d with maxBuildNumber %d", len(filteredFiles), len(files), versionID, maxBuildNumber)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Info logs are rare within Gitea.
At most, gather statistics inside a success message after successfully running the cron job, or downgrade it to debug.

}

func isSnapshotVersion(version string) bool {
return strings.Contains(version, "-SNAPSHOT")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it even HasSuffix?

@@ -226,6 +229,62 @@ func HasFiles(ctx context.Context, opts *PackageFileSearchOptions) (bool, error)
return db.Exist[PackageFile](ctx, opts.toConds())
}

// GetFilesByBuildNumber retrieves all files for a package version with build numbers <= maxBuildNumber.
func GetFilesByBuildNumber(ctx context.Context, versionID int64, maxBuildNumber int) ([]*PackageFile, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that rather GetFilesBelowBuildNumber instead?


buildNumber, err := strconv.Atoi(buildNumberStr)
if err != nil {
return 0, fmt.Errorf("failed to convert build number to integer: '%s'", buildNumberStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return 0, fmt.Errorf("failed to convert build number to integer: '%s'", buildNumberStr)
return 0, fmt.Errorf("failed to convert maven package build number to integer: '%s'", buildNumberStr)


thresholdBuildNumber := maxBuildNumber - retainBuilds
if thresholdBuildNumber <= 0 {
log.Info("No files to clean up, as the threshold build number is less than or equal to zero for versionID %d", versionID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the error messages and logs in general, I recommend explicitly stating during what sort of action they occurred.
When you actually see these logs, it will be hard to guess that they belong to the Maven snapshot cleanup otherwise, especially when a request happens in the meantime.

return nil
}

func extractMaxBuildNumberFromMetadata(ctx context.Context, metadataFile *packages.PackageFile) (int, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you define this method in models already?

@GiteaBot GiteaBot added lgtm/blocked A maintainer has reservations with the PR and thus it cannot be merged and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-update-needed The document needs to be updated synchronously lgtm/blocked A maintainer has reservations with the PR and thus it cannot be merged modifies/go Pull requests that update Go code size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants