Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

solr reindexing through events #873

Open
wants to merge 12 commits into
base: dtq-dev
Choose a base branch
from

Conversation

Paurikova2
Copy link
Collaborator

@Paurikova2 Paurikova2 commented Feb 13, 2025

Phases MP MM MB MR JM Total
ETA 0 0 0 0 0 0
Developing 8 0 0 0 0 0
Review 0 0 0 0 0 0
Total - - - - - 0
ETA est. 0
ETA cust. - - - - - 0

Problem description

Modify the reindexing of an item by event.

Summary by CodeRabbit

  • New Features

    • Introduced enhanced event handling to improve system responsiveness.
    • Added an automated build trigger for customer branches.
  • Refactoring & Improvements

    • Streamlined logging and access control processes for more robust permission management.
    • Removed outdated indexing components to optimize update performance.
  • UI & Configuration

    • Updated submission form fields to support CMDI file uploads.
    • Strengthened data integrity constraints in resource management for improved reliability.

@Paurikova2 Paurikova2 linked an issue Feb 13, 2025 that may be closed by this pull request
@Paurikova2
Copy link
Collaborator Author

@milanmajchrak
Please check. In v5, there is a line in the code at https://github.com/dataquest-dev/DSpace/pull/873/files#diff-bb13118f9051e0624b4f4070b505be320b2e7a98dca5b247ca22dab9ac9cef60R136: Context(Constants.Modes.READ_ONLY). But in v7, the Context() is used for anonymous read access, isn’t it? So, I removed the READ_ONLY mode. (With this mode, I had the problem that the editor could not see changes.) Is this correct? I think this issue might reflect Ondrej’s potential problem with the anonymous context mentioned in #585.

@Paurikova2 Paurikova2 requested a review from milanmajchrak March 4, 2025 10:02
Copy link
Collaborator

@milanmajchrak milanmajchrak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also you have a merge conflict

@Paurikova2 Paurikova2 requested a review from milanmajchrak March 6, 2025 20:58
@Paurikova2
Copy link
Collaborator Author

@milanmajchrak I added test to test scenatious.

Copy link
Collaborator

@milanmajchrak milanmajchrak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the code duplicities.

@Paurikova2 Paurikova2 requested a review from milanmajchrak March 9, 2025 18:22
@milanmajchrak milanmajchrak requested a review from vidiecan March 10, 2025 08:43
@vidiecan
Copy link

@coderabbitai review

Copy link

coderabbitai bot commented Mar 11, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

coderabbitai bot commented Mar 11, 2025

Walkthrough

This pull request refactors and standardizes the handling of resource policies, event logging, and authorization across the codebase. Major changes include consolidating event logging via a new method in the resource policy service, updating method signatures to pass user and group parameters directly, and standardizing null checks. The dependency on SolrOAIReindexer has been removed from several repository classes, and configuration files have been updated to support new event consumers and dependency exclusions. Additional improvements enhance error handling in Sword v2 integration tests and adjust submission form metadata fields.

Changes

File(s) Change Summary
dspace-api/src/main/java/org/dspace/authorize/ResourcePolicyServiceImpl.java
dspace-api/src/main/java/org/dspace/authorize/service/ResourcePolicyService.java
Refactored event logging by adding the addEventModify method; updated ResourcePolicy creation to pass EPerson/Group parameters.
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java Replaced inline null checks with Objects.isNull and Objects.nonNull for readability.
dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java Removed unused import of org.apache.commons.lang3.StringUtils.
dspace-server-webapp/src/main/java/org/dspace/app/rest/repository/DSpaceObjectRestRepository.java
.../ItemRestRepository.java
.../WorkflowItemRestRepository.java
.../SolrOAIReindexer.java
Removed the SolrOAIReindexer dependency and its related reindexing calls.
dspace-server-webapp/src/main/java/org/dspace/app/rest/repository/ResourcePolicyRestRepository.java Removed return statement for EPerson branch in createAndReturn, modifying the control flow for policy creation.
dspace-api/src/main/java/org/dspace/app/itemimport/ItemImportServiceImpl.java Updated setPermission method to use a new rpType string parameter and pass the group directly during ResourcePolicy creation.
dspace-api/src/main/java/org/dspace/authorize/AuthorizeServiceImpl.java
dspace-api/src/main/java/org/dspace/authorize/FixDefaultPolicies.java
dspace-api/src/main/java/org/dspace/authorize/PolicySet.java
Updated various ResourcePolicy creation calls to accept EPerson/Group parameters directly, streamlining the policy setup process.
dspace-api/src/main/java/org/dspace/content/InstallItemServiceImpl.java
dspace-api/src/main/java/org/dspace/content/crosswalk/METSRightsCrosswalk.java
dspace-api/src/main/java/org/dspace/content/packager/AbstractMETSIngester.java
Modified ResourcePolicy instantiation patterns and added AuthorizeService support in ingestion flows.
dspace-api/src/main/resources/org/dspace/storage/rdbms/sqlmigration/*/V7.6_2023.09.28__enforce_group_or_eperson_for_resourcepolicy.sql Deleted records with null eperson and group; added a constraint enforcing at least one non-null identifier.
dspace/config/clarin-dspace.cfg Added the XOAI event consumer with specific filters and updated the default event consumers list.
.github/workflows/build.yml
.github/workflows/trigger-builds.yml
Upgraded checkout action from v3 to v4 and introduced a new workflow to trigger builds for customer branches.
dspace-server-webapp/pom.xml Added exclusions for conflicting dependencies (e.g., json-smart) and introduced a new json-smart dependency for proper version management.
dspace-swordv2/src/main/java/org/dspace/sword2/CollectionDepositManagerDSpace.java
dspace-swordv2/src/main/java/org/dspace/sword2/SwordMETSContentIngester.java
Enhanced error handling by adding specific catch blocks for Sword-related exceptions.
dspace/config/submission-forms.xml
dspace/config/submission-forms_cs.xml
Renamed <dc-element>hasMetadata</dc-element> to <dc-element>hasCMDI</dc-element> and updated field attributes (e.g., repeatable flag).
(Various test builder and REST test files) Updated ResourcePolicy builder and test classes to pass additional parameters (EPerson/group) directly, removing chained .withUser()/.withGroup() calls.

Sequence Diagram(s)

sequenceDiagram
    participant Service as ResourcePolicyServiceImpl
    participant Context as DSpace Context
    participant DSO as DSpaceObject (Item)
    participant Event as Event Logger

    Service->>+Service: addEventModify(context, dso)
    alt dso is an Item
       Service->>Event: Create MODIFY event
       Service->>Context: Register event in context
    else
       Service->>Context: Skip event creation
    end
Loading
sequenceDiagram
    participant Client
    participant DepositManager as CollectionDepositManagerDSpace
    participant Logger

    Client->>DepositManager: createNew(depositData)
    alt SwordAuthException occurs
       DepositManager->>Logger: Log SwordAuthException
       DepositManager-->>Client: Throw SwordAuthException
    else
       DepositManager->>Client: Process deposit normally
    end
Loading

Poem

I'm a little bunny, hopping through the code,
Finding refactored paths lighten my load.
Policies now flow with a crisp, clear beat,
Logs, tests, and configs all wonderfully neat.
With every new change, my joy's on the rise—
Code carrots and bounces beneath bright skies!

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (1)

157-169: ⚠️ Potential issue

Remove duplicated code block to avoid compile or runtime issues.

These lines appear to repeat the exception-handling and finalization logic already included above (lines 141-149). Keeping both copies may lead to compiler errors or unexpected behavior.

-            throw e;
-        } finally {
-            if (Objects.nonNull(anonymousContext)) {
-                anonymousContext.complete();
-            }
-        }
-    }
-    
-    public void finish(Context ctx) throws Exception {
-        // No-op
-    }
-}
+    // Remove the duplicated lines (157-169) if they are artifacts from a merge or snippet error
🧹 Nitpick comments (14)
.github/workflows/trigger-builds.yml (1)

19-29: Enhance Shell Script Robustness in Build Triggering

The current script block successfully authenticates with the GitHub CLI, fetches remote branches, and iterates over branches matching the customer/ pattern to trigger their builds.

Points to consider:

  • Error Handling: If no matching branches are found, the loop silently does nothing. It could be useful to add a check for an empty branch list or log a message indicating that no customer branches were detected.
  • Quoting Variables: To prevent potential issues with branch names containing special characters or spaces, consider quoting the branch variable when used in commands.

Below is a suggested diff snippet that incorporates these improvements:

-          git fetch --prune origin  # Ensure remote refs are fetched
-          BRANCHES=$(git ls-remote --heads origin | awk -F'/' '{print $3"/"$4}' | grep '^customer/')
-          for branch in $(echo "$BRANCHES" | sed -e 's/[\[\]"]//g' -e 's/,/\n/g'); do
-            echo "Triggering build for branch: $branch"
-            gh workflow run build.yml --ref $branch
-          done
+          set -e  # Exit on any error
+          git fetch --prune origin  # Ensure remote refs are fetched
+          BRANCHES=$(git ls-remote --heads origin | awk -F'/' '{print $3"/"$4}' | grep '^customer/')
+          if [ -z "$BRANCHES" ]; then
+            echo "No customer branches found to trigger."
+          else
+            for branch in $(echo "$BRANCHES" | sed -e 's/[\[\]"]//g' -e 's/,/\n/g'); do
+              echo "Triggering build for branch: $branch"
+              gh workflow run build.yml --ref "$branch"
+            done
+          fi

These changes will make the script more robust and clearer in its intent.

dspace-api/src/main/resources/org/dspace/storage/rdbms/sqlmigration/h2/V7.6_2023.09.28__enforce_group_or_eperson_for_resourcepolicy.sql (1)

9-9: Cleaning up invalid resource policies.

The SQL statement removes resource policies that don't have an associated EPerson or Group, which is a good preparatory step before enforcing the constraint. Consider adding a comment explaining the significance of this cleanup for future maintainers.

-DELETE FROM ResourcePolicy WHERE eperson_id is null and epersongroup_id is null;
+-- Remove orphaned resource policies with no associated user or group
+DELETE FROM ResourcePolicy WHERE eperson_id is null and epersongroup_id is null;
dspace/config/clarin-dspace.cfg (1)

308-308: Uncommented metadata editing configuration

The line for allowing metadata editing has been uncommented but left empty. This appears to be preparation for future configuration.

Consider adding a comment explaining the intended use of this configuration and why it's currently empty.

dspace-oai/src/main/java/org/dspace/event/OAIIndexEventConsumer.java (3)

36-38: Consider removing or documenting this empty initialize() method.

This method is currently empty and does not appear to be overridden. If it's not required by an interface contract or used by downstream subclasses, you can remove it to reduce boilerplate code.


47-100: Refactor nested ifs to improve maintainability.

The large block of conditional checks in the consume method makes the code difficult to read and maintain. Consider extracting smaller logic units (e.g., separate methods for handling items, bundles, and communities) or using a strategy pattern to reduce the complexity.


114-150: Consider verifying the new application context creation logic.

Creating a new AnnotationConfigApplicationContext for indexing each time may introduce unnecessary overhead. If performance or resource usage becomes a concern, consider reusing an existing context, or deferring initialization until truly needed.

dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java (3)

112-118: Consider lazy initialization for the Spring application context.

Performing the new AnnotationConfigApplicationContext in a static initializer or instance initializer might increase startup overhead. If performance or memory usage is a concern, consider lazy-loading these beans on demand.


734-744: Centralize or rename isTest() logic for clarity.

Hardcoding "jdbc:h2:mem:test" for detection can be brittle. Consider a configuration property or a dedicated environment check for test mode, especially if future test setups differ.


755-774: Include the original exception when re-throwing for better stack traces.

Currently, the code discards the original exception cause when throwing a new RuntimeException. Consider retaining it:

-    throw new RuntimeException("Cannot reindex the item with ID: " + item.getID() + " because: "
-            + e.getMessage());
+    throw new RuntimeException("Cannot reindex the item with ID: " + item.getID(), e);

This allows upstream handlers to accurately capture the root cause.

dspace-server-webapp/src/test/java/org/dspace/app/sword2/Swordv2IT.java (2)

93-104: Configuration overrides for SWORDv2 in the @before method.

Setting the SWORDv2 properties under test conditions is helpful to align integration tests with the local environment. Consider verifying that these config overrides do not inadvertently persist in other test classes, especially if parallel test execution is enabled.


189-229: depositItemWithEmbargo - Tests item upload with embargo.

This test ensures that a zipped embargoed item is accepted and returns HTTP 201 with the correct ATOM entry content type. Optionally, consider verifying the embargo policy details in the resulting item, but this is a good start.

dspace-server-webapp/src/test/java/org/dspace/app/rest/ResourcePolicyRestRepositoryIT.java (1)

1048-1106: Consider extending negative tests.
This new test method for creating a policy with a group appears correct. As a minor improvement, consider adding negative or edge-case scenarios (e.g., invalid group references).

dspace-api/src/main/java/org/dspace/authorize/ResourcePolicyServiceImpl.java (2)

306-312: Consider refactoring for DRY.
Repeatedly checking if (r.getdSpaceObject() instanceof Item) could be consolidated into a helper method (e.g., addEventModify(context, dSpaceObject)). This reduces duplication and centralizes event logic.


374-377: Repetitive item-event check.
We could reuse the same helper functionality here to streamline event firing for Item objects.

🛑 Comments failed to post (2)
dspace-server-webapp/pom.xml (1)

483-489: 💡 Verification agent

🧩 Analysis chain

Centralized json-smart version management.

Good approach to define a specific version of json-smart as a direct dependency after excluding it from multiple other dependencies. This resolves version conflicts between nimbus-jose-jwt and json-path.

However, it's worth verifying that 2.5.1 is the latest secure version.


🏁 Script executed:

#!/bin/bash
# Check if 2.5.1 is the latest version of json-smart and if there are any security advisories

# Check Maven Central for the latest version
echo "Latest json-smart version in Maven Central:"
curl -s "https://search.maven.org/solrsearch/select?q=g:net.minidev+AND+a:json-smart&rows=20&wt=json" | jq -r '.response.docs[0].latestVersion'

# Check for CVEs related to json-smart
echo -e "\nChecking for security vulnerabilities in json-smart:"
curl -s "https://services.nvd.nist.gov/rest/json/cves/2.0?keywordSearch=json-smart" | jq '.vulnerabilities[] | {cve: .cve.id, description: .cve.descriptions[0].value}'

Length of output: 1263


Action Required: Update json-smart Dependency Version

The current dependency version (2.5.1) in dspace-server-webapp/pom.xml is outdated and vulnerable. Verification indicates that:

  • Maven Central now lists 2.5.2 as the latest version.
  • Security advisories (e.g., CVE-2024-57699) confirm that versions up to 2.5.1 are susceptible to a DoS vulnerability.

Recommendations:

  • Update the dependency version to 2.5.2.
  • Verify that the new version resolves the identified security issue without breaking dependency exclusions.

Diff Snippet Update:

        <!-- Specify the version of json-smart we want to use.
             This resolves version conflicts and addresses a known security vulnerability. -->
        <dependency>
            <groupId>net.minidev</groupId>
            <artifactId>json-smart</artifactId>
            <version>2.5.2</version>
        </dependency>
dspace-api/src/main/java/org/dspace/content/packager/AbstractMETSIngester.java (1)

765-773: ⚠️ Potential issue

Improved policy handling during bitstream crosswalk operations.

This change preserves bitstream policies during the crosswalk process by storing them before the operation and reapplying them afterward. This ensures that authorization settings aren't lost when metadata is updated through the crosswalk operation.

Great enhancement that fixes a potential issue where bitstream permissions could be lost during package ingestion. The policies are now properly preserved and restored after the crosswalk operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

solr-reindexing-after-process-running
3 participants