Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outbox: Add upgrade script to populate outbox items #1175

Merged
merged 19 commits into from
Jan 27, 2025

Conversation

obenland
Copy link
Member

@obenland obenland commented Jan 16, 2025

Proposed changes:

  • Adds activitypub_upgrade event for async upgrades with locking.
  • Adds upgrade callbacks to create outbox items for posts and comments.
  • Adds custom add_to_outbox function that skips any federation considerations.
  • Adds tests for the async upgrade event, and outbox generation callbacks.

Other information:

  • Have you written new tests for your changes, if applicable?

Testing instructions:

  • Install the zip file of this branch on a test site and activate it.
  • Manually update the version compare in /includes/class-migration.php to a version of your choice (e.g. 4.7.3.1).
  • Manually update the plugin version in activitypub.php to the same version.
  • Go to /wp-admin/edit.php?post_type=ap_outbox on your test site to see the results.

@obenland obenland added [Status] In Progress Skip Changelog Disables the "Changelog Updated" action for PRs where changelog entries are not necessary. labels Jan 16, 2025
@obenland obenland requested a review from a team January 16, 2025 21:52
@obenland obenland self-assigned this Jan 16, 2025
@pfefferle
Copy link
Member

pfefferle commented Jan 17, 2025

The risk here is, that all added items will be queued for federation again and because these are quite a lot, it could cause quite some traffic. I think we have to find a way to add historical data without publishing them.

@obenland
Copy link
Member Author

obenland commented Jan 17, 2025

I was hoping the add_filter( 'pre_schedule_event', '__return_false' ); filter would prevent these federation jobs from being scheduled

@pfefferle
Copy link
Member

Oh, haven't seen that. True, but this might be only for the moment they were added... the state of the post will still be "pending for federation", so it will be handled at one time...

@obenland
Copy link
Member Author

da26595 adds a localized version of add_to_outbox() that skips all the federation business

@pfefferle
Copy link
Member

Does it make sense to have some of the functions in the collection, instead of the migration?

Should we have dedicated classes for bigger migrations, like for example Activitypub\Migration\Outbox?

@obenland obenland closed this Jan 21, 2025
@obenland obenland deleted the add/outbox-migration branch January 21, 2025 19:49
@obenland obenland restored the add/outbox-migration branch January 21, 2025 21:56
@obenland obenland reopened this Jan 21, 2025
@obenland
Copy link
Member Author

obenland commented Jan 21, 2025

Does it make sense to have some of the functions in the collection, instead of the migration?

Sure, what are you thinking?

Should we have dedicated classes for bigger migrations, like for example Activitypub\Migration\Outbox?

At some point I'd like to suggest renaming Migration to Upgrade to align with Core and free up the namespace for any future content migration work. Not sure we need separate classes for upgrade routines (yet). As a matter of fact, I created activitypub_upgrade to make it more reusable for async upgrades and eventually move the existing async routines over to it (reusing more code).

@obenland obenland force-pushed the add/outbox-migration branch from fea0686 to 45b3b2d Compare January 21, 2025 22:18
@obenland obenland marked this pull request as ready for review January 21, 2025 22:24
@obenland obenland requested a review from a team January 21, 2025 22:29
@mattwiebe
Copy link
Contributor

mattwiebe commented Jan 23, 2025

Ran the upgrade, it worked nicely.

Was just toying around with decoding the JSON content of one of the outbox items, like

wp post get 5944  --field=content | jq
jq: parse error: Invalid escape at line 1, column 7521

Uh-oh. So I try another couple of posts and they work fine. So it's something in this post. Here's the JSON and the original post content:

JSON

{"@context":["https://www.w3.org/ns/activitystreams",{"Hashtag":"as:Hashtag","sensitive":"as:sensitive"}],"id":"http://wp.test/?p=140","type":"Note","attachment":[],"attributedTo":"http://wp.test/?author=1","audience":"http://wp.test/?author=0","content":"u003Ch2u003EComparing Metau0026#8217;s CodeLlama to OpenAIu0026#8217;s Copilotu003C/h2u003Eu003Cpu003EAfter my u003Ca href=u0022https://mattwiebe.blog/2023/08/30/how-to-run-codellama-in-vscode-on-macos/u0022u003Elast postu003C/au003E about setting up CodeLlama, some colleagues have asked the million dollar question: how does CodeLlama compare to Copilot? My early answer is: u0026#8220;I donu0026#8217;t know yetu0026#8221; but hereu0026#8217;s a useful comparison.u003C/pu003Eu003Cpu003EI have little shell script that lets me interface with u003Ca href=u0022https://phacility.com/phabricator/arcanist/u0022u003EArcanistu003C/au003E (u003Ccodeu003Earcu003C/codeu003E), part of the Phabricator toolset we use internally. One of the (many) weird things about it vs something like Github is that your PRs (Diffs in Phabricator parlance) are independent from git branches. This is a problem when I need to update a Diff, but I donu0026#8217;t know what Iu0026#8217;ve been working with from the current branch. u003Ccodeu003Egit pushu003C/codeu003E is just not possible, I need to run u003Ccodeu003Earc diff --update DXXXXXu003C/codeu003E instead. But what is the ID?u003C/pu003Eu003Cpu003EThereu0026#8217;s already a command for this: u003Ccodeu003Earc whichu003C/codeu003E. But its output is messy. Iu0026#8217;m working on a Diff that adds u003Ca href=u0022https://github.com/Automattic/wordpress-activitypub/pull/395u0022u003Ethis plugin PRu003C/au003E to WordPress.com, fixing (inevitable) bugs along the way, and this is what u003Ccodeu003Earc whichu003C/codeu003E gives me:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003EREPOSITORYTo identify the repository associated with this working copy, arc followed this process:    Configuration value u0022repository.callsignu0022 is set to u0022WPGITu0022.    Found a unique matching repository.This working copy is associated with the WordPress.com repository.COMMIT RANGEIf you run 'arc diff', changes between the commit:    212461c993d9d7ad  [redacted commit message]...and the current working copy state will be sent to Differential, becauseit is the merge-base of 'origin/trunk' and HEAD, as specified by'git:merge-base(origin/trunk)' in your project 'base' configuration.You can see the exact changes that will be sent by running this command:    $ git diff 212461c993d9d7ad..HEADThese commits will be included in the diff:    60eb751ea35ecfe1  [redacted commit message]    55dc4536770138b0  ActivityPub: add Follow Me blockMATCHING REVISIONSThese Differential revisions match the changes in this working copy:    D120281 ActivityPub: add Follow Me block        Reason: Commit message for '55dc4536770138b0' has explicit 'Differential Revision'.Since exactly one revision in Differential matches this working copy, it willbe updated if you run 'arc diff'.u003C/preu003Eu003Cpu003E(That last line is a lie.) I already have a shell command that cuts out the noise and just provides me with the URL and Diff ID that I want:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Earc which | grep -E 'D[0-9]+\ ' -o | awk '{ printf u0022https://redacted.a8c.com/%s  %s\nu0022, $1, $1 }'u003C/preu003Eu003Cpu003EWhich for the above gives me:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Ehttps://redacted.a8c.com/D120281  D120281u003C/preu003Eu003Cpu003ECool. But I decided that Iu0026#8217;d stripped too much context, I wanted the Diff title after the ID as well, so that my output would be like this:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003EActivityPub: add Follow Me blockhttps://redacted.a8c.com/D120281  D120281u003C/preu003Eu003Cpu003EThat was hard because u003Ccodeu003Eawku003C/codeu003E separates on spaces and would only give me the first word of the title in my first attempt:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Earc which | grep -E 'D[0-9]+\ (.*)' -o | awk '{ printf u0022%s\nhttps://redacted.a8c.com/%s  %s\nu0022, $2, $1, $1 }'ActivityPub: https://redacted.a8c.com/D120281  D120281u003C/preu003Eu003Cpu003ESo I decided to ask both Copilot and CodeLlama (u003Ccodeu003E7b-instruct-q5_K_Mu003C/codeu003E) for help. I kept my wording a bit vague to see how they would do. u003C/pu003Eu003Cpu003ESpoiler: they both contributed something useful, and it was doubly interesting that CodeLlama told me I was doing it wrong.u003C/pu003Eu003Ch3 class=u0022wp-block-headingu0022u003EPromptu003C/h3u003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003EI have the following shell script, which parses an ID and the title out of arc which output.arc which | grep -E 'D[0-9]+ (.*)' -o | awk '{ printf u0022%s\nhttps://code.a8c.com/%s %s\nu0022, $2, $1, $1 }'arc which line that concerns me looks like:` D120281 ActivityPub: add Follow Me block`The trouble is that awk thinks $2 is only the first word of the remaining line, just u0022ActivityPubu0022. I want the whole line in the output. How can I achieve that?u003C/preu003Eu003Ch3 class=u0022wp-block-headingu0022u003ECopilot responseu003C/h3u003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003ETo include the entire title in the output, you can modify the awk command to print all fields starting from the second field. Here's the updated command:```arc which | grep -E 'D[0-9]+ (.*)' -o | awk '{ printf u0022%s\nhttps://code.a8c.com/%s  u0022, $2, $1; for (i=2; iu0026lt;=NF; i++) printf u0022%s u0022, $i; printf u0022\nu0022 }'```'In this updated command, we're using a for loop to print all fields starting from the second field. The NF variable represents the total number of fields in the input, so we're looping from the second field (i=2) to the last field (iu0026lt;=NF). We're then using printf to print each field separated by a space.This should output the entire title in the output, along with the ID and URL.u003C/preu003Eu003Ch3 class=u0022wp-block-headingu0022u003ECodeLlama responseu003C/h3u003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003ETo get the whole line in the output, you need to use a different approach with awk. Instead of using $2 to print out the second word of the remaining line, you can use $0 to print out the entire line. Here's an updated version of your shell script that uses awk to print out the whole line:```arc which | grep -E 'D[0-9]+\ (.*)' -o | awk '{ printf u0022https://code.a8c.com/%s  %s\nu0022, $1, $0 }'```In this version of the script, we use awk to print out the entire line ($0) as part of the URL that we want to output. We also use $1 to print out the identifier followed by one or more numbers. I hope this helps! Let me know if you have any questions.u003C/preu003Eu003Cpu003ENeither approach was exactly what I wanted, but I learned something from both and was able to use this to produce my desired outcome:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Earc which | grep -E 'D[0-9]+ (.*)' -o | awk '{ for (i=2; iu0026lt;=NF; i++) printf u0022%s u0022, $i; printf u0022\n %s  https://code.a8c.com/%s\nu0022, $1, $1; }'u003C/preu003Eu003Cpu003EIn any case, googling for code help is probably pretty much dead. And this is the first generation of these tools, which will only get better.u003C/pu003E","contentMap":{"en":"u003Ch2u003EComparing Metau0026#8217;s CodeLlama to OpenAIu0026#8217;s Copilotu003C/h2u003Eu003Cpu003EAfter my u003Ca href=u0022https://mattwiebe.blog/2023/08/30/how-to-run-codellama-in-vscode-on-macos/u0022u003Elast postu003C/au003E about setting up CodeLlama, some colleagues have asked the million dollar question: how does CodeLlama compare to Copilot? My early answer is: u0026#8220;I donu0026#8217;t know yetu0026#8221; but hereu0026#8217;s a useful comparison.u003C/pu003Eu003Cpu003EI have little shell script that lets me interface with u003Ca href=u0022https://phacility.com/phabricator/arcanist/u0022u003EArcanistu003C/au003E (u003Ccodeu003Earcu003C/codeu003E), part of the Phabricator toolset we use internally. One of the (many) weird things about it vs something like Github is that your PRs (Diffs in Phabricator parlance) are independent from git branches. This is a problem when I need to update a Diff, but I donu0026#8217;t know what Iu0026#8217;ve been working with from the current branch. u003Ccodeu003Egit pushu003C/codeu003E is just not possible, I need to run u003Ccodeu003Earc diff --update DXXXXXu003C/codeu003E instead. But what is the ID?u003C/pu003Eu003Cpu003EThereu0026#8217;s already a command for this: u003Ccodeu003Earc whichu003C/codeu003E. But its output is messy. Iu0026#8217;m working on a Diff that adds u003Ca href=u0022https://github.com/Automattic/wordpress-activitypub/pull/395u0022u003Ethis plugin PRu003C/au003E to WordPress.com, fixing (inevitable) bugs along the way, and this is what u003Ccodeu003Earc whichu003C/codeu003E gives me:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003EREPOSITORYTo identify the repository associated with this working copy, arc followed this process:    Configuration value u0022repository.callsignu0022 is set to u0022WPGITu0022.    Found a unique matching repository.This working copy is associated with the WordPress.com repository.COMMIT RANGEIf you run 'arc diff', changes between the commit:    212461c993d9d7ad  [redacted commit message]...and the current working copy state will be sent to Differential, becauseit is the merge-base of 'origin/trunk' and HEAD, as specified by'git:merge-base(origin/trunk)' in your project 'base' configuration.You can see the exact changes that will be sent by running this command:    $ git diff 212461c993d9d7ad..HEADThese commits will be included in the diff:    60eb751ea35ecfe1  [redacted commit message]    55dc4536770138b0  ActivityPub: add Follow Me blockMATCHING REVISIONSThese Differential revisions match the changes in this working copy:    D120281 ActivityPub: add Follow Me block        Reason: Commit message for '55dc4536770138b0' has explicit 'Differential Revision'.Since exactly one revision in Differential matches this working copy, it willbe updated if you run 'arc diff'.u003C/preu003Eu003Cpu003E(That last line is a lie.) I already have a shell command that cuts out the noise and just provides me with the URL and Diff ID that I want:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Earc which | grep -E 'D[0-9]+\ ' -o | awk '{ printf u0022https://redacted.a8c.com/%s  %s\nu0022, $1, $1 }'u003C/preu003Eu003Cpu003EWhich for the above gives me:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Ehttps://redacted.a8c.com/D120281  D120281u003C/preu003Eu003Cpu003ECool. But I decided that Iu0026#8217;d stripped too much context, I wanted the Diff title after the ID as well, so that my output would be like this:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003EActivityPub: add Follow Me blockhttps://redacted.a8c.com/D120281  D120281u003C/preu003Eu003Cpu003EThat was hard because u003Ccodeu003Eawku003C/codeu003E separates on spaces and would only give me the first word of the title in my first attempt:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Earc which | grep -E 'D[0-9]+\ (.*)' -o | awk '{ printf u0022%s\nhttps://redacted.a8c.com/%s  %s\nu0022, $2, $1, $1 }'ActivityPub: https://redacted.a8c.com/D120281  D120281u003C/preu003Eu003Cpu003ESo I decided to ask both Copilot and CodeLlama (u003Ccodeu003E7b-instruct-q5_K_Mu003C/codeu003E) for help. I kept my wording a bit vague to see how they would do. u003C/pu003Eu003Cpu003ESpoiler: they both contributed something useful, and it was doubly interesting that CodeLlama told me I was doing it wrong.u003C/pu003Eu003Ch3 class=u0022wp-block-headingu0022u003EPromptu003C/h3u003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003EI have the following shell script, which parses an ID and the title out of arc which output.arc which | grep -E 'D[0-9]+ (.*)' -o | awk '{ printf u0022%s\nhttps://code.a8c.com/%s %s\nu0022, $2, $1, $1 }'arc which line that concerns me looks like:` D120281 ActivityPub: add Follow Me block`The trouble is that awk thinks $2 is only the first word of the remaining line, just u0022ActivityPubu0022. I want the whole line in the output. How can I achieve that?u003C/preu003Eu003Ch3 class=u0022wp-block-headingu0022u003ECopilot responseu003C/h3u003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003ETo include the entire title in the output, you can modify the awk command to print all fields starting from the second field. Here's the updated command:```arc which | grep -E 'D[0-9]+ (.*)' -o | awk '{ printf u0022%s\nhttps://code.a8c.com/%s  u0022, $2, $1; for (i=2; iu0026lt;=NF; i++) printf u0022%s u0022, $i; printf u0022\nu0022 }'```'In this updated command, we're using a for loop to print all fields starting from the second field. The NF variable represents the total number of fields in the input, so we're looping from the second field (i=2) to the last field (iu0026lt;=NF). We're then using printf to print each field separated by a space.This should output the entire title in the output, along with the ID and URL.u003C/preu003Eu003Ch3 class=u0022wp-block-headingu0022u003ECodeLlama responseu003C/h3u003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003ETo get the whole line in the output, you need to use a different approach with awk. Instead of using $2 to print out the second word of the remaining line, you can use $0 to print out the entire line. Here's an updated version of your shell script that uses awk to print out the whole line:```arc which | grep -E 'D[0-9]+\ (.*)' -o | awk '{ printf u0022https://code.a8c.com/%s  %s\nu0022, $1, $0 }'```In this version of the script, we use awk to print out the entire line ($0) as part of the URL that we want to output. We also use $1 to print out the identifier followed by one or more numbers. I hope this helps! Let me know if you have any questions.u003C/preu003Eu003Cpu003ENeither approach was exactly what I wanted, but I learned something from both and was able to use this to produce my desired outcome:u003C/pu003Eu003Cpre class=u0022wp-block-syntaxhighlighter-codeu0022u003Earc which | grep -E 'D[0-9]+ (.*)' -o | awk '{ for (i=2; iu0026lt;=NF; i++) printf u0022%s u0022, $i; printf u0022\n %s  https://code.a8c.com/%s\nu0022, $1, $1; }'u003C/preu003Eu003Cpu003EIn any case, googling for code help is probably pretty much dead. And this is the first generation of these tools, which will only get better.u003C/pu003E"},"icon":{"type":"Image","url":"http://wp.test/wp-content/uploads/2024/09/image-3-150x150.jpg","mediaType":"image/jpeg"},"published":"2023-08-31T16:41:15Z","tag":[],"url":"http://wp.test/2023/08/comparing-metas-codellama-to-openais-copilot/","to":["https://www.w3.org/ns/activitystreams#Public"],"cc":["http://wp.test/wp-json/activitypub/1.0/actors/1/followers"],"mediaType":"text/html","replies":{"id":"http://wp.test/wp-json/activitypub/1.0/posts/140/replies","type":"Collection","first":{"id":"http://wp.test/wp-json/activitypub/1.0/posts/140/replies?page=0","type":"CollectionPage","partOf":"http://wp.test/wp-json/activitypub/1.0/posts/140/replies","items":[]}},"sensitive":false}

Post Content (had to use a gist because the post itself was messing up formatting here with ``` usage)

https://gist.github.com/mattwiebe/e644bb43d7bea38aa781acba6b89ae8a

The error happens inside the content and contentMap.en fields, so it's definitely something in our character encoding.

@obenland obenland force-pushed the add/outbox-migration branch from 2f7c2ff to 32e3213 Compare January 24, 2025 13:59
@obenland
Copy link
Member Author

@mattwiebe Could I ask you to try again, now that the slashing PR has been merged and was rebased into this one?

@mattwiebe
Copy link
Contributor

@mattwiebe Could I ask you to try again, now that the slashing PR has been merged and was rebased into this one?

Yup I tried it and it works nicely.

But!

It appears to not be doing the offset properly, it's just processing the first 50 posts over and over again. It was doing this both before and after my changes above, but it also seems to be failing some relevant tests. Eg always this entry in the cron option:

  array (
    'activitypub_upgrade' =>
    array (
      '578ee61f88becab0f6a24e46a3b16376' =>
      array (
        'schedule' => false,
        'args' =>
        array (
          0 => 'create_post_outbox_items',
          'batch_size' => 50,
          'offset' => 50,
        ),
      ),
    ),
  ),

@@ -173,7 +173,7 @@ public static function maybe_migrate() {
if ( \version_compare( $version_from_db, '4.7.3', '<' ) ) {
add_action( 'init', 'flush_rewrite_rules', 20 );
}
if ( \version_compare( $version_from_db, 'unreleased', '<' ) ) {
if ( \version_compare( $version_from_db, '4.7.4', '<' ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why adding a version number?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aaaah, you needed it for testing!?! ;)

* @param string $callback The callback to run.
* @param mixed ...$args The arguments to pass to the callback.
*/
public static function async_upgrade( $callback, ...$args ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static function async_upgrade( $callback, ...$args ) {
public static function async_upgrade( $callback ) {

Should we get rid of this now that we're calling func_get_args() below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I wonder if func_get_args is even needed…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not be with the new structure. But this works :)

Copy link
Contributor

@mattwiebe mattwiebe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that the async_upgrade gives us a proof-of-concept for later doing other rolling operations later, like batching in send_activity_to_followers() - at some point there will be too many inboxes, and we'd also like to spread out the updates to prevent DDOS spikes.

Everything is working as I'd expect, now.

@obenland obenland merged commit 40c9535 into add/outbox-collection Jan 27, 2025
20 checks passed
@obenland obenland deleted the add/outbox-migration branch January 27, 2025 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Skip Changelog Disables the "Changelog Updated" action for PRs where changelog entries are not necessary.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants