-
Notifications
You must be signed in to change notification settings - Fork 815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sitemaps: remove post limit on sitemap #5503
Conversation
modules/sitemaps/sitemap-buffer.php
Outdated
$time | ||
) { | ||
$this->item_capacity = $item_limit; | ||
$this->byte_capacity = $byte_limit - mb_strlen($header) - mb_strlen($footer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest checking if mb_strlen
exists before to use it, as some hosts have it disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further inspection mb_strlen
was the wrong thing to use here anyway- it specifically counts characters instead of bytes, which is what we need. I changed this to use just strlen
instead, although reading some documentation I'm concerned that this will not do the right thing either as some hosts alias strlen
to mb_strlen
.
modules/sitemaps/sitemap-buffer.php
Outdated
* @return bool True if the append succeeded, False if not. | ||
*/ | ||
public function try_to_add_item($item) { | ||
if ($this->item_capacity <= 0 || $this->byte_capacity - mb_strlen($item) <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to check your use of spaces and yoda conditions across your changes, to match WordPress' coding standards>
You can use Code Sniffer rules to make that process easier.
modules/sitemaps/sitemaps.php
Outdated
*/ | ||
return apply_filters( 'jetpack_sitemap_content_type', 'text/xml' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you deprecate this filter? If so, you can add it to the list here.
modules/sitemaps/sitemaps.php
Outdated
$type_name, | ||
array( | ||
'labels' => array( 'name' => $label ), | ||
'public' => false, // Set to true to aid debugging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably remove this comment now.
modules/sitemaps/sitemaps.php
Outdated
function callback_add_sitemap_schedule ($schedules) { | ||
$schedules['sitemap-interval'] = array( | ||
'interval' => Jetpack_Sitemap_Manager::SITEMAP_INTERVAL, | ||
'display' => __('Every Minute') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would need a textdomain so it can be translated.
modules/sitemaps/sitemaps.php
Outdated
* @since 4.5.0 | ||
*/ | ||
public function callback_action_do_robotstxt () { | ||
echo 'Sitemap: ' . home_url() . '/sitemap.xml' . PHP_EOL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll probably need to take the jetpack_sitemap_location
filter into account, since some people may use the filter to define a custom sitemap location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The jetpack_sitemap_location
filter is now respected. The way it's used did change a bit though: long story short, instead of setting the location as /path/to/the/sitemap.xml
we only set the location to /path/to/the/
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the long story? :) Why is it important that jetpack_sitemap_location
changed behavior? Is it possible to implement this without changing the behavior?
modules/sitemaps/sitemaps.php
Outdated
*/ | ||
if ( apply_filters( 'jetpack_sitemap_skip_post', false, $post ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to keep that filter around somehow, it's useful for some site owners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The jetpack_sitemap_skip_post
filter is now respected. As far as I can tell there is no change in behavior compared to the old version.
modules/sitemaps/sitemaps.php
Outdated
*/ | ||
$tree = apply_filters( 'jetpack_print_sitemap', $tree, $latest_mod ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some site owners already use that filter, so I think it would be nice to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of 7afe5e7 the jetpack_print_sitemap
filter is respected.
I'd be interested in seeing what this is used for. If this filter is kept there probably need to be analogous filters for image sitemaps and sitemap indices.
modules/sitemaps/sitemaps.php
Outdated
*/ | ||
$discover_news_sitemap = apply_filters( 'jetpack_news_sitemap_generate', true ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another filter to add to the list of deprecated hooks.
Hi Nathan, it's working now 🎉 I tried accessing |
modules/sitemaps/sitemaps.php
Outdated
$post_latest_mod = $latest_comment_datetime; | ||
|
||
while ( false === $buffer->is_full() ) { | ||
$posts = $this->query_posts_after_id( $last_post_id, 1000, $log ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call uses 3 parameters but query_posts_after_id
only accepts 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this by making the logger object a class member, rather than passing one around.
margin: 0; | ||
} | ||
|
||
#description { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The selectors #description, #content, .odd, #footer, img.thumbnail
are never used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#description, .odd, #content, #footer, img.thumbnail
In the browser sitemap.xml
is transformed by an XML stylesheet, these are used in the transformed HTML. You can see the stylesheet at sitemap.xsl
.
modules/sitemaps/sitemaps.php
Outdated
* @param string $label The label of the new post type. | ||
* @param string $slug The slug of the new post type. | ||
*/ | ||
function register_sitemap_data( $type_name, $label, $slug ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the advantage of creating this inline function, against a foreach
that goes through array elements defined as
(
'jp_sitemap_master',
'Sitemap Master',
'jetpack-sitemap-master'
)
and calls register_post_type
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of one. :) The latest commit uses an array with foreach
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch that- sitemaps are now stored in a custom database table, so this code is not relevant anymore.
modules/sitemaps/sitemaps.php
Outdated
* @since 4.5.0 | ||
*/ | ||
public function callback_action_do_robotstxt() { | ||
$url = $this->the_jetpack_sitemap_uri() . '/sitemap.xml'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we handle default permalinks and PATHINFO permalinks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the URL of robots.txt
depend on the permalink settings? (Edit: I just checked and omg it does. I don't know what's real anymore :))
This is doable, though I have to confess I'm not sure what the benefit is given that sitemaps are meant for robots. But then so is robots.txt
.
Edit again- this module hijacks the usual URL routing by looking at $_SERVER['REQUEST_URI']
directly. Does this cause problems with any of the permalink schemes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest commit handles plain and PATHINFO permalinks. Plain permalinks are served at /?jetpack-sitemap=sitemap.xml
.
modules/sitemaps/sitemaps.php
Outdated
} | ||
// Register 'jp_sitemap_master' post type. | ||
register_sitemap_data( | ||
'jp_sitemap_master', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to register only the post type that is needed at a certain time? So for example if I'm seeing the image sitemap, I can't skip registering this?
Why did you choose a custom post type to model, well, other post types? What is the advantage of this vs rows in a custom database table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's really needed is a way to store (1) the generated sitemap xml, with (2) a name, (3) a unique numeric ID, and (4) a timestamp, which are needed during the generation phase. Using a custom post type seemed like a natural fit for this. (Also this is how msm-sitemap does it, if I understand correctly.)
But also I did not realize that using a custom database table was an option. That might be a better solution, because sitemaps are sort of "out of band" compared to posts. It should be easy to switch, because the actual database queries are abstracted behind the query_
methods. I will have to look into this.
As for registering the post types, it's possible that this is not necessary at all. What exactly does register_post_type
do and when is it required? Can we store custom post types in the database without having to register them? Because the sitemap post types are not needed in the admin panel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest commit uses a custom database table to store the sitemap data.
So |
@eliorivero : To figure out why
|
Hi Natha, sorry for the delay. I've tested this now and it works perfectly with pretty, default and pathinfo permalinks. Great work! 🎉 One concern is the name of the sitemaps with the numbers in them. Maybe the main one could be named |
modules/sitemaps/sitemaps.php
Outdated
} | ||
public function __construct() { | ||
$this->librarian = new Jetpack_Sitemap_Librarian(); | ||
$this->logger = new Jetpack_Sitemap_Logger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this can only be instantiated when WP_DEBUG
is defined and set to true? It's better not to load this every time during production.
modules/sitemaps/sitemap-builder.php
Outdated
|
||
require_once dirname( __FILE__ ) . '/sitemap-buffer.php'; | ||
require_once dirname( __FILE__ ) . '/sitemap-librarian.php'; | ||
require_once dirname( __FILE__ ) . '/sitemap-logger.php'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file and the class instantiation in line 87 should be done only when WP_DEBUG is defined and set to true.
modules/sitemaps/sitemap-builder.php
Outdated
* @since 4.5.0 | ||
* | ||
* @param array $namespaces Associative array with namespaces and namespace URIs. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This inline doc for the filter must be moved one line down, right before apply_filters
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same mistake was in two more places- all are fixed now.
return true; | ||
} else { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditional can be simplified to return 1 === $num_rows_deleted;
/** | ||
* Get the most recent timestamp among approved comments for the given post_id. | ||
* | ||
* @module sitemaps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing @since 4.5.0
instead of @module sitemaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every method should now have a @since
annotation.
modules/sitemaps/sitemap-stylist.php
Outdated
|
||
$generated_by = Jetpack_Sitemap_Stylist::sanitize_with_links( | ||
__( | ||
'<em>Generated</em> by <a href="%s" target="_blank">Jetpack for WordPress</a>', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jetpack_Sitemap_Stylist::sanitize_with_links
strips out the em
tag. Just remove the em
.
modules/sitemaps/sitemap-finder.php
Outdated
* @since 4.5.0 | ||
* | ||
* @param string $filename The filename of the sitemap. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing @return
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out there were lots of missing @return
s- I checked again and every function that returns something should now have a @return
annotation.
modules/sitemaps/sitemap-finder.php
Outdated
public function construct_sitemap_url( $filename ) { | ||
global $wp_rewrite; | ||
|
||
$location = get_option( 'jetpack_sitemap_location', '' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this benefit from using Jetpack_Options::get_option_and_ensure_autoload( 'jetpack_sitemap_location', '' )
?
How about others get_option
calls in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading through the source of get_option
, it looks like a bunch of stuff happens when get_option
is called on a nonexistent name- stuff that is unnecessary when we provide a default value. So I imagine there will be some benefit to storing the default, especially for the calls that happen many times (which is all of them).
@mdawaffe Okay- I think I'm done. There are always improvements to be made but I've hit the point of diminishing returns.
|
That's all that I mean. How does the cade cache/store/otherwise preserve the stuff it generates? What stuff is cached? For how long? How is the cache invalidated, or is it? What stuff is not cached and always generated on the fly? |
This is also copied to the P2 post: This module deals with data of a few different kinds, that are cached in different ways. There's
Sitemap pages are stored in the posts table using several custom post types. This storage doesn't really act like a cache, because we don't worry about whether the data is stale. These files are stored indefinitely (until replaced by an updated version by the cron job). The news sitemap page (singular) is cached using a transient. It is invalidated whenever a post is published but not when a comment is posted (see callback_action_flush_news_sitemap_cache), and has a lifetime of JP_NEWS_SITEMAP_INTERVAL (set to 12 hours) (see news_sitemap_xml). The sitemap state, stored in an option and used to keep track of what needs to be generated next, isn't really cached either. But it is locked to prevent a race condition, and this lock (stored as a transient) has a lifetime of JP_SITEMAP_LOCK_INTERVAL (set to 15 minutes). The sitemap XSL files are generated fresh on each request. These would be constant if not for some translated UI strings. |
Looks good and tests well. @jeherve, can you make sure that we include this in the testing instructions to make sure people test it out thoroughly? |
* Readme: remove old release and add skeleton for 4.8. * Changelog: add #6572 * Changelog: add #6567 * Changelog: add #6542 * Changelog: add #6527 * Changelog: add #6508 * Changelog: add #6478 * Changelog: add #6477 * Changelog: add #6249 * Update stable version and remove old version from readme. * Changelog: add 4.7.1 to changelog. * Readme: add new contributor. * Sync: update docblock @SInCE version. Related: #6053 * Changelog: add release post. * changelog: add #6053 * Changelog: add #6413 * Changelog: add #6482 * Changelog: add #6584 * Changelog add #6603 * Changelog: add #6606 * Changelog: add #6611 * Changelog: add #6635 * Changelog: add #6639 * Changelog: add #6684 * Changelog: add #6710 * Changelog: add #6711 * Changelog: add #5461 * Testing list: update Settings UI feedback prompt. Props @MichaelArestad * Changelog: add #6789 * Changelog: add #6778 * Changelog: add #6777 * Changelog: add #6775 * Changelog: add #6755 * Changelog: add #6731 * Changelog: add #6721 * Changelog: add #6705 * Changelog: add #6702 * Changelog: add #6671 * Changelog: add #6637 * Changelog: add #6582 * Changelog: add #6566 * Changelog: add #6555 * Changelog: add #6529 * Changelog: add #6344 * Changelog: add #5763 * Changelog: add #5503 * Changelog: update #6637 changelog. @see 40e115c#commitcomment-21523982 * Changelog: add #6699 * Changelog: add #6632 * Changelog: add #6769 * Changelog: add #6707 * Changelog: add #6590
Addresses #3314: Remove 1000 post limit on sitemaps.
Addresses #3750: Support video sitemaps (in progress)
Changes proposed in this Pull Request:
Replace sitemap generating code.
Note that this PR is an update of #5394, which became broken because I am awesome at git
Documentation is on the Team Luna P2: p7rd6c-B0-p2
Also some design ideas: p7rd6c-fS-p2
Testing instructions:
example.com/news-sitemap.xml
.example.com/sitemap.xml
. (It may take a refresh or two the first time.)example.com/?jetpack-sitemap=sitemap.xml
andexample.com/?jetpack-sitemap=news-sitemap.xml
These changes allow for arbitrarily large sitemap trees per the spec. To test this functionality without having to create an arbitrarily large site, adjust the constant
SITEMAP_MAX_ITEMS
inmodules/sitemaps/sitemap-buffer.php
, (5 is a good value.) This constant determines the maximum number of URLs on a single sitemap, so that if it's set to 5 and we have 6 posts/pages then the tree structure kicks in.The constant
SITEMAP_INTERVAL
inmodules/sitemaps/sitemaps.php
is the minimum number of seconds between rebuilds; it's currently set to 60 for testing.Past related issues/PRs:
example.com/index.php/sitemap.xml
.ent2ncr
to avoid this bug.Proposed changelog entry for your changes:
New filters:
jetpack_sitemap_image_skip_post
: For selectively excluding images from the sitemap.jetpack_sitemap_image_sitemap_item
: For manipulating the XML of image entries in the sitemap.jetpack_sitemap_video_skip_post
: For selectively excluding videos from the sitemap.jetpack_sitemap_video_sitemap_item
: For manipulating the XML of video entries in the sitemap.jetpack_sitemap_news_ns
: Add XML namespaces for the news sitemap.jetpack_sitemap_image_ns
: Add XML namespaces for image sitemaps.Filters with changed behavior:
jetpack_sitemap_location
: The old version had only two sitemap URLs to keep track of, so it was feasible to make both of them configurable. This version keeps track of two URLs and 6 indexed families of unknown size. Do we make them all configurable?gulp js:hint
before to commit your changes. It will allow you to detect errors in Javascript files.