Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sitemaps: remove post limit on sitemap #5503

Merged
merged 59 commits into from
Mar 27, 2017
Merged

Conversation

nbloomf
Copy link
Contributor

@nbloomf nbloomf commented Nov 3, 2016

Addresses #3314: Remove 1000 post limit on sitemaps.
Addresses #3750: Support video sitemaps (in progress)

Changes proposed in this Pull Request:

Replace sitemap generating code.

Note that this PR is an update of #5394, which became broken because I am awesome at git

Documentation is on the Team Luna P2: p7rd6c-B0-p2
Also some design ideas: p7rd6c-fS-p2

Testing instructions:

  1. Create some posts (see FakerPress). Make sure some of them have a recent (<=2 day old) timestamp.
  2. Activate the sitemaps module of Jetpack.
  3. Set permalinks to "pretty" (doesn't matter what structure)
  4. Go to example.com/news-sitemap.xml.
  5. Go to example.com/sitemap.xml. (It may take a refresh or two the first time.)
  6. Set permalinks to "plain"
  7. Go to example.com/?jetpack-sitemap=sitemap.xml and example.com/?jetpack-sitemap=news-sitemap.xml
  8. Add some images and videos to the media library and repeat steps 3--7.

These changes allow for arbitrarily large sitemap trees per the spec. To test this functionality without having to create an arbitrarily large site, adjust the constant SITEMAP_MAX_ITEMS in modules/sitemaps/sitemap-buffer.php, (5 is a good value.) This constant determines the maximum number of URLs on a single sitemap, so that if it's set to 5 and we have 6 posts/pages then the tree structure kicks in.

The constant SITEMAP_INTERVAL in modules/sitemaps/sitemaps.php is the minimum number of seconds between rebuilds; it's currently set to 60 for testing.

Past related issues/PRs:

Proposed changelog entry for your changes:

New filters:

  • jetpack_sitemap_image_skip_post: For selectively excluding images from the sitemap.
  • jetpack_sitemap_image_sitemap_item: For manipulating the XML of image entries in the sitemap.
  • jetpack_sitemap_video_skip_post: For selectively excluding videos from the sitemap.
  • jetpack_sitemap_video_sitemap_item: For manipulating the XML of video entries in the sitemap.
  • jetpack_sitemap_news_ns: Add XML namespaces for the news sitemap.
  • jetpack_sitemap_image_ns: Add XML namespaces for image sitemaps.

Filters with changed behavior:

  • jetpack_sitemap_location: The old version had only two sitemap URLs to keep track of, so it was feasible to make both of them configurable. This version keeps track of two URLs and 6 indexed families of unknown size. Do we make them all configurable?

@nbloomf nbloomf changed the title Replace sitemap module In progress: Remove post limit on sitemap Nov 3, 2016
@jeherve jeherve added [Type] Enhancement Changes to an existing feature — removing, adding, or changing parts of it Trial Project [Feature] Sitemaps [Status] Needs Review This PR is ready for review. labels Nov 4, 2016
@jeherve jeherve added this to the 4.5 milestone Nov 4, 2016
$time
) {
$this->item_capacity = $item_limit;
$this->byte_capacity = $byte_limit - mb_strlen($header) - mb_strlen($footer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest checking if mb_strlen exists before to use it, as some hosts have it disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further inspection mb_strlen was the wrong thing to use here anyway- it specifically counts characters instead of bytes, which is what we need. I changed this to use just strlen instead, although reading some documentation I'm concerned that this will not do the right thing either as some hosts alias strlen to mb_strlen.

* @return bool True if the append succeeded, False if not.
*/
public function try_to_add_item($item) {
if ($this->item_capacity <= 0 || $this->byte_capacity - mb_strlen($item) <= 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to check your use of spaces and yoda conditions across your changes, to match WordPress' coding standards>

You can use Code Sniffer rules to make that process easier.

*/
return apply_filters( 'jetpack_sitemap_content_type', 'text/xml' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you deprecate this filter? If so, you can add it to the list here.

$type_name,
array(
'labels' => array( 'name' => $label ),
'public' => false, // Set to true to aid debugging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably remove this comment now.

function callback_add_sitemap_schedule ($schedules) {
$schedules['sitemap-interval'] = array(
'interval' => Jetpack_Sitemap_Manager::SITEMAP_INTERVAL,
'display' => __('Every Minute')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need a textdomain so it can be translated.

* @since 4.5.0
*/
public function callback_action_do_robotstxt () {
echo 'Sitemap: ' . home_url() . '/sitemap.xml' . PHP_EOL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably need to take the jetpack_sitemap_location filter into account, since some people may use the filter to define a custom sitemap location.

Copy link
Contributor Author

@nbloomf nbloomf Nov 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jetpack_sitemap_location filter is now respected. The way it's used did change a bit though: long story short, instead of setting the location as /path/to/the/sitemap.xml we only set the location to /path/to/the/.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the long story? :) Why is it important that jetpack_sitemap_location changed behavior? Is it possible to implement this without changing the behavior?

*/
if ( apply_filters( 'jetpack_sitemap_skip_post', false, $post ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to keep that filter around somehow, it's useful for some site owners.

Copy link
Contributor Author

@nbloomf nbloomf Nov 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jetpack_sitemap_skip_post filter is now respected. As far as I can tell there is no change in behavior compared to the old version.

*/
$tree = apply_filters( 'jetpack_print_sitemap', $tree, $latest_mod );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some site owners already use that filter, so I think it would be nice to keep it.

Copy link
Contributor Author

@nbloomf nbloomf Nov 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of 7afe5e7 the jetpack_print_sitemap filter is respected.

I'd be interested in seeing what this is used for. If this filter is kept there probably need to be analogous filters for image sitemaps and sitemap indices.

*/
$discover_news_sitemap = apply_filters( 'jetpack_news_sitemap_generate', true );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another filter to add to the list of deprecated hooks.

@eliorivero
Copy link
Contributor

Hi Nathan, it's working now 🎉

I tried accessing http://localhost/automattic/?jetpack-sitemap=true and couldn't get it to show. The previous sitemap was working with default permalinks, can this one work too? Or maybe it's in another location?

$post_latest_mod = $latest_comment_datetime;

while ( false === $buffer->is_full() ) {
$posts = $this->query_posts_after_id( $last_post_id, 1000, $log );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call uses 3 parameters but query_posts_after_id only accepts 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this by making the logger object a class member, rather than passing one around.

margin: 0;
}

#description {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selectors #description, #content, .odd, #footer, img.thumbnail are never used.

Copy link
Contributor Author

@nbloomf nbloomf Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#description, .odd, #content, #footer, img.thumbnail

In the browser sitemap.xml is transformed by an XML stylesheet, these are used in the transformed HTML. You can see the stylesheet at sitemap.xsl.

* @param string $label The label of the new post type.
* @param string $slug The slug of the new post type.
*/
function register_sitemap_data( $type_name, $label, $slug ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of creating this inline function, against a foreach that goes through array elements defined as

(
    'jp_sitemap_master',
    'Sitemap Master',
    'jetpack-sitemap-master'
)

and calls register_post_type directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of one. :) The latest commit uses an array with foreach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that- sitemaps are now stored in a custom database table, so this code is not relevant anymore.

* @since 4.5.0
*/
public function callback_action_do_robotstxt() {
$url = $this->the_jetpack_sitemap_uri() . '/sitemap.xml';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we handle default permalinks and PATHINFO permalinks?

Copy link
Contributor Author

@nbloomf nbloomf Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the URL of robots.txt depend on the permalink settings? (Edit: I just checked and omg it does. I don't know what's real anymore :))

This is doable, though I have to confess I'm not sure what the benefit is given that sitemaps are meant for robots. But then so is robots.txt.

Edit again- this module hijacks the usual URL routing by looking at $_SERVER['REQUEST_URI'] directly. Does this cause problems with any of the permalink schemes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit handles plain and PATHINFO permalinks. Plain permalinks are served at /?jetpack-sitemap=sitemap.xml.

}
// Register 'jp_sitemap_master' post type.
register_sitemap_data(
'jp_sitemap_master',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to register only the post type that is needed at a certain time? So for example if I'm seeing the image sitemap, I can't skip registering this?

Why did you choose a custom post type to model, well, other post types? What is the advantage of this vs rows in a custom database table?

Copy link
Contributor Author

@nbloomf nbloomf Nov 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's really needed is a way to store (1) the generated sitemap xml, with (2) a name, (3) a unique numeric ID, and (4) a timestamp, which are needed during the generation phase. Using a custom post type seemed like a natural fit for this. (Also this is how msm-sitemap does it, if I understand correctly.)

But also I did not realize that using a custom database table was an option. That might be a better solution, because sitemaps are sort of "out of band" compared to posts. It should be easy to switch, because the actual database queries are abstracted behind the query_ methods. I will have to look into this.

As for registering the post types, it's possible that this is not necessary at all. What exactly does register_post_type do and when is it required? Can we store custom post types in the database without having to register them? Because the sitemap post types are not needed in the admin panel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit uses a custom database table to store the sitemap data.

@eliorivero
Copy link
Contributor

So sitemap.xml works, when I try to access sitemap-1.xml and when I try to access image-sitemap-1.xml, it fails.

@nbloomf
Copy link
Contributor Author

nbloomf commented Nov 14, 2016

@eliorivero : To figure out why sitemap-1.xml and image-sitemap-1.xml fail, I have some questions.

  1. Is the failure a WordPress 404, a blank page, or something else?
  2. Could you send me the contents of sitemap.xml (as well as it's location)?
  3. Could you send me the contents of your debug.log?

@eliorivero
Copy link
Contributor

Hi Natha, sorry for the delay. I've tested this now and it works perfectly with pretty, default and pathinfo permalinks. Great work! 🎉

One concern is the name of the sitemaps with the numbers in them. Maybe the main one could be named sitemap-index.xml and the children sitemaps could be named sitemap.xml and image-sitemap.xml.

}
public function __construct() {
$this->librarian = new Jetpack_Sitemap_Librarian();
$this->logger = new Jetpack_Sitemap_Logger();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can only be instantiated when WP_DEBUG is defined and set to true? It's better not to load this every time during production.


require_once dirname( __FILE__ ) . '/sitemap-buffer.php';
require_once dirname( __FILE__ ) . '/sitemap-librarian.php';
require_once dirname( __FILE__ ) . '/sitemap-logger.php';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file and the class instantiation in line 87 should be done only when WP_DEBUG is defined and set to true.

* @since 4.5.0
*
* @param array $namespaces Associative array with namespaces and namespace URIs.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This inline doc for the filter must be moved one line down, right before apply_filters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same mistake was in two more places- all are fixed now.

return true;
} else {
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditional can be simplified to return 1 === $num_rows_deleted;

/**
* Get the most recent timestamp among approved comments for the given post_id.
*
* @module sitemaps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing @since 4.5.0 instead of @module sitemaps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every method should now have a @since annotation.


$generated_by = Jetpack_Sitemap_Stylist::sanitize_with_links(
__(
'<em>Generated</em> by <a href="%s" target="_blank">Jetpack for WordPress</a>',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jetpack_Sitemap_Stylist::sanitize_with_links strips out the em tag. Just remove the em.

* @since 4.5.0
*
* @param string $filename The filename of the sitemap.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing @return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out there were lots of missing @returns- I checked again and every function that returns something should now have a @return annotation.

public function construct_sitemap_url( $filename ) {
global $wp_rewrite;

$location = get_option( 'jetpack_sitemap_location', '' );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this benefit from using Jetpack_Options::get_option_and_ensure_autoload( 'jetpack_sitemap_location', '' )?
How about others get_option calls in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through the source of get_option, it looks like a bunch of stuff happens when get_option is called on a nonexistent name- stuff that is unnecessary when we provide a default value. So I imagine there will be some benefit to storing the default, especially for the calls that happen many times (which is all of them).

@ebinnion ebinnion modified the milestones: 4.6, 4.5 Dec 7, 2016
@nbloomf
Copy link
Contributor Author

nbloomf commented Feb 21, 2017

@mdawaffe Okay- I think I'm done. There are always improvements to be made but I've hit the point of diminishing returns.

  • The P2 post is here: p7rd6c-B0-p2
  • What do you mean by caching strategy? This code does some hand rolled caching of the sitemap files in the database but I have a feeling you're referring to something else.
  • For the list of constant values I will refer to the sitemap-constants.php file lest this comment become out of date; they are all there, and I set them to "recommended" values.
  • As for what's left, this is pasted from the P2 post: (roughly in order from most to least important)
    • Testing on real servers!
    • Change the @SInCE annotations from 4.7.0 to whatever.
    • Make sure the list of video formats in query_videos_after_id is complete.
    • Nail down the jetpack_sitemap_location filter; I'm still not sure what this "should" do, but at the moment it doesn't do what the old version did.
    • Integrate the video sitemap with VideoPress (thumbnails etc.; check the video sitemap extension). At least some of the code for this will go in video_post_to_sitemap_item.
    • Lower priority, but would be neat-o: integrate with WPML or other multilingual plugins (see the extension for translations). I don't know enough about translating WP to say what is the best way to do this.

@jeherve jeherve changed the title In progress: Remove post limit on sitemap Sitemaps: remove post limit on sitemap Feb 21, 2017
@jeherve jeherve added [Status] Needs Review This PR is ready for review. and removed [Status] In Progress labels Feb 21, 2017
@mdawaffe
Copy link
Member

mdawaffe commented Mar 3, 2017

What do you mean by caching strategy? This code does some hand rolled caching of the sitemap files in the database but I have a feeling you're referring to something else.

That's all that I mean. How does the cade cache/store/otherwise preserve the stuff it generates? What stuff is cached? For how long? How is the cache invalidated, or is it? What stuff is not cached and always generated on the fly?

@nbloomf
Copy link
Contributor Author

nbloomf commented Mar 5, 2017

This is also copied to the P2 post:

This module deals with data of a few different kinds, that are cached in different ways. There's

  1. the sitemap pages,
  2. the news sitemap page,
  3. the sitemap state, and
  4. the sitemap XSL files.

Sitemap pages are stored in the posts table using several custom post types. This storage doesn't really act like a cache, because we don't worry about whether the data is stale. These files are stored indefinitely (until replaced by an updated version by the cron job).

The news sitemap page (singular) is cached using a transient. It is invalidated whenever a post is published but not when a comment is posted (see callback_action_flush_news_sitemap_cache), and has a lifetime of JP_NEWS_SITEMAP_INTERVAL (set to 12 hours) (see news_sitemap_xml).

The sitemap state, stored in an option and used to keep track of what needs to be generated next, isn't really cached either. But it is locked to prevent a race condition, and this lock (stored as a transient) has a lifetime of JP_SITEMAP_LOCK_INTERVAL (set to 15 minutes).

The sitemap XSL files are generated fresh on each request. These would be constant if not for some translated UI strings.

@samhotchkiss
Copy link
Contributor

Looks good and tests well. @jeherve, can you make sure that we include this in the testing instructions to make sure people test it out thoroughly?

@samhotchkiss samhotchkiss merged commit 918f0d3 into Automattic:master Mar 27, 2017
@samhotchkiss samhotchkiss removed the [Status] Needs Review This PR is ready for review. label Mar 27, 2017
@jeherve jeherve added [Status] Needs Changelog [Status] Needs Testing We need to add this change to the testing call for this month's release labels Mar 27, 2017
jeherve added a commit that referenced this pull request Mar 28, 2017
@jeherve jeherve added [Status] Has Changelog and removed [Status] Needs Changelog [Status] Needs Testing We need to add this change to the testing call for this month's release labels Mar 28, 2017
samhotchkiss pushed a commit that referenced this pull request Mar 29, 2017
* Readme: remove old release and add skeleton for 4.8.

* Changelog: add #6572

* Changelog: add #6567

* Changelog: add #6542

* Changelog: add #6527

* Changelog: add #6508

* Changelog: add #6478

* Changelog: add #6477

* Changelog: add #6249

* Update stable version and remove old version from readme.

* Changelog: add 4.7.1 to changelog.

* Readme: add new contributor.

* Sync: update docblock @SInCE version.

Related: #6053

* Changelog: add release post.

* changelog: add #6053

* Changelog: add #6413

* Changelog: add #6482

* Changelog: add #6584

* Changelog add #6603

* Changelog: add #6606

* Changelog: add #6611

* Changelog: add #6635

* Changelog: add #6639

* Changelog: add #6684

* Changelog: add #6710

* Changelog: add #6711

* Changelog: add #5461

* Testing list: update Settings UI feedback prompt.

Props @MichaelArestad

* Changelog: add #6789

* Changelog: add #6778

* Changelog: add #6777

* Changelog: add #6775

* Changelog: add #6755

* Changelog: add #6731

* Changelog: add #6721

* Changelog: add #6705

* Changelog: add #6702

* Changelog: add #6671

* Changelog: add #6637

* Changelog: add #6582

* Changelog: add #6566

* Changelog: add #6555

* Changelog: add #6529

* Changelog: add #6344

* Changelog: add #5763

* Changelog: add #5503

* Changelog: update #6637 changelog.

@see 40e115c#commitcomment-21523982

* Changelog: add #6699

* Changelog: add #6632

* Changelog: add #6769

* Changelog: add #6707

* Changelog: add #6590
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Sitemaps [Pri] Normal Trial Project [Type] Enhancement Changes to an existing feature — removing, adding, or changing parts of it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants