Replay preferring incorrect WARC response record #64
Comments
I started by asking about this over on IIPC Slack in the #pywb room to see if other pywb users have encountered a similar problem.
The issue is likely related to revisit records and how they're resolved. In this case, the entry here is:
Given a revisit, pywb will look up the payload by digest. pywb supports 'url agnostic' revisit records, which is why it would look at other URLs for the payload. But it should use the headers from the original record, and in this case that's what matters, as it's a redirect. The payload, I'm guessing, is probably some placeholder that gets repeated multiple times? It would be useful to look at the exact record to see what's happening.
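To make the digest lookup described above concrete, here is a minimal sketch of URL-agnostic revisit resolution. It is not pywb's code: the `IndexEntry` type, the `find_payload_record` helper, and the sample URLs and digests are all hypothetical, and picking the most recent earlier capture is an assumption of this sketch rather than documented pywb behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexEntry:
    """Hypothetical, simplified stand-in for one index line."""
    url: str
    timestamp: str  # 14-digit timestamp, so string comparison orders by time
    digest: str
    is_revisit: bool


def find_payload_record(revisit: IndexEntry,
                        index: List[IndexEntry]) -> Optional[IndexEntry]:
    """Find a non-revisit entry with the same payload digest.

    The URL is deliberately ignored ("url agnostic"), which is why the
    payload can come from a capture of a completely different URL.
    """
    candidates = [
        entry for entry in index
        if not entry.is_revisit
        and entry.digest == revisit.digest
        and entry.timestamp <= revisit.timestamp
    ]
    if not candidates:
        return None
    # Assumption: prefer the most recent capture at or before the revisit.
    return max(candidates, key=lambda entry: entry.timestamp)


# Made-up data: a revisit of one URL whose payload digest matches an
# earlier capture of a different URL.
index = [
    IndexEntry("http://example.org/error", "20180101000000", "sha1:AAA", False),
    IndexEntry("http://example.org/other", "20180102000000", "sha1:BBB", False),
    IndexEntry("http://example.org/", "20180103000000", "sha1:AAA", True),
]

payload = find_payload_record(index[2], index)
print(payload.url if payload else "no payload record found")
```

The lookup only decides where the payload bytes come from. Which record supplies the HTTP headers is a separate decision, and that is the part that matters when the revisit is a redirect.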
Thanks for taking a look @ikreymer! Here is the WARC record referenced in that CDXJ entry that you included above:
It looks like the server could be configured to redirect all HTTP URLs to HTTPS? Would these have identical I'll admit to being confused by the discrepancy between If it's useful to see other records from this WARC file or the CDXJ index please let me know.
Thanks! That's helpful. That is how it ends up loading that record, as it uses
Yeah, I guess I'm confused why when I index that record above with cdxj-indexer I get a CDX line like this:
Shouldn't the SURT key be for this URL instead?
Which record would it be helpful to see?
No, the cdxj url is always generated from the I'm puzzled why this one isn't skipped. One thing to try, enabling:
It looks like enabling
Could this be a function of pywb choosing to use:
instead of
in the revisit record?
- use http headers from headers record!
- parse records on initial lookup, as may need to use http headers from headers record
- possible fix for sul-dlss/was-pywb#64
@ikreymer noticed that there was a small bug in replay where the HTTP headers from an identical payload were being returned instead of the HTTP headers from the original revisit record. When a revisit record has a profile of Identical Payload Digest and the
There is now a nice test in pywb that reproduces this situation and tests for the correct behavior. I was able to live test the revisit-headers-load-fix branch using the (soon to be retired) was-dev.stanford.edu server, which has access to the full WARC data and a recent-ish CDXJ index. Replay for this example works now! It may be a small bug, but this greatly improves replay of some important (and thus heavily crawled/revisited) stanford.edu content. I'll leave this ticket open until the pywb branch is merged, released and installed with our weekly dependency updates.
* revisit loading fix for revisit records with http headers:
  - if revisit record has http headers, always use those headers
  - otherwise, continue to use http headers from payload record
  - parse headers of http and payload records on initial lookup, to simplify loading
  - tests: add test for loading revisit records with different urls, different headers but same payload
  - fix for sul-dlss/was-pywb#64
* also bump version to 2.6.8
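The header rule in the commit message above can be illustrated with a small sketch. This is only a model of the described behavior, not the actual pywb change: the `select_replay_headers` function, the `Headers` alias and the sample header values are hypothetical names and data introduced for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: HTTP headers as a list of (name, value) pairs.
Headers = List[Tuple[str, str]]


def select_replay_headers(revisit_http_headers: Optional[Headers],
                          payload_http_headers: Headers) -> Headers:
    """Choose which HTTP headers to replay for a revisit record.

    If the revisit record carries its own HTTP headers, use them.
    Otherwise fall back to the headers of the record that supplied
    the payload.
    """
    if revisit_http_headers:
        return revisit_http_headers
    return payload_http_headers


# Made-up example: the revisit captured a redirect, while the record with
# the matching payload digest was captured with different headers.
revisit_headers = [("Location", "https://example.org/"),
                   ("Content-Type", "text/html")]
payload_headers = [("Content-Type", "text/html"),
                   ("X-Example", "placeholder")]

chosen = select_replay_headers(revisit_headers, payload_headers)
fallback = select_replay_headers(None, payload_headers)
print(dict(chosen).get("Location"))
print(dict(fallback).get("Location"))
```

With the revisit's own headers preferred, the `Location` header of the redirect survives replay even though the payload was resolved from a different URL.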
This is fixed in v2.6.8, which was just deployed to qa/stage and prod! https://was-pywb-prod.stanford.edu/was/20180103160835/http://news.stanford.edu/
When comparing pywb and openwayback pages I noticed this error during pywb playback:
openwayback displays fine, but pywb displays this error instead of the content:
One thing I noticed is that openwayback appears to redirect to a slightly different timestamp:
Whereas if you look at the content of the pywb frame using https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/ you can see it is redirecting to:
This appears to be an error from news.stanford.edu that was archived during the crawl. But I'm confused why this WARC record is being returned by pywb, since the URL is so different from the one requested. The pywb index was created with cdxj-indexer using the --post-append option. Could that be a factor here?