Replay preferring incorrect WARC response record #64

edsu · 2022-06-23T13:05:27Z

When comparing pywb and openwayback pages I noticed this error during pywb playback:

openwayback displays fine, but pywb displays this error instead of the content:

{"code":"rest_missing_callback_param","message":"Missing parameter(s): url","data":{"status":400,"params":["url"]}}

One thing I noticed is that openwayback appears to redirect to a slightly different timestamp:

https://swap.stanford.edu/20180103160901/https://news.stanford.edu/

Whereas if you look at the content of the pywb frame using https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/ you can see it is redirecting to:

https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/thedish/wp-json/oembed/1.0/embed

This appears to be an error from news.stanford.edu that was archived during the crawl. But I'm confused why this WARC record is being returned by pywb since the URL is so different than the one requested. The pywb index was created with cdxj-indexer using the --post-append option. Could that be a factor here?

edsu · 2022-06-23T13:19:37Z

I started by asking about this over on IIPC Slack in the #pywb room to see if other pywb users have encountered a similar problem.

ikreymer · 2022-06-25T23:37:14Z

The issue is likely related to revisit records and how they're resolved. In this case, the entry here is:

edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", "mime": "warc/revisit", "status": "301", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "597", "offset": "5292", "filename": "ARCHIVEIT-5595-DAILY-JOB537658-20180103160832522-00000.warc.gz", "source": "was:level3.cdxj", "source-coll": "was", "access": "allow"}

Given a revisit, pywb will look up by digest sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B to find a matching non-revisit record.
But there are quite a few records matching this digest:
https://was-pywb-prod.stanford.edu/was/cdx?url=http://news.stanford.edu/&filter=digest:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B
and at different URLs:
https://was-pywb-prod.stanford.edu/was/cdx?url=http://news.stanford.edu/thedish/wp-json/oembed/1.0/embed&&filter=digest:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B

pywb supports 'url agnostic' revisit records, which is why it looks at other URLs for the payload. But it should use the headers from the original record, and in this case that's what matters, since it's a redirect. I'm guessing the payload is some placeholder that gets repeated multiple times?
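
To make the digest-based lookup concrete, here is a minimal sketch of parsing CDXJ lines and finding a non-revisit entry with the same digest. It is not pywb's actual implementation: the function names are made up for illustration, and the second sample line is a hypothetical index entry (only its digest and URL path come from this thread).

```python
import json
from typing import Iterable, Optional


def parse_cdxj_line(line: str) -> dict:
    """Split a CDXJ line into its SURT key, timestamp, and JSON fields."""
    key, timestamp, raw_json = line.strip().split(" ", 2)
    fields = json.loads(raw_json)
    fields["key"] = key
    fields["timestamp"] = timestamp
    return fields


def find_payload_record(revisit: dict, candidates: Iterable[dict]) -> Optional[dict]:
    """Return the first non-revisit entry that shares the revisit's digest.

    The URL is deliberately not compared, which mirrors the
    'url agnostic' lookup described above.
    """
    for entry in candidates:
        if entry.get("mime") == "warc/revisit":
            continue  # a revisit cannot supply the payload
        if entry.get("digest") == revisit.get("digest"):
            return entry
    return None


# Trimmed copy of the revisit entry quoted above.
REVISIT_LINE = (
    'edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", '
    '"mime": "warc/revisit", "status": "301", '
    '"digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B"}'
)
# Hypothetical entry for a capture at a different URL with the same digest.
PAYLOAD_LINE = (
    'edu,stanford,news)/thedish/wp-json/oembed/1.0/embed 20171212013229 '
    '{"url": "https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed", '
    '"mime": "text/html", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B"}'
)

revisit = parse_cdxj_line(REVISIT_LINE)
match = find_payload_record(revisit, [revisit, parse_cdxj_line(PAYLOAD_LINE)])
print(match["url"])
```

Because only the digest is compared, the match can land on a completely different URL, which is the situation in this issue.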

It would be useful to look at the exact record to see what's happening.

edsu · 2022-06-26T12:14:14Z

Thanks for taking a look @ikreymer! Here is the WARC record referenced in that CDXJ entry that you included above:

$ warcio extract /web-archiving-stacks/data/collections/pj520zv3364 /bv/478/zj/1829/ARCHIVEIT-5595-DAILY-JOB537658-20180103160832522-00000.warc.gz 5292

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://news.stanford.edu/
WARC-Date: 2018-01-03T16:08:35Z
WARC-IP-Address: 104.196.197.5
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To: <urn:uuid:n>
WARC-Payload-Digest: sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B
WARC-Refers-To-Date: 2017-12-12T01:32:29Z
WARC-Refers-To-Target-URI: https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fnews.stanford.edu%2Fthedish%2F2015%2F04%2F25%2Fyoung-alumni-arts-grants-support-recent-grads-in-their-artistic-pursuits%2F&format=xml
WARC-Record-ID: <urn:uuid:64e538d1-20ea-4d6e-9eb8-6fa4ad9fc829>
Content-Type: application/http; msgtype=response
Content-Length: 206

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 03 Jan 2018 16:08:36 GMT
Content-Type: text/html
Content-Length: 178
Connection: close
Location: https://news.stanford.edu/
X-Type: default

It looks like the server could be configured to redirect all HTTP URLs to HTTPS? Would these have identical WARC-Payload-Digest despite them having different HTTP response headers?

I'll admit to being confused by the discrepancy between WARC-Refers-To-Target-URI and WARC-Target-URI here. Shouldn't the former be getting indexed for revisit records?

If it's useful to see other records from this WARC file or the CDXJ index please let me know.

ikreymer · 2022-06-26T20:40:21Z

Thanks! That's helpful. That is how it ends up loading that record, as it uses WARC-Refers-To-Target-URI (not sure if openwayback fully supported it). Basically, it's supposed to use the headers from this record and the payload from the record looked up, but somehow it is also using the redirect from the referred-to record. Is it possible to look up that record as well?

edsu · 2022-06-27T15:46:33Z

Yeah, I guess I'm confused why when I index that record above with cdxj-indexer I get a CDX line like this:

edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", "mime": "warc/revisit", "status": "301", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "573", "offset": "0", "filename": "revisit.warc.gz"}

Shouldn't the SURT key be for this URL instead?

https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fnews.stanford.edu%2Fthedish%2F2015%2F04%2F25%2Fyoung-alumni-arts-grants-support-recent-grads-in-their-artistic-pursuits%2F&format=xml

Which record would it be helpful to see?

ikreymer · 2022-06-28T06:20:01Z

> Yeah, I guess I'm confused why when I index that record above with cdxj-indexer I get a CDX line like this:
> edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", "mime": "warc/revisit", "status": "301", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "573", "offset": "0", "filename": "revisit.warc.gz"}
> Shouldn't the SURT key be for this URL instead?

No, the cdxj url is always generated from the WARC-Target-URI for any record. The WARC-Refers-To-Target-URI is handled when the revisit record is used, causing another dynamic lookup of the cdx.
Also, an issue here is that this is a 'self-redirect', e.g. it redirects to the same url with different http/https, since these are canonicalized together. What should happen is that this capture is skipped and the next closest one in the list https://was-pywb-prod.stanford.edu/was/cdx?url=http://news.stanford.edu/&closest=20180103160835&limit=10 is used instead. That is the one at 20180103160901, so replay should also end up at https://swap.stanford.edu/20180103160901/https://news.stanford.edu/ like in openwayback.
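
As a rough illustration of the self-redirect case, the sketch below builds a simplified SURT-style key that drops the scheme, then treats a 3xx response as a self-redirect when its Location canonicalizes to the same key. Both function names are hypothetical and the canonicalization is far simpler than what a real indexer applies, so read it as a sketch rather than pywb's logic.

```python
from urllib.parse import urljoin, urlsplit


def simple_surt(url: str) -> str:
    """Build a simplified SURT-style key: reversed host labels plus path.

    The scheme is dropped, so http:// and https:// variants of the same
    page share one key. Real canonicalizers apply more rules than this.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{labels}){path}{query}".lower()


def is_self_redirect(target_uri: str, status: int, location: str) -> bool:
    """True when a 3xx response points back at its own canonical key."""
    if not 300 <= status < 400 or not location:
        return False
    # Resolve relative Location values against the captured URL first.
    resolved = urljoin(target_uri, location)
    return simple_surt(target_uri) == simple_surt(resolved)


# The capture from this thread: http redirecting to https on the same page.
print(is_self_redirect("http://news.stanford.edu/", 301, "https://news.stanford.edu/"))
```

A capture flagged this way would be a candidate for skipping in favor of the next closest capture in the index.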

I'm puzzled why this one isn't skipped. One thing to try: enabling redirect_to_exact: true in the config.yaml.

edsu · 2022-08-05T20:35:53Z

It looks like enabling redirect_to_exact: true resulted in the addition of a Content-Location with the correct URL in the response. However, the browser replay is unaffected, since Location is still there with the incorrect URL:

HTTP/1.1 301 Moved Permanently
Date: Fri, 05 Aug 2022 20:20:57 GMT
Server: Apache/2.4.41 (Ubuntu)
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Archive-Orig-Server: nginx
X-Archive-Orig-Date: Tue, 12 Dec 2017 01:32:29 GMT
Content-Type: text/html
X-Archive-Orig-Content-Length: 178
X-Archive-Orig-Connection: close
Location: https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/thedish/wp-json/oembed/1.0/embed
Memento-Datetime: Wed, 03 Jan 2018 16:08:35 GMT
Link: <http://news.stanford.edu/>; rel="original", <https://was-pywb-prod.stanford.edu/was/http://news.stanford.edu/>; rel="timegate", <https://was-pywb-prod.stanford.edu/was/timemap/link/http://news.stanford.edu/>; rel="timemap"; type="application/link-format", <https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/>; rel="memento"; datetime="Wed, 03 Jan 2018 16:08:35 GMT"; collection="was"
Content-Location: https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Transfer-Encoding: chunked

Could this be a function of pywb choosing to use:

WARC-Refers-To-Target-URI: https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fnews.stanford.edu%2Fthedish%2F2015%2F04%2F25%2Fyoung-alumni-arts-grants-support-recent-grads-in-their-artistic-pursuits%2F&format=xml

instead of

WARC-Target-URI: http://news.stanford.edu/

in the revisit record?

edsu · 2022-08-07T10:23:46Z

@ikreymer noticed that there was a small bug in replay where the HTTP headers from an identical payload were being returned instead of the HTTP headers from the original revisit record. When a revisit record has a profile of Identical Payload Digest and the Content-Length is not zero, the HTTP headers in the supplied revisit record should be used:

A ‘revisit’ record using this profile may have no record block, in which case a Content-Length of zero must be written. If a record block is present, it shall be interpreted the same as a ‘response’ record type for the same URI, but truncated to avoid storing the duplicate content. A WARC-Truncated header with reason ‘length’ shall be used for any identical-digest truncation. It is recommended that server response headers be preserved in the revisit record in this manner.
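
The rule can be sketched roughly as follows. This is only an illustration of the header selection described above, not the code in pywb: the Record type and the function name are hypothetical, and the payload record's values in the usage lines are invented sample data.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Record:
    """Minimal stand-in for a parsed WARC record (hypothetical type)."""
    content_length: int
    http_headers: Dict[str, str] = field(default_factory=dict)


def select_http_headers(revisit: Record, payload_record: Record) -> Dict[str, str]:
    """Pick the HTTP headers to replay for an identical-payload-digest revisit.

    If the revisit has a record block with HTTP headers, those headers win.
    Otherwise fall back to the headers of the record supplying the payload.
    """
    if revisit.content_length > 0 and revisit.http_headers:
        return revisit.http_headers
    return payload_record.http_headers


# Revisit from this thread: Content-Length 206, redirecting to the https home page.
revisit = Record(206, {"Location": "https://news.stanford.edu/"})
# Hypothetical payload record with a different redirect target.
payload = Record(1024, {"Location": "https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed"})
print(select_http_headers(revisit, payload)["Location"])
```

With the revisit's own headers preferred, the redirect replayed is the one captured for the requested URL rather than the one stored with the shared payload.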

There is now a nice test in pywb that reproduces this situation and tests for the correct behavior.

I was able to live test the revisit-headers-load-fix branch using the (soon to be retired) was-dev.stanford.edu server, which has access to the full WARC data and a recent-ish CDXJ index.

Replay for this example works now!

It may be a small bug, but this greatly improves replay of some important (and thus heavily crawled/revisited) stanford.edu content.

I'll leave this ticket open until the pywb branch is merged, released and installed with our weekly dependency updates.

* revisit loading fix for revisit records with http headers:
  - if revisit record has http headers, always use those headers
  - otherwise, continue to use http headers from payload record
  - parse headers of http and payload records on initial lookup, to simplify loading
  - tests: add test for loading revisit records with different urls, different headers but same payload
  - fix for sul-dlss/was-pywb#64
* also bump version to 2.6.8

edsu · 2022-09-06T16:15:42Z

This is fixed in v.2.6.8 that was just deployed in qa/stage and prod!

https://was-pywb-prod.stanford.edu/was/20180103160835/http://news.stanford.edu/

edsu changed the title ~~Missing parameter(s): url~~ Replay error: Missing parameter(s): url Jun 23, 2022

edsu added the web archiving 2022 web archiving work cycle label Jun 23, 2022

edsu self-assigned this Jun 23, 2022

edsu changed the title ~~Replay error: Missing parameter(s): url~~ Replay preferring incorrect WARC response record Jun 23, 2022

edsu added the replay label Jul 12, 2022

mjgiarlo mentioned this issue Jul 21, 2022

Replay for site redirecting to a different URL not working #95

Closed

ikreymer added a commit to webrecorder/pywb that referenced this issue Aug 6, 2022

revisit loading:

413ba21

- use http headers from headers record!
- parse records on initial lookup, as may need to use http headers from headers record
- possible fix for sul-dlss/was-pywb#64

ikreymer mentioned this issue Aug 16, 2022

Revisit headers load fix webrecorder/pywb#751

Merged

8 tasks

edsu closed this as completed Sep 6, 2022
