Replay preferring incorrect WARC response record #64

Closed
edsu opened this issue Jun 23, 2022 · 9 comments
Assignees
Labels
replay web archiving 2022 web archiving work cycle

Comments

@edsu
Contributor

edsu commented Jun 23, 2022

When comparing pywb and openwayback pages I noticed this error during pywb playback:

openwayback displays fine, but pywb displays this error instead of the content:

{"code":"rest_missing_callback_param","message":"Missing parameter(s): url","data":{"status":400,"params":["url"]}}

One thing I noticed is that openwayback appears to redirect to a slightly different timestamp:

Whereas if you look at the content of the pywb frame using https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/ you can see it is redirecting to:

This appears to be an error from news.stanford.edu that was archived during the crawl. But I'm confused why this WARC record is being returned by pywb since the URL is so different than the one requested. The pywb index was created with cdxj-indexer using the --post-append option. Could that be a factor here?

@edsu edsu changed the title Missing parameter(s): url Replay error: Missing parameter(s): url Jun 23, 2022
@edsu edsu added the web archiving 2022 web archiving work cycle label Jun 23, 2022
@edsu edsu self-assigned this Jun 23, 2022
@edsu
Contributor Author

edsu commented Jun 23, 2022

I started by asking about this over on IIPC Slack in the #pywb room to see if other pywb users have encountered a similar problem.

@edsu edsu changed the title Replay error: Missing parameter(s): url Replay preferring incorrect WARC response record Jun 23, 2022
@ikreymer

ikreymer commented Jun 25, 2022

The issue is likely related to revisit records and how they're resolved. In this case, the entry here is:

edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", "mime": "warc/revisit", "status": "301", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "597", "offset": "5292", "filename": "ARCHIVEIT-5595-DAILY-JOB537658-20180103160832522-00000.warc.gz", "source": "was:level3.cdxj", "source-coll": "was", "access": "allow"}

Given a revisit, pywb will look up the digest sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B to find a matching non-revisit record.
But there are quite a few records matching this digest:
https://was-pywb-prod.stanford.edu/was/cdx?url=http://news.stanford.edu/&filter=digest:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B
and at different URLs:
https://was-pywb-prod.stanford.edu/was/cdx?url=http://news.stanford.edu/thedish/wp-json/oembed/1.0/embed&&filter=digest:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B

pywb supports 'url agnostic' revisit records, which is why it would look at other URLs for the payload. But it should use the headers from the original (revisit) record, and in this case that's what matters, since it's a redirect. I'm guessing the payload is some placeholder that gets repeated multiple times?

It would be useful to look at the exact record to see what's happening.
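
As a rough illustration of the digest lookup described above, the following sketch parses CDXJ lines and collects the non-revisit entries that share a revisit's digest. The helper names (`parse_cdxj_line`, `find_payload_candidates`) are hypothetical, the second and third sample entries are made up for the example, and none of this is pywb's actual implementation.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into (surt_key, timestamp, fields)."""
    # The JSON block contains spaces, so split at most twice.
    surt_key, timestamp, raw_json = line.strip().split(" ", 2)
    return surt_key, timestamp, json.loads(raw_json)


def find_payload_candidates(revisit_fields, index_lines):
    """Return non-revisit entries whose digest matches the revisit's digest."""
    wanted = revisit_fields["digest"]
    matches = []
    for line in index_lines:
        key, timestamp, fields = parse_cdxj_line(line)
        if fields.get("mime") == "warc/revisit":
            continue  # a revisit cannot supply the payload
        if fields.get("digest") == wanted:
            matches.append((key, timestamp, fields))
    return matches


# The first entry is abridged from the thread; the other two are hypothetical.
INDEX = [
    'edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", '
    '"mime": "warc/revisit", "status": "301", '
    '"digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B"}',
    'edu,stanford,news)/thedish/wp-json/oembed/1.0/embed 20171212013229 '
    '{"url": "https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed", '
    '"mime": "text/html", "status": "301", '
    '"digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B"}',
    'edu,stanford,news)/ 20180103160901 {"url": "https://news.stanford.edu/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"}',
]

revisit = parse_cdxj_line(INDEX[0])[2]
candidates = find_payload_candidates(revisit, INDEX)
```

In this toy index the only candidate is the entry at a different URL, which is the kind of cross-URL match that a 'url agnostic' lookup allows.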

@edsu
Contributor Author

edsu commented Jun 26, 2022

Thanks for taking a look @ikreymer! Here is the WARC record referenced in that CDXJ entry that you included above:

$ warcio extract /web-archiving-stacks/data/collections/pj520zv3364 /bv/478/zj/1829/ARCHIVEIT-5595-DAILY-JOB537658-20180103160832522-00000.warc.gz 5292
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://news.stanford.edu/
WARC-Date: 2018-01-03T16:08:35Z
WARC-IP-Address: 104.196.197.5
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To: <urn:uuid:n>
WARC-Payload-Digest: sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B
WARC-Refers-To-Date: 2017-12-12T01:32:29Z
WARC-Refers-To-Target-URI: https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fnews.stanford.edu%2Fthedish%2F2015%2F04%2F25%2Fyoung-alumni-arts-grants-support-recent-grads-in-their-artistic-pursuits%2F&format=xml
WARC-Record-ID: <urn:uuid:64e538d1-20ea-4d6e-9eb8-6fa4ad9fc829>
Content-Type: application/http; msgtype=response
Content-Length: 206

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Wed, 03 Jan 2018 16:08:36 GMT
Content-Type: text/html
Content-Length: 178
Connection: close
Location: https://news.stanford.edu/
X-Type: default

It looks like the server could be configured to redirect all HTTP URLs to HTTPS? Would these have identical WARC-Payload-Digest despite them having different HTTP response headers?

I'll admit to being confused by the discrepancy between WARC-Refers-To-Target-URI and WARC-Target-URI here. Shouldn't the former be getting indexed for revisit records?

If it's useful to see other records from this WARC file or the CDXJ index please let me know.
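
On the question of identical digests: a payload digest covers only the HTTP body, not the response headers, so two redirects with different Location headers can share one. The sketch below illustrates this under the assumption that the digest is a base32-encoded SHA-1 of the body, which matches the format of the digest shown above. The function name and the response bytes are hypothetical.

```python
import base64
import hashlib


def payload_digest(http_response: bytes) -> str:
    """Return 'sha1:<base32>' computed over the HTTP body only."""
    # Everything after the first blank line is treated as the body.
    _headers, _sep, body = http_response.partition(b"\r\n\r\n")
    return "sha1:" + base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")


# Hypothetical generic redirect body, shared by both responses.
BODY = b"<html><body>301 Moved Permanently</body></html>\r\n"

ROOT_REDIRECT = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Location: https://news.stanford.edu/\r\n\r\n" + BODY
)
OTHER_REDIRECT = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Location: https://news.stanford.edu/thedish/\r\n\r\n" + BODY
)

same_digest = payload_digest(ROOT_REDIRECT) == payload_digest(OTHER_REDIRECT)
```

Here `same_digest` is true even though the two Location headers differ, because only the shared body is hashed.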

@ikreymer

Thanks! That's helpful. That is how it ends up loading that record: it uses WARC-Refers-To-Target-URI (not sure if openwayback fully supported it). Basically, it's supposed to use the headers from this record and the payload from the looked-up record, but somehow it is also using the redirect from the referred-to record. Is it possible to look up that record as well?

@edsu
Contributor Author

edsu commented Jun 27, 2022

Yeah, I guess I'm confused why when I index that record above with cdxj-indexer I get a CDX line like this:

edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", "mime": "warc/revisit", "status": "301", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "573", "offset": "0", "filename": "revisit.warc.gz"}

Shouldn't the SURT key be for this URL instead?

https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fnews.stanford.edu%2Fthedish%2F2015%2F04%2F25%2Fyoung-alumni-arts-grants-support-recent-grads-in-their-artistic-pursuits%2F&format=xml

Which record would it be helpful to see?

@ikreymer

> Yeah, I guess I'm confused why when I index that record above with cdxj-indexer I get a CDX line like this:
>
> edu,stanford,news)/ 20180103160835 {"url": "http://news.stanford.edu/", "mime": "warc/revisit", "status": "301", "digest": "sha1:QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "573", "offset": "0", "filename": "revisit.warc.gz"}
>
> Shouldn't the SURT key be for this URL instead?

No, the cdxj url is always generated from the WARC-Target-URI for any record. The WARC-Refers-To-Target-URI is handled when the revisit record is used, which causes another dynamic lookup of the cdx.
Also, an issue here is that this is a 'self-redirect', i.e. it redirects to the same URL with a different scheme (http vs. https), and these are canonicalized together. What should happen is that this capture is skipped and the next closest one in the list https://was-pywb-prod.stanford.edu/was/cdx?url=http://news.stanford.edu/&closest=20180103160835&limit=10 is used, which is the one at 20180103160901, so it should also end up at https://swap.stanford.edu/20180103160901/https://news.stanford.edu/ as in openwayback.

I'm puzzled why this one isn't skipped. One thing to try: enabling redirect_to_exact: true in the config.yaml
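
To make the self-redirect idea concrete, here is a minimal sketch. It assumes a heavily simplified SURT-style key (scheme dropped, host reversed) rather than pywb's real canonicalizer, and the names `simple_surt`, `is_self_redirect`, and `pick_capture` are hypothetical. The status of the second capture is also an assumption made for the example.

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Simplified SURT-style key: drop the scheme, reverse the host labels."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def is_self_redirect(capture_url: str, location: str) -> bool:
    """True when a redirect target canonicalizes to the capture's own key."""
    return simple_surt(capture_url) == simple_surt(location)


def pick_capture(captures):
    """Return the first capture (closest first) that is not a self-redirect."""
    for cap in captures:
        is_redirect = cap["status"].startswith("3")
        if is_redirect and is_self_redirect(cap["url"], cap.get("location", "")):
            continue  # skip it and try the next closest capture
        return cap
    return None


# Captures ordered by closeness to the requested timestamp.
CAPTURES = [
    {"timestamp": "20180103160835", "url": "http://news.stanford.edu/",
     "status": "301", "location": "https://news.stanford.edu/"},
    # The 200 status here is assumed for illustration.
    {"timestamp": "20180103160901", "url": "https://news.stanford.edu/",
     "status": "200"},
]

chosen = pick_capture(CAPTURES)
```

Because the http and https forms produce the same key, the 301 at 20180103160835 counts as a self-redirect and the capture at 20180103160901 is chosen instead.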

@edsu
Contributor Author

edsu commented Aug 5, 2022

It looks like enabling redirect_to_exact: true resulted in the addition of a Content-Location header with the correct URL in the response. However, browser replay is unaffected, since the Location header is still present with the incorrect URL:

HTTP/1.1 301 Moved Permanently
Date: Fri, 05 Aug 2022 20:20:57 GMT
Server: Apache/2.4.41 (Ubuntu)
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Archive-Orig-Server: nginx
X-Archive-Orig-Date: Tue, 12 Dec 2017 01:32:29 GMT
Content-Type: text/html
X-Archive-Orig-Content-Length: 178
X-Archive-Orig-Connection: close
Location: https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/thedish/wp-json/oembed/1.0/embed
Memento-Datetime: Wed, 03 Jan 2018 16:08:35 GMT
Link: <http://news.stanford.edu/>; rel="original", <https://was-pywb-prod.stanford.edu/was/http://news.stanford.edu/>; rel="timegate", <https://was-pywb-prod.stanford.edu/was/timemap/link/http://news.stanford.edu/>; rel="timemap"; type="application/link-format", <https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/>; rel="memento"; datetime="Wed, 03 Jan 2018 16:08:35 GMT"; collection="was"
Content-Location: https://was-pywb-prod.stanford.edu/was/20180103160835mp_/http://news.stanford.edu/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Transfer-Encoding: chunked

Could this be a function of pywb choosing to use:

WARC-Refers-To-Target-URI: https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fnews.stanford.edu%2Fthedish%2F2015%2F04%2F25%2Fyoung-alumni-arts-grants-support-recent-grads-in-their-artistic-pursuits%2F&format=xml

instead of

WARC-Target-URI: http://news.stanford.edu/

in the revisit record?

ikreymer added a commit to webrecorder/pywb that referenced this issue Aug 6, 2022
- use http headers from headers record!
- parse records on initial lookup, as may need to use http headers from headers record
- possible fix for sul-dlss/was-pywb#64
@edsu
Contributor Author

edsu commented Aug 7, 2022

@ikreymer noticed that there was a small bug in replay where the HTTP headers from the record holding the identical payload were being returned instead of the HTTP headers from the original revisit record. When a revisit record has a profile of Identical Payload Digest and its Content-Length is not zero, the HTTP headers in the revisit record itself should be used, as the WARC specification describes:

A ‘revisit’ record using this profile may have no record block, in which case a Content-Length of zero must be written. If a record block is present, it shall be interpreted the same as a ‘response’ record type for the same URI, but truncated to avoid storing the duplicate content. A WARC-Truncated header with reason ‘length’ shall be used for any identical-digest truncation. It is recommended that server response headers be preserved in the revisit record in this manner.
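
The rule quoted above can be sketched roughly as follows. This is an illustration only, not the code in the pywb fix: the `Record` class and `resolve_revisit` function are hypothetical, and the Location header on the payload record is made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Record:
    rec_type: str
    target_uri: str
    # None models a revisit with no record block (Content-Length of zero).
    http_headers: Optional[Dict[str, str]]
    payload: bytes = b""


def resolve_revisit(revisit: Record, original: Record) -> Tuple[Optional[Dict[str, str]], bytes]:
    """Pick the HTTP headers and payload to replay for a revisit record."""
    # Prefer the revisit's own headers; fall back to the payload record's.
    headers = revisit.http_headers if revisit.http_headers else original.http_headers
    # The payload always comes from the referred-to record.
    return headers, original.payload


revisit = Record(
    "revisit",
    "http://news.stanford.edu/",
    {"Location": "https://news.stanford.edu/"},
)
original = Record(
    "response",
    "https://news.stanford.edu/thedish/wp-json/oembed/1.0/embed",
    {"Location": "https://news.stanford.edu/thedish/"},  # hypothetical value
    b"<html>301 Moved Permanently</html>",
)

headers, payload = resolve_revisit(revisit, original)
```

With these inputs the replayed Location comes from the revisit record, while the body still comes from the referred-to record.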

There is now a nice test in pywb that reproduces this situation and tests for the correct behavior.

I was able to live test the revisit-headers-load-fix branch using the (soon to be retired) was-dev.stanford.edu server, which has access to the full WARC data and a recent-ish CDXJ index.

Replay for this example works now!

[Screenshot: Screen Shot 2022-08-07 at 6 01 30 AM]

It may be a small bug, but this greatly improves replay of some important (and thus heavily crawled/revisited) stanford.edu content.

I'll leave this ticket open until the pywb branch is merged, released and installed with our weekly dependency updates.

ikreymer added a commit to webrecorder/pywb that referenced this issue Aug 19, 2022
* revisit loading fix for revisit records with http headers:
- if revisit record has http headers, always use those headers
- otherwise, continue to use http headers from payload record
- parse headers of http and payload records on initial lookup, to simplify loading
- tests: add test for loading revisit records with different urls, different headers but same payload
- fix for sul-dlss/was-pywb#64
* also bump version to 2.6.8
@edsu
Contributor Author

edsu commented Sep 6, 2022

This is fixed in pywb v2.6.8, which was just deployed to qa/stage and prod!

https://was-pywb-prod.stanford.edu/was/20180103160835/http://news.stanford.edu/

@edsu edsu closed this as completed Sep 6, 2022