Expand embedded metadata detection #77
Comments
With recent changes to Embedded Metadata, EM performs quite well on the linked page. It properly detects the type as thesis, which neither MODS nor MARCXML does. In this particular case MARCXML performs quite poorly, but it does supply numPages and seriesNumber, which the others do not. MODS supplies the full abstract and picks up additional authors, which are probably more appropriately classified as contributors (e.g. professor, university; is this even desirable?). It also contains the ISBN and publication title, but these are not preserved in a thesis itemType. So the advantage here is that EM detects it as a thesis, but it would be nice to get the full abstract.

If we decide to supplement EM data with MARC or MODS, it may become difficult to determine which data we would prefer. Would we include all the authors? Add all the notes? (MODS adds 7 additional notes, which are fairly redundant and don't seem to enhance the metadata.) One feasible solution, I think, would be to parse linked MODS and MARC pages and only supplement fields that are completely missing from the EM translator's output. This would leave out the abstract. Perhaps we can say that MODS/MARC abstracts will always be more complete?
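The "only supplement fields that are completely missing" idea can be sketched roughly as follows. This is a minimal illustration, not the translator's actual code: the function name `supplement` and the sample item values are hypothetical, and items are modelled as plain dictionaries.

```python
def supplement(primary, secondary):
    """Return a copy of `primary`, filling in fields from `secondary`
    only where `primary` has no value at all (hypothetical helper)."""
    empty = (None, "", [])
    merged = dict(primary)
    for field, value in secondary.items():
        if value in empty:
            continue  # nothing useful to add
        if merged.get(field) in empty:
            merged[field] = value  # field was completely missing
    return merged


# Made-up values, loosely following the case described above.
em_item = {
    "itemType": "thesis",
    "title": "Example thesis",
    "abstractNote": "Short abstract",
}
marc_item = {
    "itemType": "book",
    "numPages": "250",
    "seriesNumber": "12",
    "abstractNote": "A much longer abstract",
}

merged = supplement(em_item, marc_item)
```

With this rule the EM item type and the short EM abstract both win, which is exactly the trade-off raised above: the fuller MODS/MARC abstract would need a per-field exception.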
I was just wondering to myself why we didn't do this. Seems like it would be absolutely trivial to implement, for us and publishers. We're also trying to improve embedded metadata support by adding support for JSON-LD, but this would be a lower-tech solution when sites already have BibTeX, RIS, etc.: basically, an easier, more standards-compliant, non-abandoned unAPI. The HTML 5.1 draft also allows

Any reason we shouldn't do this? I guess the biggest downside is that, as with unAPI, we'd need to make a separate request for each link and run detection on the result in order to show a proper icon.
no reason from my side. @zuphilip has brought this up, too (he'd know where, but in some related EM discussion), and it seems like a very good idea to me.
Do you mean this discussion about the Blacklight discovery system: #893 (comment)?
yup, thanks.
Here are some examples from Blacklight catalogs:
Another example, where MODS is available: https://purl.stanford.edu/fv751yt5934

`<link rel="alternate" title="MODS XML" type="application/xml" href="https://purl.stanford.edu/fv751yt5934.mods" />`

For a case like this I think we'd just want to look for 'mods' in the title and href when the type is
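A rough sketch of that heuristic, under two assumptions that the comment does not spell out: that the generic type in question is `application/xml` (taken from the example link above), and that 'mods' must appear in both the title and the href. The function name `looks_like_mods` is hypothetical.

```python
# Assumption: the generic type meant here is the one in the example link.
GENERIC_TYPES = {"application/xml"}


def looks_like_mods(link):
    """Guess whether a generically typed <link rel="alternate"> points at MODS.

    `link` is a dict of the tag's attributes (type, title, href).
    """
    if link.get("type", "").strip().lower() not in GENERIC_TYPES:
        return False
    title = link.get("title", "").lower()
    href = link.get("href", "").lower()
    # Assumption: require the hint in both places to limit false positives.
    return "mods" in title and "mods" in href


stanford_link = {
    "rel": "alternate",
    "title": "MODS XML",
    "type": "application/xml",
    "href": "https://purl.stanford.edu/fv751yt5934.mods",
}
is_mods = looks_like_mods(stanford_link)
```

Requiring both hints is the stricter reading; relaxing it to either one would catch more sites at the cost of more wasted requests.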
Translator additions and fixes
The `<link rel="alternate" />` syntax for providing alternate representations should be used when we look for embedded metadata. A recent discussion notes a site providing dissertations that we don't import correctly. In addition to Google/Highwire metadata which we're parsing, it includes such `<link rel="alternate" />` references to structured descriptions:

I don't think we can expect to read these as-is, since the `text/xml` type is too vague, but we should look for known types for formats we do read, just like we do for intercepting RIS/BibTeX download. That means `application/mods+xml` for MODS, etc.
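To make the proposal concrete, here is a minimal sketch of scanning a page for `<link rel="alternate" />` tags and keeping only those whose `type` is a known, specific format. It uses Python's standard `html.parser` purely for illustration. The `KNOWN_TYPES` table and the names `AlternateLinkFinder` and `find_metadata_links` are assumptions; apart from `application/mods+xml`, the listed media types are commonly seen values rather than anything this issue specifies.

```python
from html.parser import HTMLParser

# Illustrative table; only application/mods+xml is named in the issue itself.
KNOWN_TYPES = {
    "application/mods+xml": "MODS",
    "application/marcxml+xml": "MARCXML",
    "application/x-research-info-systems": "RIS",
    "application/x-bibtex": "BibTeX",
}


class AlternateLinkFinder(HTMLParser):
    """Collects (format, href) pairs for recognised alternate links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing <link ... /> tags.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        if "alternate" not in attr.get("rel", "").lower().split():
            return
        fmt = KNOWN_TYPES.get(attr.get("type", "").strip().lower())
        if fmt:  # vague types such as text/xml are skipped
            self.links.append((fmt, attr.get("href", "")))


def find_metadata_links(html):
    finder = AlternateLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


# Hypothetical page head with one specific and one vague alternate link.
sample = (
    '<head>'
    '<link rel="alternate" type="application/mods+xml" href="/record/1.mods"/>'
    '<link rel="alternate" type="text/xml" href="/record/1.xml"/>'
    '<link rel="stylesheet" type="text/css" href="style.css">'
    '</head>'
)
found = find_metadata_links(sample)
```

Each link found this way would still need its own request, and detection on the response, before the result could be trusted.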