Exact authorship match with multiple authorship #126

Closed
BenMerSci opened this issue Nov 25, 2024 · 8 comments

Comments

@BenMerSci

I am not sure whether this is normal behaviour, so I just want to confirm.

The plant name Agrostis tenuis isn't valid according to the VASCAN database, and it links to two different species depending on the author who described Agrostis tenuis (https://data.canadensys.net/vascan/name/Agrostis%20tenuis):

Agrostis tenuis Sibthorp links to Agrostis capillaris Linnaeus
and
Agrostis tenuis Vasey links to Agrostis idahoensis Nash

When we query specifically for Agrostis tenuis Sibthorp from VASCAN:
https://verifier.globalnames.org/api/v1/verifications/Agrostis+tenuis+Sibthorp?capitalize=true&all_matches=true&data_sources=147

we get two results, one for Agrostis capillaris and another for Agrostis idahoensis.
I would have expected only one result since the authorship is provided and is supposed to only link to Agrostis capillaris Linnaeus.
Is this normal behaviour?
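
For reference, a minimal sketch of how the query URL above can be assembled. The base URL and the `capitalize`, `all_matches`, and `data_sources` parameters are copied from the URL quoted above; the helper name `build_verification_url` is made up for illustration, and leaving `all_matches` out when it is false is an assumption about the API default.

```python
from urllib.parse import quote_plus, urlencode

BASE_URL = "https://verifier.globalnames.org/api/v1/verifications/"


def build_verification_url(name, data_source_id, all_matches=False):
    """Build a GET URL shaped like the one quoted in this issue.

    Hypothetical helper: only the parameters seen in the quoted URL are used.
    """
    params = {"capitalize": "true"}
    if all_matches:
        # Assumption: omitting all_matches falls back to best-match-only.
        params["all_matches"] = "true"
    params["data_sources"] = str(data_source_id)
    # quote_plus turns spaces into '+', as in the quoted URL.
    return BASE_URL + quote_plus(name) + "?" + urlencode(params)


print(build_verification_url("Agrostis tenuis Sibthorp", 147, all_matches=True))
```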

@dimus
Member

dimus commented Nov 25, 2024

We use authorship matching to sort the returned results, so Agrostis capillaris Linnaeus should always be the first, 'best' result. However, when 'all-matches' is set to true, all results are returned. We assume all-matches is for people who want to see everything.

For example here we see only 'best' results for each name, and they look as expected:

https://verifier.globalnames.org/?capitalize=on&ds=147&format=html&names=Agrostis+tenuis+Vasey%0D%0AAgrostis+tenuis+Sibthorp%0D%0A

It also manages to figure out the 'best' result if authors are abbreviated:

https://verifier.globalnames.org/?capitalize=on&ds=147&format=html&names=Agrostis+tenuis+Vasey%0D%0AAgrostis+tenuis+Sibthorp%0D%0AAgrostis+tenuis+Vas.%0D%0AAgrostis+tenuis+Sibth.%0D%0A

I was thinking of adding another 'all-matches' option that would return the best match for each data source, but I was not sure whether it would be confusing or helpful.

@BenMerSci
Author

Got it!
I'm just trying to find a way to generalize all my queries, where taxa can come with or without authorship or can be misspelled (fuzzy), and still get the matches I need to resolve the taxonomy.

In my example with Agrostis tenuis, if I have the authorship (whether Sibthorp or Vasey), it works if I remove the argument all_matches=true, because the query then returns only the match with the correct authorship, and I'm not "interested" in the other match since it's a "bad" match (wrong species).
But if we only have, let's say, Agrostis tenuis without authorship, or even a misspelled (fuzzy) name, we might want all possible matches so we can decide afterward.

I guess that for some taxa I would want all the matches and for others not, but I can't know in advance (I'm querying for a lot of different taxa)...

@dimus
Member

dimus commented Nov 26, 2024

Would it work for you to pick the first result from the returned list, if you do want to keep all-matches for all your queries? The first result is guaranteed to correspond to the 'best' match for each data source. Results are not sorted by data source, only by the quality algorithm, but the first result for each data source is always the 'best' result for that data source.

https://verifier.globalnames.org/?all_matches=on&capitalize=on&ds=147&ds=197&ds=196&format=html&names=Agrostis+tenuis+Vasey%0D%0AAgrostis+tenuis+Vas.%0D%0AAgrostis+tenuis+Sibthorp%0D%0AAgrostis+tenuis+Sib.%0D%0AAgrostis+tenuis%0D%0A
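
A rough sketch of that "first result per data source" idea, working on an already decoded list of results. The key names `dataSourceId` and `currentName` are assumptions about the JSON layout, the function name is made up, and the sample rows are illustrative placeholders rather than real API output.

```python
def first_result_per_data_source(results):
    """Return {data source ID: first result seen for that source}.

    Relies on the ordering described above: results are sorted by the
    quality algorithm, so the first hit for a data source is its 'best' one.
    """
    best = {}
    for result in results:
        # "dataSourceId" is an assumed key name for this sketch.
        best.setdefault(result["dataSourceId"], result)
    return best


# Illustrative placeholder rows, not real API output.
sample = [
    {"dataSourceId": 147, "currentName": "Agrostis capillaris Linnaeus"},
    {"dataSourceId": 196, "currentName": "placeholder from another source"},
    {"dataSourceId": 147, "currentName": "Agrostis idahoensis Nash"},
]

for source_id, result in first_result_per_data_source(sample).items():
    print(source_id, result["currentName"])
```

Because `setdefault` only stores the first value for each key, later (lower-ranked) results from the same data source are ignored without any extra sorting.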

@dimus
Member

dimus commented Nov 26, 2024

Maybe I do need to add a flag like 'best-by-data-source' or something of this sort?

@BenMerSci
Author

The thing is, we always want to keep all the results for all the data sources in the query, to have the synonyms, or in case the taxon we queried for is misspelled (fuzzy), etc., unless the result is simply "wrong" like in my example (Agrostis tenuis Sibthorp matching to Agrostis idahoensis Nash is wrong based on VASCAN).

But we can't know before querying that the taxon may have a "wrong" match and that we should use only the best match.
I don't know if that makes sense?

@dimus
Member

dimus commented Nov 26, 2024

I can imagine 2 things that might help:

  1. Looking at ScoreDetails->AuthorMatchScore. If it is zero, the authors did not match at all; if it is less than 0.3, one or both authorships were absent; anything higher means the authors matched to some degree.
  2. Preparsing names and running only the names without authorship with the all-matches option.

https://parser.globalnames.org/?code=&format=csv&names=Agrostis+tenuis+Vasey%0D%0AAgrostis+tenuis&with_details=on
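
The two suggestions above could be sketched roughly as follows. The thresholds (zero, below 0.3, anything higher) come from point 1; treating a score of exactly 0.3 as a match is an assumption, and the function names and the `(name, authorship)` input shape are made up for illustration (the pairs would come from whatever parser output you use).

```python
def classify_author_match(score):
    """Map an AuthorMatchScore to a coarse label using the thresholds above."""
    if score == 0:
        return "no_match"            # authors did not match at all
    if score < 0.3:
        return "authorship_absent"   # one or both authorships were missing
    return "matched"                 # assumption: exactly 0.3 counts as a match


def split_by_authorship(parsed_names):
    """Partition (name, authorship) pairs into names with and without authors.

    Hypothetical input shape: authorship is an empty string or None when the
    parser found no author.
    """
    with_authors, without_authors = [], []
    for name, authorship in parsed_names:
        if (authorship or "").strip():
            with_authors.append(name)
        else:
            without_authors.append(name)
    return with_authors, without_authors


with_authors, without_authors = split_by_authorship(
    [("Agrostis tenuis Vasey", "Vasey"), ("Agrostis tenuis", "")]
)
print(with_authors, without_authors)
```

Only the names in `without_authors` would then be sent with the all-matches option, as described in point 2.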

@BenMerSci
Author

BenMerSci commented Nov 26, 2024

Yes, but we want the matches for all the data sources queried, even for matches with an authorship.

I think we'll manage to work around this on our side after the query. We can use the authorship and another field that we have (parent_scientific_name, which is either the kingdom or the phylum) to go through the results returned by the API with all_matches=true and keep only those where the authorship matches and our parent_scientific_name appears in the classificationPath key.
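
A rough sketch of that post-filter, under several assumptions: that each result exposes the score as `scoreDetails.authorMatchScore`, that `classificationPath` is a `|`-separated string of names, and that an author score of 0.3 or more counts as "authorship matches" (the threshold mentioned earlier in this thread). The function names and sample rows are hypothetical.

```python
def keep_result(result, parent_scientific_name, min_author_score=0.3):
    """Keep a result if the authors matched and the parent taxon is in its path."""
    # Assumed key names; adjust to the actual JSON returned by the API.
    score = result.get("scoreDetails", {}).get("authorMatchScore", 0)
    path = result.get("classificationPath", "")
    taxa_in_path = [part.strip() for part in path.split("|")]
    return score >= min_author_score and parent_scientific_name in taxa_in_path


def filter_results(results, parent_scientific_name):
    """Return only the results that pass keep_result, in their original order."""
    return [r for r in results if keep_result(r, parent_scientific_name)]


# Hypothetical sample rows for illustration only.
sample = [
    {"currentName": "Agrostis capillaris Linnaeus",
     "classificationPath": "Plantae|Tracheophyta",
     "scoreDetails": {"authorMatchScore": 1.0}},
    {"currentName": "Agrostis idahoensis Nash",
     "classificationPath": "Plantae|Tracheophyta",
     "scoreDetails": {"authorMatchScore": 0.0}},
]

print([r["currentName"] for r in filter_results(sample, "Plantae")])
```

For names queried without authorship the score would fall below the threshold, so `min_author_score` would need to be lowered (or the authorship check skipped) for those queries.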

I think we can close the issue.
Again, thank you @dimus for taking the time to go through this!

@dimus
Member

dimus commented Nov 26, 2024

#127

dimus closed this as completed Nov 26, 2024