subgenus = Incertae sedis then name string doesn't parse, also strange looking quality values #277

Open
debpaul opened this issue Jan 22, 2025 · 3 comments

Comments

@debpaul

debpaul commented Jan 22, 2025

Raw data (unparsed): beulah-first-5000-name-strings-unparsed.csv

Modified GNParsed Data Set: beulath-taxonnames-gnparsed-first-5000-rows.txt

  • added family column, value = Carabidae
  • opened file in Notepad++
  • changed CRLF line endings to UNIX (LF), because upload to the TW batch requires this
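
For anyone who wants to script the line-ending step instead of doing it in an editor, here is a minimal sketch. The helper names `crlf_to_lf` and `convert_file` are made up for illustration and are not part of GNparser or TW.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with UNIX (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Copy src to dst with CRLF line endings converted to LF.

    Hypothetical helper. It works on bytes so that Python's own
    newline translation does not interfere with the result.
    """
    Path(dst).write_bytes(crlf_to_lf(Path(src).read_bytes()))
```

Lines that already end in LF are left unchanged, so running it twice on the same file is harmless.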

Noticed

  • the Quality values look strange? Maybe on import into Excel, I need to select a certain data type for this field?
    Image

  • see also line 11 above, where the value pseudoflavipes appears changed to pseudoflavipe0s in the CanonicalFull column (also lines 116, 117)

    • don't know where that 0 comes from
  • see also Author Year leading and trailing 0. Not sure where they are coming from either
    Image

  • More 0 issues (and delimiters issue?), origin uncertain
    Image

  • Some names did not parse. (Not sure why). See screenshot next. Maybe because all these names have subgenus = (Incertae sedis) and GN doesn't recognize this value at this rank?

Image

  • In general, subgenus is missing from all parsed values.

Maybe in future?

  • option to parse (further atomize) down to lowest rank provided
@dimus
Member

dimus commented Jan 22, 2025

Thanks @debpaul, interesting

  1. Looks like I am missing the case where the subgenus is Incertae sedis. I agree that names like these should be parsed. I will make a separate issue about it.

  2. The strange results in Quality are an artefact of postprocessing; it is impossible to get quality 10. The '0' in the middle of the Canonical value also seems to be a postprocessing problem. Try running this name by itself in the parser.

  3. Subgenus is provided, just not in the CSV format. If you pick JSON format on the web UI, you will see the subgenus results.
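
Until point 1 is fixed, one possible workaround is to set the affected names aside before sending a batch to the parser. The sketch below is only an illustration: the function names are invented, the regular expression assumes the "(Incertae sedis)" placeholder sits right after the genus, and the sample names in the comments are made up, not taken from the data set.

```python
import re

# Assumption: the placeholder appears in parentheses directly after the
# first word (the genus), i.e. in the subgenus position. Case-insensitive.
INCERTAE_SEDIS_SUBGENUS = re.compile(
    r"^\s*\S+\s+\(\s*incertae\s+sedis\s*\)", re.IGNORECASE
)


def has_incertae_sedis_subgenus(name: str) -> bool:
    """Return True if the name string has '(Incertae sedis)' as subgenus."""
    return INCERTAE_SEDIS_SUBGENUS.search(name) is not None


def split_names(names):
    """Split names into (flagged, rest), keeping the input order.

    Hypothetical example inputs:
      "Aus (Incertae sedis) bus Smith, 1900" goes to flagged
      "Aus (Bus) cus Smith, 1900" goes to rest
    """
    flagged = [n for n in names if has_incertae_sedis_subgenus(n)]
    rest = [n for n in names if not has_incertae_sedis_subgenus(n)]
    return flagged, rest
```

The flagged names can then be reviewed by hand or re-run once the parser handles this case.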

@debpaul
Author

debpaul commented Jan 22, 2025

@dimus thanks! I did note that on import to Excel, it asks about modifying or removing leading zeroes. Not sure why. I told it not to modify the data. I'll test again as you suggest.
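
One way to tell whether the stray 0s are already in the file or only appear after a spreadsheet import is to read the CSV as plain text, with every cell kept as a string. This is a rough sketch under assumptions: it relies on the `CanonicalFull` and `Quality` column names mentioned above, it treats any digit in a canonical name as suspicious, and it treats a Quality value longer than one character as suspicious (based only on the remark that 10 is impossible). The function name `suspicious_rows` is made up.

```python
import csv
import io


def suspicious_rows(csv_text: str):
    """Yield (line_number, column, value) for cells that look corrupted.

    All values are read as strings, so nothing is reinterpreted as a
    number the way a spreadsheet import might do.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row in reader:
        canonical = row.get("CanonicalFull") or ""
        quality = row.get("Quality") or ""
        # Assumption: a canonical name should not contain digits.
        if any(ch.isdigit() for ch in canonical):
            yield reader.line_num, "CanonicalFull", canonical
        # Assumption: a valid Quality value is a single character.
        if len(quality.strip()) > 1:
            yield reader.line_num, "Quality", quality
```

If this reports nothing for the file that came out of the parser but the same values look wrong in the spreadsheet, the import step is the likely culprit.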

@dimus
Member

dimus commented Jan 22, 2025

This is what I get without preprocessing:

beulah-parsed.txt

@debpaul can you also try LibreOffice? It consistently gives me better results than Excel.
