FindID is an API that turns text strings describing academic entities into OpenAlex IDs. It will support these entities:
- works (references): Find a work ID from a bibliographic reference. Not implemented yet.
- authors: Find an author ID from a name + metadata about a paper they wrote. Not implemented yet.
- sources (journals, repositories, etc): Find a source ID from a name. Not implemented yet.
- institutions: Find a list of institution IDs from an affiliation string.
GET /entities/institutions: Get a list of matching institution IDs from a text string.
- Query parameters:
query
: The text string to match against.
- Returns:
query
: The original query
entities
: A list of Institution objects for that query. Note this is a list, because each query string can have multiple institution matches in it.
We use the Python flashgeotext library to find geonames in the query string, and we save these for later. Geonames are often scattered all over the string, so unfortunately we can't connect them to specific tokens, but they can still be useful for disambiguating common institution names.
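For illustration, here's a minimal sketch of that geoname extraction with flashgeotext; the affiliation string is made up, and the exact output shape may vary by library version:

```python
from flashgeotext.geotext import GeoText

geotext = GeoText()

# Hypothetical affiliation string.
affiliation = "Dept. of Physics, University of Toronto, Toronto, ON, Canada"

# extract() returns the cities and countries it recognizes in the text.
found = geotext.extract(input_text=affiliation)
geonames = list(found["cities"]) + list(found["countries"])
# e.g. ["Toronto", "Canada"]
```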
We split the query string into tokens. Each token represents a potential institution name. Here's the tokenization (a code sketch follows the list):
- Words in all-uppercase become tokens and are removed from the string.
- Words or phrases within parentheses become tokens and are removed from the string (the parentheses are discarded).
- The rest of the string is split on commas, semicolons, and any type of dash or hyphen (em dash, en dash, etc.).
- Any tokens shorter than 3 characters are removed.
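A sketch of what this tokenization could look like; the exact regexes are assumptions, not the app's actual code:

```python
import re

# Hyphen plus the common Unicode dashes (figure, en, em, horizontal bar).
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015"

def tokenize(query: str) -> list[str]:
    tokens = []

    # All-uppercase words become tokens and are removed from the string.
    tokens += re.findall(r"\b[A-Z]{2,}\b", query)
    query = re.sub(r"\b[A-Z]{2,}\b", " ", query)

    # Parenthesized phrases become tokens; the parentheses are discarded.
    tokens += re.findall(r"\(([^)]*)\)", query)
    query = re.sub(r"\([^)]*\)", " ", query)

    # Split the rest on commas, semicolons, and any kind of dash or hyphen.
    tokens += re.split(f"[,;\\-{DASHES}]", query)

    # Drop tokens shorter than 3 characters.
    return [t.strip() for t in tokens if len(t.strip()) >= 3]
```

For example, `tokenize("MIT; Dept. of Physics (Harvard University)")` would yield `["MIT", "Harvard University", "Dept. of Physics"]`.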
For each token, we:
- Collapse all whitespace to a single space.
- Lowercase all characters unless the word is all-uppercase.
- Remove all punctuation.
- Replace all accented characters with their unaccented equivalents.
For each normalized token, we look up the institution in the in-memory lookup table. What we do next depends on the number of matches (a sketch follows the list):
- 0 matches: We skip this token.
- 1 match: We add this institution to the list of matching Institutions that will be returned to the user.
- More than 1 match: We look for a match between any of the geonames found in the overall affiliation string and the location_name, country_subdivision_name, or country_name of each candidate institution. If there's exactly one match, we use that institution. If there's more than one match, we skip this token.
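Using the lookup tables described at the end of this document, the per-token logic might look like this sketch (the location field names come from the spec; everything else is assumed):

```python
def match_token(token: str, geonames: list[str]):
    """Sketch: resolve one normalized token to an Institution, or None."""
    ror_ids = normalized_name_to_ror_id.get(token, [])

    if len(ror_ids) == 0:
        return None                          # 0 matches: skip this token
    if len(ror_ids) == 1:
        return ror_id_to_record[ror_ids[0]]  # 1 match: use it directly

    # More than 1 match: disambiguate with geonames from the whole string.
    candidates = []
    for ror_id in ror_ids:
        inst = ror_id_to_record[ror_id]
        places = {inst.location_name, inst.country_subdivision_name, inst.country_name}
        if places & set(geonames):
            candidates.append(inst)

    # Exactly one geoname match wins; anything else skips the token.
    return candidates[0] if len(candidates) == 1 else None
```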
Same as GET /entities/institutions, but in a batch.
- Query parameters: none
- Request body:
queries
: A list of text strings to match against.
- Returns:
queries
: A list of queries with their matches. Each entry has:
query
: The original query
geonames
: A list of geonames found in the overall affiliation string
matches
: A list describing matching institutions. Each match has:
token
: The token we matched on
institution
: The matching Institution object
is_token_unique
: A boolean indicating whether the token was unique, or whether there were multiple institutions that matched
Run a batch of tests from a particular dataset.
dataset
: The dataset of tests to use, as defined in the "dataset" column of the tests CSV.
tests
: A list of tests to run. Each test has:
id
: The ID of the test.
query
: The text string to match against.
expected_entities
: A list of expected Institution objects.
- Returns:
meta
: A dictionary of metadata about the test run:
total
: The total number of tests run
passing
: The number of passing tests
failing
: The number of failing tests
performance
: A dictionary of performance metrics:
percentage_passing
: The percentage of passing tests
precision
: The precision of the tests
recall
: The recall of the tests
timing
: A dictionary of timing metrics:
total
: The total time taken to run the tests
setup
: The time taken to set up the tests
per_test
: The average time taken per test. This doesn't include setup time.
results
: A list of Test objects
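Presumably, precision and recall follow their standard definitions applied to the correct, overmatched, and undermatched lists described under the Test object below: precision = correct / (correct + overmatched) and recall = correct / (correct + undermatched), aggregated across all tests in the run. The spec only names these metrics, so this mapping is an assumption.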
When the test endpoint is called, it:
- Pulls all the tests from a CSV here: https://docs.google.com/spreadsheets/d/e/2PACX-1vR_sVx4ts9ndZJ6UP8mPqKd-Rw_v-_A_ShaIvgIE4QhmdPeNb5H7GUPZIBZiMEXvLax1iAChlH6Mk6W/pub?output=csv. Don't save the tests anywhere; load them fresh every time the endpoint is called.
- Filters the tests to run only the ones from the specified dataset.
- Runs each test by calling the same function used by GET /entities/institutions.
- Returns the results. The expected entities are hydrated into Institution objects using data from the ror_with_openalex.csv file.
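The first step above (pull the CSV fresh on every call, saving nothing) could look like this sketch; requests is an assumed dependency, and only the "dataset" column name is given by the spec:

```python
import csv
import io

import requests

TESTS_CSV_URL = "https://docs.google.com/spreadsheets/d/e/2PACX-1vR_sVx4ts9ndZJ6UP8mPqKd-Rw_v-_A_ShaIvgIE4QhmdPeNb5H7GUPZIBZiMEXvLax1iAChlH6Mk6W/pub?output=csv"

def load_tests(dataset: str) -> list[dict]:
    # Fetch fresh on every call; nothing is cached or written to disk.
    response = requests.get(TESTS_CSV_URL, timeout=30)
    response.raise_for_status()
    rows = csv.DictReader(io.StringIO(response.text))
    # Keep only the tests belonging to the requested dataset.
    return [row for row in rows if row["dataset"] == dataset]
```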
A QueryResult object contains:
query
: The original query
geonames
: A list of geonames found in the overall affiliation string
matches
: A list describing matching institutions. Each match has:
token
: The token we matched on
institution
: The matching Institution object
is_token_unique
: A boolean indicating whether the token was unique, or whether there were multiple institutions that matched
An Institution object contains:
id
: The OpenAlex ID of the institution
name
: The name of the institution
ror
: The ROR ID of the institution
alternate_names
: A list of alternate names for the institution
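Since the app is built on FastAPI (noted at the end of this document), these objects would naturally be Pydantic models. A sketch with assumed field types; "Match" is a made-up name for the items in matches:

```python
from pydantic import BaseModel

class Institution(BaseModel):
    id: str                      # OpenAlex ID
    name: str
    ror: str                     # ROR ID
    alternate_names: list[str] = []

class Match(BaseModel):          # hypothetical name for the items in "matches"
    token: str
    institution: Institution
    is_token_unique: bool        # False when several institutions matched the token

class QueryResult(BaseModel):
    query: str
    geonames: list[str]
    matches: list[Match]
```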
A Test object contains:
id
: The ID of the test
query
: The original query
is_passing
: A boolean indicating whether the test passed
results
: A dictionary with:
correct
: A list of QueryResult objects that matched the expected institution
overmatched
: A list of QueryResult objects that did NOT match the expected institution
undermatched
: A list of Institution objects that should have been matched, but weren't
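One way the correct / overmatched / undermatched split could be computed, reusing the models sketched above; treating "no overmatches and no undermatches" as passing is an assumption:

```python
def evaluate_test(matches: list[Match], expected: list[Institution]) -> dict:
    expected_ids = {inst.id for inst in expected}
    matched_ids = {m.institution.id for m in matches}

    correct = [m for m in matches if m.institution.id in expected_ids]
    overmatched = [m for m in matches if m.institution.id not in expected_ids]
    undermatched = [inst for inst in expected if inst.id not in matched_ids]

    return {
        # Assumed pass criterion: every expected institution found, no extras.
        "is_passing": not overmatched and not undermatched,
        "correct": correct,
        "overmatched": overmatched,
        "undermatched": undermatched,
    }
```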
The normalize function normalizes a string by:
- Collapsing all whitespace to a single space.
- Lowercasing all characters unless the word is all-uppercase.
- Removing all punctuation.
- Replacing all accented characters with their unaccented equivalents.
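A sketch of a normalize function implementing those four steps (the real implementation may differ):

```python
import re
import string
import unicodedata

def normalize(text: str) -> str:
    # Collapse all whitespace to a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase each word unless the word is all-uppercase.
    text = " ".join(w if w.isupper() else w.lower() for w in text.split(" "))
    # Remove all punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace accented characters with their unaccented equivalents.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For example, `normalize("Universität zu Köln")` returns `"universitat zu koln"`, while `normalize("MIT")` stays `"MIT"`.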
When the app boots, it builds the following dictionaries as lookup tables in memory:
normalized_name_to_ror_id
: - key: The normalized name of the institution (using the normalize function)
  - value: A list of ROR IDs that match that name
ror_id_to_record
: - key: The ROR ID of the institution
  - value: The Institution object
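A sketch of the boot-time table build, assuming ror_with_openalex.csv (mentioned in the test section) is the source; the column names, and the choice to index alternate names, are assumptions:

```python
import csv
from collections import defaultdict

normalized_name_to_ror_id: dict[str, list[str]] = defaultdict(list)
ror_id_to_record: dict[str, Institution] = {}

def build_lookup_tables(path: str = "ror_with_openalex.csv") -> None:
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            inst = Institution(
                id=row["openalex_id"],   # assumed column name
                name=row["name"],        # assumed column name
                ror=row["ror_id"],       # assumed column name
                alternate_names=row["alternate_names"].split("|")
                if row.get("alternate_names") else [],
            )
            ror_id_to_record[inst.ror] = inst
            # Index the primary name (and, by assumption, each alternate name).
            for name in [inst.name, *inst.alternate_names]:
                normalized_name_to_ror_id[normalize(name)].append(inst.ror)
```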
The app is built in Python 3, using the FastAPI framework, and is deployed on Vercel.
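Putting it together, the FastAPI wiring could look roughly like this; extract_geonames is a hypothetical wrapper around the flashgeotext sketch above, and per-token uniqueness tracking is elided:

```python
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
def startup() -> None:
    # Build the in-memory lookup tables once, when the app boots.
    build_lookup_tables()

@app.get("/entities/institutions")
def get_institutions(query: str) -> QueryResult:
    geonames = extract_geonames(query)  # hypothetical helper
    matches = []
    for token in tokenize(query):
        institution = match_token(normalize(token), geonames)
        if institution is not None:
            matches.append(Match(token=token, institution=institution,
                                 is_token_unique=True))  # uniqueness tracking omitted
    return QueryResult(query=query, geonames=geonames, matches=matches)
```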