Question: Different method for release approach #15
Comments
I spent a little more time looking through that mass of data and found that a lot of the items were missing the 'begin' and 'end' keys. Also, some of the dates are way off: I saw a 1925 for both 'begin' and 'end' on one of the recordings. Your method seems by far the best. I just wish that it allowed you to fetch the whole recording for each of the items in the recording-relation-list when you grab the work. Oh well. Thanks again.
Yes, unfortunately there is a lot of incomplete data, and fetching all of the releases is the only way to guarantee good accuracy. The problem is that each one is a separate request, which makes it very slow. If MusicBrainz implemented something like GraphQL, the number of requests could be reduced significantly, but for now a good workaround is to run your own mirror.
@phirestalker I have made a fair few changes to close all open issues. If you'd like to test them, let me know whether the plugin still works as intended with the changes. Regarding this issue, I thought of a different, more heuristic approach. It is theoretically possible to perform a search for releases with a specific artist ID and the title of the track. Such a search can return up to 100 matches, including the date fields for the release, so this information can be retrieved in one request; see example below. However, the search can return unrelated items, which would somehow have to be detected and removed. It also remains to be seen whether the artist ID is set correctly on the various releases: it might work properly for well-known artists but not be accurate enough for more obscure ones. Still, it seems worth investigating.
That sounds interesting, and it should be relatively trivial to filter the unrelated ones. Another interesting thing I found while rebuilding my local musicbrainz server
To me, that sounds like a materialized table that has the exact data we want. I wonder if they make it easily available through the API somehow? EDIT:
The problem with filtering the search is with slightly different versions of the same song. For example, the matched recording you picked has A different problem occurs when the title is slightly different because it's two songs in one, or it includes Regarding the For example, you perform a search for recordings with a specific artist ID that match a specific release/recording title, and look for the

    import requests
    from datetime import datetime
    from urllib.parse import quote_plus

    artist_id = "83d91898-7763-47d7-b03b-b92132375c47"
    release_title = "wish you were here"
    url = f"https://musicbrainz.org/ws/2/recording/?query=arid:{quote_plus(artist_id)}%20AND%20recording:%22{quote_plus(release_title)}%22&limit=100&fmt=json"
    response = requests.get(url)
    data = response.json()

    oldest_date = datetime.max
    for recording in data['recordings']:
        # Check if 'first-release-date' is in the recording data
        if 'first-release-date' in recording:
            # Dates can be partial, so try full date, then year-month, then year only
            for fmt in ["%Y-%m-%d", "%Y-%m", "%Y"]:
                try:
                    release_date = datetime.strptime(recording['first-release-date'], fmt)
                except ValueError:
                    continue
                if release_date < oldest_date:
                    oldest_date = release_date
                break  # Parsed successfully, no need to try the other formats

    if oldest_date == datetime.max:
        print("No release date found")
    else:
        print(f"The oldest release date is: {oldest_date.strftime('%Y-%m-%d')}")
It may take me a couple of days, but I will incorporate this into a function and run it alongside your releases method, then compare the oldest date each one returns. I will try to get some metrics on when they match, and print a list of the ones that don't match so we can analyze those cases. I may attempt to contact MetaBrainz and ask them how this table is generated: from user entries, or from data already in the database using a query similar to how you do it now. I guess I will need to save some data for the ones that don't match so we know if it is using the wrong recordings like you mentioned.
Quick question. Your code uses a search function to get the first-release-date field. Does that mean it is not returned when using get_recording_by_id in musicbrainzngs? If so, I wonder if that is a failing of the library or if MusicBrainz does not offer it through those queries.
I now notice that https://musicbrainz.org/ws/2/recording/8f3471b5-7e6a-48da-86a9-c1c07a0f47ae?fmt=json However, this doesn't solve the fundamental problem of needing to fetch every single recording (or enough of them) to get the real original date for a song. Besides, it isn't included with all recordings, because sometimes the releases do not have a date associated with them: https://musicbrainz.org/ws/2/recording/aae44009-6745-40ae-a477-f215c4f76488?fmt=json It appears musicbrainzngs does not return this field:
I guess we can't use that until musicbrainzngs adds it. I noticed that recordings that have first-release-date have no other date field. I wonder if they are working to clean up the data so that the album has a release date and each track has a first-release-date linked to a work? That would be fantastic. I think if musicbrainzngs returned it, it might be OK (once we determine its accuracy) to use first-release-date if it exists and fall back to the user-set approach otherwise, or to make it another option for the approach. To run these tests I am creating a new OldestDatePlugin object and setting config options directly. How would I be able to set the MB host and rate limit, since those are set at the beets level? EDIT:
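The "use first-release-date if it exists, otherwise fall back" idea from the comment above could be sketched roughly like this. It assumes the recording has already been fetched as a JSON dict from the web service. The names `parse_partial_date`, `oldest_date` and the `fallback` callable are hypothetical and not part of the plugin or of musicbrainzngs; the fallback stands in for whatever slower method (e.g. the releases approach) is configured.

```python
from datetime import date
from typing import Callable, Optional


def parse_partial_date(value: str) -> Optional[date]:
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'; missing parts default to 1."""
    if not value:
        return None
    try:
        numbers = [int(part) for part in value.split("-")]
    except ValueError:
        return None
    if not 1 <= len(numbers) <= 3:
        return None
    year, month, day = (numbers + [1, 1])[:3]
    try:
        return date(year, month, day)
    except ValueError:
        return None


def oldest_date(recording: dict,
                fallback: Callable[[str], Optional[date]]) -> Optional[date]:
    """Prefer the recording's own first-release-date, else ask the fallback."""
    found = parse_partial_date(recording.get("first-release-date", ""))
    if found is not None:
        return found
    # Hypothetical hook: the slower per-release lookup would go here.
    return fallback(recording["id"])
```

Whether the field is accurate enough to be trusted first is exactly what still needs to be measured, so this only shows the control flow.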
Man, I started this test on Thursday and it is still running. Because of the way os.walk() works I can't wrap a progress bar around it, so I have no idea how long it has left to run. No big deal, but I wanted to keep you updated.
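One way around the os.walk() progress problem is to walk the tree once just to collect the paths, so the total is known before the slow per-file work starts. This is only a sketch: `iter_audio_files`, `walk_with_progress`, the extension list and the `handle` callback are made-up names for illustration, not anything from the plugin.

```python
import os


def iter_audio_files(root, extensions=(".mp3", ".flac")):
    """Yield paths under root whose names end with one of the extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def walk_with_progress(root, handle):
    """Call handle(path) for every audio file and print 'done/total'."""
    paths = list(iter_audio_files(root))  # cheap first pass: only lists files
    total = len(paths)
    for index, path in enumerate(paths, start=1):
        handle(path)  # the slow part, e.g. querying MusicBrainz
        print(f"\r{index}/{total}", end="", flush=True)
    print()
    return total
```

Listing the files is fast compared to the network requests, so the extra pass should cost very little.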
Which fields are you planning on comparing? The
I am fetching a certain recording and its first-recording-date. I then feed the same MB id to your plugin with the releases approach and save the date it gives. I will show the number of recordings missing a first-recording-date, and how many of those got a date with your method. I will also show when the dates differ between methods and which one is older. I'm not sure how much deeper I can go, since your function only returns the date and not the recording or release it got the date from. Let me know what other metrics or data you would like.
Finally! Well, that was fun. Let me know if you want me to test the idea you had before I saw first-recording-date.
Can you elaborate more on what exactly you tested? Did you just compare getting the first-recording-date or first-release-date for a single recording at a time?
I took the id from the tags of each song on my computer and used that to query the recording info directly. In that info is the first-recording-date. I saved that, then used the id to call _get_oldest_date and saved the date returned from that. I did not test for first-release-date. I am not sure how to do that properly. I guess I would find all the releases of a recording and get the first-release-date for each of them, saving the oldest. I am willing to test many more methods. I can always expand the sqlite database to store more information. I just don't know what else to try out.
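For reference, a comparison store like the one described could look something like the sketch below. The table layout and the function names (`create_store`, `record`, `summarize`) are assumptions, not the actual test script. Dates are assumed to be normalized to full 'YYYY-MM-DD' strings before storing, so that plain text comparison in SQLite matches chronological order.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) the comparison database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS comparison (
               recording_id TEXT PRIMARY KEY,
               field_date   TEXT,  -- date taken from the recording itself
               plugin_date  TEXT   -- date returned by the plugin method
           )"""
    )
    return conn


def record(conn, recording_id, field_date, plugin_date):
    """Store both dates for one recording; None means no date was found."""
    conn.execute(
        "INSERT OR REPLACE INTO comparison VALUES (?, ?, ?)",
        (recording_id, field_date, plugin_date),
    )


def summarize(conn):
    """Count missing dates and which method returned the older date."""
    row = conn.execute(
        """SELECT
               COALESCE(SUM(field_date IS NULL), 0),
               COALESCE(SUM(field_date IS NULL AND plugin_date IS NOT NULL), 0),
               COALESCE(SUM(plugin_date < field_date), 0),
               COALESCE(SUM(field_date < plugin_date), 0),
               COALESCE(SUM(field_date = plugin_date), 0)
           FROM comparison"""
    ).fetchone()
    keys = ("missing_field", "recovered_by_plugin",
            "plugin_older", "field_older", "same")
    return dict(zip(keys, row))
```

Comparisons involving NULL are skipped by SUM, so rows with a missing date only show up in the first two counters.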
No, What is worth investigating however is performing a search with an artist id and the title of the recording (i.e. track name), then getting the list of all
I reversed those. My test does indeed use the first-release-date from the recording and not first-recording-date. I'm not sure how I mixed them up. I will try to start work on testing the search method you mentioned in a little while. Luckily I won't have to redo getting the date using the plugin method. I might use fuzzywuzzy to match the titles to make it a little easier, and remove matches that contain words like remaster. I will store the title of the matched recording along with the first-release-date. Let me know if that sounds good.
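A rough sketch of that filtering step, using the standard library's difflib as a stand-in for fuzzywuzzy (the scores will not be identical to fuzzywuzzy's). The word list, the 0.9 threshold and the function names are all assumptions to be tuned; candidates are assumed to be dicts with a 'title' key, as in the search results.

```python
from difflib import SequenceMatcher

# Hypothetical list of words that mark a different version of a song.
VARIANT_WORDS = ("remaster", "remix")


def is_variant(title):
    """True if the title mentions a remaster, remix or similar."""
    lowered = title.lower()
    return any(word in lowered for word in VARIANT_WORDS)


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def filter_candidates(wanted_title, candidates, threshold=0.9):
    """Keep search results whose title is close enough and not a variant."""
    kept = []
    for candidate in candidates:
        title = candidate.get("title", "")
        if is_variant(title):
            continue
        if similarity(wanted_title, title) >= threshold:
            kept.append(candidate)
    return kept
```

One caveat: if the track being looked up is itself a remix, dropping every variant would throw away the correct match, so that case probably needs separate handling.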
Yes, I reckon there can be a list of words like remaster and remix that are removed from the title; then Lucene search syntax can be used to do a fuzzy search.
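As a sketch of what that could look like: strip the version words from the title, escape Lucene's special characters, and append `~` to each word for a fuzzy term match. The helper names and the exact word list are assumptions, and how well the fuzzy operator behaves on the MusicBrainz search server would need to be checked against real results.

```python
import re
from urllib.parse import urlencode

# Bracketed notes such as "(2011 Remaster)" and bare version words.
BRACKETED_VARIANT = re.compile(
    r"[\(\[][^\)\]]*(?:remaster|remix)[^\)\]]*[\)\]]", re.IGNORECASE)
BARE_VARIANT = re.compile(r"\b(?:remaster(?:ed)?|remix(?:ed)?)\b", re.IGNORECASE)
# Characters and operators with special meaning in Lucene query syntax.
LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def clean_title(title):
    """Remove remaster/remix markers and collapse whitespace."""
    title = BRACKETED_VARIANT.sub(" ", title)
    title = BARE_VARIANT.sub(" ", title)
    return " ".join(title.split())


def lucene_escape(term):
    """Backslash-escape Lucene special characters in a single term."""
    return LUCENE_SPECIAL.sub(r"\\\1", term)


def build_query(artist_id, title):
    """Build a query with one fuzzy (~) term per word of the cleaned title."""
    terms = [f"{lucene_escape(word)}~" for word in clean_title(title).split()]
    return f"arid:{artist_id} AND recording:({' AND '.join(terms)})"


def build_url(artist_id, title, limit=100):
    """URL for the recording search endpoint, JSON output."""
    params = {"query": build_query(artist_id, title), "limit": limit, "fmt": "json"}
    return "https://musicbrainz.org/ws/2/recording/?" + urlencode(params)
```

The results would still need to be filtered afterwards, since a fuzzy search is more likely to return unrelated recordings.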
Are you saying that I can clean the title of my track first and then trust the search results from MB using that cleaned title?
I only removed remix and remaster from my titles before the search. My results are as follows: EDIT: forgot some metrics
I used the wrong field for comparison. The one that stores the date from your plugin is oldest_date. So here are some correct metrics.
What exactly are those numbers showing here?
There are 445 items where your plug-in has the oldest date, and only 28 times where your plug-in did not find a date. Also, there are 1343 times where the artist ID search method didn't get the exact same title. Let me know what other information you would like.
How many times did the search method not find a date as old as the releases method?
This line
I am using the releases approach since you mentioned it is more accurate. I had to spin up my own MusicBrainz server because HOLY CRAP is it slow. It is much faster with my own server, but still pretty slow. At least it should finish today.
I was looking through the code and an example recording that has an associated work. I realize we must fetch the work from the work id. I noticed that each item in the 'recording-relation-list' has a 'begin' and 'end' key with a date. I was wondering why the 'begin' or 'end' dates are not used instead of fetching the recording for each item in 'recording-relation-list'? Are those dates not reliable?
For example:
Poker Face by Lady Gaga
Recording ID: 35618652-47d7-495d-806a-ee1b88eeb776 has the associated work id: 9f6363b8-7df7-3732-b2b6-94c0f02e0bde
and when using musicbrainzngs.get_work_by_id('9f6363b8-7df7-3732-b2b6-94c0f02e0bde', ['recording-rels'])['work'], I get the following data.
If I am understanding the structure correctly, you could filter out the covers or whatever, and then each one has the 'begin' and 'end', and under 'recording' is the title, which you could match with the current title to make sure it is the same.
I would like to know your thoughts on this approach. Did you consider it and find that it did not give accurate results? Or might it speed things up by using fewer requests to MB?
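To make the idea concrete, here is a minimal sketch of what is being proposed. It assumes each relation is a dict with optional 'begin' and 'attribute-list' keys and a nested 'recording' dict with a 'title', which is how the data looked in this example; the function name and the list of attributes to skip are made up for illustration.

```python
def oldest_relation_date(work, title, skip_attributes=("cover", "live")):
    """Return the earliest 'begin' date among relations matching the title.

    Dates are compared as strings, so a year-only value such as '2008'
    sorts before any fuller date within the same year.
    """
    skip = set(skip_attributes)
    oldest = None
    for relation in work.get("recording-relation-list", []):
        # Skip covers, live versions and similar (assumed attribute names).
        if skip & set(relation.get("attribute-list", [])):
            continue
        recording_title = relation.get("recording", {}).get("title", "")
        if recording_title.casefold() != title.casefold():
            continue
        begin = relation.get("begin")
        if not begin:
            continue  # this relation has no date at all
        if oldest is None or begin < oldest:
            oldest = begin
    return oldest
```

This needs only the one request for the work, but it can only be as good as the 'begin' dates stored on the relations, and it returns None when no matching relation has one.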
Great plugin by the way.