Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nucleotide-count: add case-agnostic test #959

Closed

Conversation

Vankog
Copy link
Contributor

@Vankog Vankog commented Oct 17, 2017

Splitting #895 into multiple PRs.

This adds a test case that ensures that lower- and uppercase inputs do not matter and don't interfere with each other. You can use both interchangeably.

Reasoning:

  1. "Postel's law" (a.k.a. "robustness principle")
    https://en.wikipedia.org/wiki/Robustness_principle
  2. Also, in everyday life, we got used to ignore cases. I see no reasoning why lower-case inputs should be in any way incorrect for this exercise.
  3. The test is located near the end of the spec to cover some generic special cases as part of the finishing touches of the exercise.

@Vankog Vankog force-pushed the nucleotide-count_add_case_agnostic_test branch from de6356a to 8a9a71c Compare October 17, 2017 12:44
@@ -50,6 +50,17 @@
}
},
{
"description": "is case-agnostic against a strand of mixed-case nucleotides",
"property": "nucleotideCounts",
"strand": "aGcTtTtCaTtCtGaCtGcAaCgGgCaAtAtGtCtCtGtGtGgAtTaAaAaAaGaGtGtCtGaTaGcAgC",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would AaCcGgTt also be valid for this test?

Is there a reason why this needs to be so long and mixed up?
If I'm debugging my program I'd like to be able to manually verify the expected inputs and outputs.

Copy link
Contributor Author

@Vankog Vankog Oct 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Insti
The input was taken directly from the original case above ("strand with multiple nucleotides").
I wanted to keep it consistent:
-> AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
-> aGcTtTtCaTtCtGaCtGcAaCgGgCaAtAtGtCtCtGtGtGgAtTaAaAaAaGaGtGtCtGaTaGcAgC
Any other suggested input is welcomed as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also object to the case above, but wasn't asking you to change that one.

@NobbZ
Copy link
Member

NobbZ commented Oct 17, 2017

I've never seen those in lowercase when talking about nucleotides, therefore I'm against that change

@rpottsoh
Copy link
Member

I agree with @NobbZ unless a mixed case example can be sited I would expect there would be a screening process ahead of the solution to screen for bad input.

@Vankog
Copy link
Contributor Author

Vankog commented Oct 17, 2017

@NobbZ @rpottsoh
Well, this brings us back to our discussion here: #902
About where, what and how much we should check for in tests.

@NobbZ
Copy link
Member

NobbZ commented Oct 17, 2017

Well, only halfway… If it is a valid input, we need to test it, if it is invalid input then we need to discuss if it is worth testing as an additional invalid case or not compared to those invalids that currently exist…

Since I do not have domain knowledge about biology or biochemistry, I have to assume, that those written representations of DNA that I've seen so far were correct and canonical. Among those I've never seen any lower case letter. Therefore I also assume given inputs with a lowercase character as invalid.

If my assumption is right, we need to decide wether we should test invalids or not, but if we do, we should explicitely for lowercase nucleotides because of

[…] in everyday life, we got used to ignore cases.

@rpottsoh
Copy link
Member

This is the best example I could quickly find that shows mixed case.

@Vankog
Copy link
Contributor Author

Vankog commented Oct 17, 2017

I found this:
http://www.sbcs.qmul.ac.uk/iubmb/misc/naseq.html

As the present recommendations present unique alphabetic symbols for each nucleotide combination, the use of upper- and lower-case letters as equivalent does not lead to confusion. However, such use may cause confusion between r (ribo-) and R (purine), and care must be taken in those rare cases where the various symbols are used in combination. In general, it should be emphasised (i) that upper-case symbols are advocated. and (ii) that the present recommendations are not intended to prejudice any possible future use of contrasting upper- and lower-case letters for specific purposes.

There is, however, a problem of discriminating between the upper-case letters G and C on poorly copied sequences. Nevertheless, the use of alternative symbols for G (such as a barred-G, ) is not recommended. Discrimination between the lower-case letters is much clearer.

When searching for it, there are multiple sources that suggest it is totally fine to use mixed cases to increase readability of some sort.
Quite interesting, I didn't knew this.

Copy link
Contributor

@Insti Insti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description lists the valid DNA symbols as 'ACGT'. Since this is not a case comparison exercise I agree with the others that mixed case input need not be tested here.

@Vankog
Copy link
Contributor Author

Vankog commented Oct 21, 2017

@Insti Sorry, am I mistaken? @NobbZ suggested we should check if mix-case is a valid use-case for this domain and act accordingly. And @rpottsoh found a valid case, so it seems he might agree with this assessment. Further research shows that mixed case is indeed a common use-case for DNA-strands.

I am not sure what to make out of this now (besides of choosing a better input).
We could also simply extend the description (in either way) to make it clearer.

@Insti
Copy link
Contributor

Insti commented Oct 21, 2017

@Vankog what is this exercise about?
What is the main aspect of this exercise that is important?

@Vankog
Copy link
Contributor Author

Vankog commented Oct 21, 2017

To me the main aspect is to count the number of occurrences of certain nucleotides.

How these nucleotides are represented is part of the domain specific language. With that in mind it turned out the nucleotides in question are usually presented inside sequences composed of A, C, G, T in upper- as well as lower-case and even mixed-case.

This, in combination with the robustness principle mentioned in the first post, is why I want to make sure the user's solutions do not fail silently on this (implicit) domain specific requirement nor everyday laziness.
Hence the test. Further, I think we should also (or: "at least") update the description to match what we will agree on in the end.

@Insti
Copy link
Contributor

Insti commented Oct 21, 2017

the main aspect is to count the number of occurrences of certain nucleotides.

I agree with this.

the nucleotides in question are usually presented inside sequences composed of A, C, G, T

And this.

as well as lower-case and even mixed-case.

This is where we diverge. The description specifies the characters A, C, G, T. These are only ones we need to use in our test inputs, and testing for anything else is an unnecessary distraction.

@Vankog
Copy link
Contributor Author

Vankog commented Oct 21, 2017

That is why I suggest to adjust the description as well, because it does not match with particular domain specifics.
I mean, for example, @Dysp also pointed out in #902 (comment), that the description does not really fit the domain specifics and created #913 to improve the description.

In the same vain I see it suitable to also fix the case-problem in either way.

@Dysp
Copy link
Contributor

Dysp commented Oct 21, 2017

In all papers I've read and in text books, uppercase is used, but as it is pointed out above, there is no rule against using lower case. But I honestly do not think it is of great importance whether it's lower or uppercase from a bio scientific point of view.

@Insti
Copy link
Contributor

Insti commented Oct 22, 2017

@Vankog if you feel that the description could be improved, feel free to create a separate PR that addresses the issue.

I do not think there is any problem with the description. It clearly specifies the 4 symbols to be used.

@Vankog
Copy link
Contributor Author

Vankog commented Nov 2, 2017

Im on vacation for the next two weeks. I'll take care of his then.

@Insti
Copy link
Contributor

Insti commented Nov 2, 2017

I'll close this PR for now, we can re-open it again later if the PR to the description makes it relevant again in the future.

Thanks for your contributions ❤️

Have a good holiday.

@Insti Insti closed this Nov 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants