-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use haplo.stats to estimate haplotypes #28
Comments
@sjmack the haplo-stats branch is ready for some initial testing:
You can run one of the tests:
This will print some diagnostic information with output etc. Caveats: this is not yet integrated with the .ini file or the txt/XML output, this is very far from being finished. There is quite a bit of work to go make sure all the measures line up and computing the correct haplotype names and frequency information. The way that haplo-stats calculates/represents haplotypes is quite different to emhaplofreq. But this prototype shows that, in principle, we can get all the information we need at least for haplotype frequency (LD is a different matter, @rsingle can maybe comment on that). Testing very large data files is yet another step beyond that again. |
Heres what I got. That looks like haplo.stats output!
|
Excellent, so we know now it works on MacOS as well. |
@sjmack try a new If you have the R
The haplotypes are in the |
Okay. here are the results of
Here are the results of
That looks the same as save.em in R
|
Interesting. It's consistent with between Python and R on the same platform. But it's not consistent cross-platform. For example on Linux, my output for those R commands matches my Python test case:
It only affects the hap1code, hap2code, at least in this particular example. I suspect there might be some subtle platform-dependent effect, either from the random number seed or in how it handles integers. In any case, it's an issue that affects the |
I should try installing PyPop on a Windows machine to see if we get still different results. In my windows machine, with R haplo.stats I get these results:
So here again are a different assignment ordering. It isn't random; here's a re-run on my Mac install of R:
|
Yes, it definitely a platform-specific issue and it's in the R code itself, not PyPop per se. So I'm not going to spend too much more effort investigating it right now, and I'm going to continue on the integration (adding the .ini section, adding to the XML output etc), but will make a note of it. I suspect it may not matter much, so long as the actual haplotype frequencies and probabilities, etc. are consistent from platform to platform. It might make sense to remove hap1code and hap2code test for the moment. |
@sjmack OK, I've started converting the internal data structures into the XML format. Try running:
The idea is to make sure I'm mapping the correct values over and decide what is and isn't appropriate to port from emhaplofreq XML output. For comparison here is a sample XML from the UchiTelle population (I've trimmed some of the stuff in the various sections to keep it short):
e.g do we want to capture all the the info in the CDATA sections (at least in the cases that there is a haplo.stats-equivalent)? For example, does the |
Here's what I get for
I think the CDATA summaries are important -- we should keep them. I also think that keeping/reporting the per-subject haplotype assignments is important too, and would increase PyPop's utility. |
Great, excellent to see it's working for you. For On this subject, given the current resources, I think we should start outlining what should be in the haplo-stats output. I propose we consider three (or maybe more) categories:
Also, currently the logic for how selecting specified loci, allPairwise options, and LD options all interact together are somewhat convoluted, both in terms of implementation, as well as user documentation. It would be good to revisit that logic and clean it up a bit. Since we're likely to end up deprecating the |
I was sort of hurrying when I posted my previous comment in the thread, and I wasn't looking as closely as I should have at what was in CDATA. What was focusing on is quoted below, which is actually before the CDATA.
So the |
For stretch goal 2, since @rsingle already is involved, and R dependency is already being built into PyPop, we may want to consider incorporating his assymLD R package (and future iterations that pull in more LD measures). |
OK, thanks for the clarification on the As I mentioned, doing the |
I have been able to remove the |
Yes. Its very important to remove the missing data. For BIGDAWG (which uses haplo.stats) the table of all haplotype possibilities grows rapidly as the # missing increases, and we recommend # missing <=2 (preferably 0). Forcing it to be 0 for PyPop makes sense. |
@sjmack great, good to clarify that, it will make the module implementation considerably simpler. |
For all following, a preliminary pairwise LD estimation is now in |
@sjmack and @rsingle: new version in
To test a simple run, with a two-locus data set, try:
This will do both the full data set and allPairwise (which in this particular case, is the same thing). A 3-locus dataset try:
This will do all-pairs of loci (3 in this case) as well the haplotype estimation for the 3-locus haplotypes. Caveats:
After 3) is done and looks OK, I'd like to merge to master and start testing this in earnest. It would be good if the |
This looks good and output for LD matches from other programs. I've
tried some larger files, but it is hard to compare exact LD stats. I
think this is because there is only one starting condition being used
currently (and other programs use multiple - thus slightly different
estimates).
This is not an issue for the smaller datasets where there are probably
no local maxima for the log likelihood.
Rich
…On 8/17/2017 11:19 PM, Alex Lancaster wrote:
@sjmack <https://github.com/sjmack> and @rsingle
<https://github.com/rsingle>: new version in |haplo-stats| branch to
test that can be run end-to-end via the |bin/pypop.py| script. This
implements two options: |lociToEstHaplo| which works more-or-less as the
version in |Emhaplofreq|, and |allPairwise| which does pairwise LD.
|git checkout haplo-stats ./setup.py test |
To test a simple run, with a two-locus data set, try:
|./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini
tests/data/Test_Small_Haplostats.pop |
This will do both the full data set and allPairwise (which in this
particular case, is the same thing). A 3-locus dataset try:
|./bin/pypop.py -c tests/data/Test_Larger_Haplostats.ini
tests/data/Test_Larger_Haplostats.pop |
This will do all-pairs of loci (3 in this case) as well the haplotype
estimation for the 3-locus haplotypes.
Caveats:
1. the output from the command-line is a mess, it sends lots of
debugging stuff to the console, ignore for the moment
2. the text output works but there are duplicate entries in some of the
output tables, this is because sometimes we are repeating the
analyses, there is some complicated logic for the emhaplofreq which
I'm loathe to duplicate here. The main thing to check is whether the
values are all there.
3. this does /not/ yet handle missing data by removing rows in any
given submatrix that have missing data, this requires more
|StringMatrix| jiggery-pokery, so for the moment, make sure no rows
have missing data. This will be the next priority item after you've
kicked the tires on this.
4. |d_ij| stuff is not yet integrated
5. probably a heap more stuff I'm forgetting
After 3) is done and looks OK, I'd like to merge to master and start
testing this in earnest. It would be good if the
|Test_Larger_Haplostats.pop| file could be modified to match a real
example so we can do more unit testing on it. I just took the
|haplo.stats| example we've been using (originally from the |hla.demo|
package) and made up an extra |C| locus with bogus data.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFdJcnkKzqHlcPEoXTZa6-JzJbeHeqljks5sZQK_gaJpZM4OfvJ7>.
--
=======================================
Richard M. Single, Ph.D.
Dept. of Mathematics and Statistics
University of Vermont
Burlington, VT 05405
802-656-8631, Richard.Single@uvm.edu
=======================================
|
As a follow-up ... below is the summary of iterations from one of the
larger files I was running. You can see that the max (-76.729274) is
achieved from a few starting conditions, but there are lots of local
maxima. Comparing results from init.cond 0 to init.cond 4 for LD will
give some lack of agreement (general trends will be similar though).
…--- Iteration Summary for Original Data
-------------------------------------------
Init. condition 0: LL after 15 iters: -85.047040, error_flag: 0
Init. condition 1: LL after 76 iters: -77.252522, error_flag: 0
Init. condition 2: LL after 15 iters: -80.025105, error_flag: 0
Init. condition 3: LL after 61 iters: -78.638816, error_flag: 0
Init. condition 4: LL after 12 iters: -76.729274, error_flag: 0
Init. condition 5: LL after 13 iters: -78.638816, error_flag: 0
Init. condition 6: LL after 12 iters: -84.707242, error_flag: 0
Init. condition 7: LL after 23 iters: -76.729274, error_flag: 0
Init. condition 8: LL after 17 iters: -80.888127, error_flag: 0
Init. condition 9: LL after 14 iters: -77.252522, error_flag: 0
Init. condition 10: LL after 15 iters: -78.638816, error_flag: 0
...
On 8/18/2017 10:55 AM, Richard M. Single wrote:
This looks good and output for LD matches from other programs. I've
tried some larger files, but it is hard to compare exact LD stats. I
think this is because there is only one starting condition being used
currently (and other programs use multiple - thus slightly different
estimates).
This is not an issue for the smaller datasets where there are probably
no local maxima for the log likelihood.
Rich
On 8/17/2017 11:19 PM, Alex Lancaster wrote:
> @sjmack <https://github.com/sjmack> and @rsingle
> <https://github.com/rsingle>: new version in |haplo-stats| branch to
> test that can be run end-to-end via the |bin/pypop.py| script. This
> implements two options: |lociToEstHaplo| which works more-or-less as
> the version in |Emhaplofreq|, and |allPairwise| which does pairwise LD.
>
> |git checkout haplo-stats ./setup.py test |
>
> To test a simple run, with a two-locus data set, try:
>
> |./bin/pypop.py -c tests/data/Test_Small_Haplostats.ini
> tests/data/Test_Small_Haplostats.pop |
>
> This will do both the full data set and allPairwise (which in this
> particular case, is the same thing). A 3-locus dataset try:
>
> |./bin/pypop.py -c tests/data/Test_Larger_Haplostats.ini
> tests/data/Test_Larger_Haplostats.pop |
>
> This will do all-pairs of loci (3 in this case) as well the haplotype
> estimation for the 3-locus haplotypes.
>
> Caveats:
>
> 1. the output from the command-line is a mess, it sends lots of
> debugging stuff to the console, ignore for the moment
> 2. the text output works but there are duplicate entries in some of the
> output tables, this is because sometimes we are repeating the
> analyses, there is some complicated logic for the emhaplofreq which
> I'm loathe to duplicate here. The main thing to check is whether the
> values are all there.
> 3. this does /not/ yet handle missing data by removing rows in any
> given submatrix that have missing data, this requires more
> |StringMatrix| jiggery-pokery, so for the moment, make sure no rows
> have missing data. This will be the next priority item after you've
> kicked the tires on this.
> 4. |d_ij| stuff is not yet integrated
> 5. probably a heap more stuff I'm forgetting
>
> After 3) is done and looks OK, I'd like to merge to master and start
> testing this in earnest. It would be good if the
> |Test_Larger_Haplostats.pop| file could be modified to match a real
> example so we can do more unit testing on it. I just took the
> |haplo.stats| example we've been using (originally from the |hla.demo|
> package) and made up an extra |C| locus with bogus data.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#28 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AFdJcnkKzqHlcPEoXTZa6-JzJbeHeqljks5sZQK_gaJpZM4OfvJ7>.
>
>
--
=======================================
Richard M. Single, Ph.D.
Dept. of Mathematics and Statistics
University of Vermont
Burlington, VT 05405
802-656-8631, Richard.Single@uvm.edu
=======================================
|
Yes, the number of iterations is set to 1 currently, since we've been mainly in testing mode. I will add the iterations as an .ini file option, but we should also have some kind of default, probably whatever |
@rsingle OK,
I output this in the |
Hi Alex,
That helps. I can match up the LD statistics now from running a test on
Bluemoon. I do have some questions below.
Regarding iters, there is no record kept of the number of iters from
haplostats. max.iter is passed as an argument to the .c program, but it
is only used to bail out of iterations if they go on for too long. The
actual # of iters is not returned.
I notice that there are 2 entries for LD stats (see output below) for
the locus pair. I'm guessing that this corresponds to the first initial
condition and then the results from the "best". Is that right? The 2nd
line (D'=0.638008 Wn=0.602736) agree with emhaplofreq on Bluemoon.
Below the LD results is the screen output from running pypop. What are
the 3 numbers printed for each row (there were 18 rows for this run) Are
they the seeds for the random # generator maybe? I see the "found a
better lnlikelihood! -1391.93797682" comment. That is the same LL as
shown for the first row of LD stats (rounding to 2 decimal places). For
testing it might help to have a few more decimal places. I'm guessing
that the LLs are actually slightly different. There is a pretty big diff
in Wn values for the 2.
Pairwise LD estimates
---------------------
Locus pair D D' Wn ln(L_1) ln(L_0) S # permu
p-value
A:B *0.640559 0.554153 -1391.94 NaN -
-
A:B *0.638008 0.602736 -1391.94 NaN -
-
rich@ubuntu:~/Documents/github/pypop$ ./bin/pypop.py -c
emhaplofreq/Test_LargerPopId_Haplostats.ini emhaplofreq/test-guarani.pop
LOG: Data file has no header data block
23455 13636 17703
12837 26012 11956
11730 23343 26651
found a better lnlikelihood! -1391.93797682
21829 25797 29118
16869 14218 25712
17220 13891 12363
12825 13287 16863
23780 18907 11446
21203 24220 22737
23987 21573 28827
23225 19469 13839
25545 15151 26443
13074 20115 19772
20621 15708 15035
11370 24345 12727
29129 21281 28240
15900 23370 22237
10985 23201 11131
…On 8/18/2017 5:05 PM, Alex Lancaster wrote:
@rsingle <https://github.com/rsingle> OK, |haplo.stats| seems to use 10
as the default, so I set that as the global default. Try a git pull. I
also added an .ini file option so the following should work
|[Haplostats] allPairwise=1 numInitCond=5 |
I output this in the |<iterConverged>5</iterConverged>| XML tag for the
moment, although I know this isn't technically correct, I wanted to get
it in there so we can check that it's done all the iterations, we can
change it later. Does |haplo.stats| have an equivalent of the number of
iterations until convergence? It wasn't clear to be from the code.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFdJcvGI6Up4wrMMrBhzYLG-DlKi8yxCks5sZfyLgaJpZM4OfvJ7>.
--
=======================================
Richard M. Single, Ph.D.
Dept. of Mathematics and Statistics
University of Vermont
Burlington, VT 05405
802-656-8631, Richard.Single@uvm.edu
=======================================
|
The dupes are I think, because it's outputting info to the XML twice: once for the haplotypes and LD for the specified loci (which is "*" which resolves to the pairs), then again in all-pairwise mode for LD. This is what I was talking about in caveat 2 in #28 (comment) they should be the same in principle, so I don't know why the stats are off so much. I'd run it several times without the all pairwise option just specifying the If we actually need the iterations from the C code, we could add it to the list of returned values, if you can find the right location they are generated. Although it could be a big timesink to do that, so how important is it to have that information for now? The found a better likelihood was debugging I put in to check the loop, check the |
Another thing, so the Looking in the C code it does record the the number of iterations reached in the local variable |
You are right about the number of initial conditions (n.tries) vs. number of iters. If we did record the number of iters, we would want it to correspond to the best log likelihood. So it would mean keeping track of it for each initial condition and swapping the value each time you find a better log likelihood. That said, I think this is a lower priority, especially since it means changing the c code.
Rich
…Sent from my iPad
On Aug 19, 2017, at 11:20 PM, Alex Lancaster ***@***.***> wrote:
Another thing, so the numInitCond (which is currently done in the Python code) is different to the number of iterations (done within the C), is that right? The iterations is for the convergence of the EM algorithm itself, and the numInitCond (n.tries in the R code) is for the number of times we restart the EM algorithm, is that right? If so, since there's going to be a different number of iterations for each condition, which one would we keep, just the last one that resulted in an improved log-likelihood ration?
Looking in the C code it does record the the number of iterations reached in the local variable n_iter in haplo_em_pin(), we could return this value from the function as well if need be.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
That makes sense and I could follow it in the Haplo.py code. I bumped up
the numInitCond to 15 and it took a lot of tries until I found
substantially different LD results for the 2 versions (lociToEstHaplo=*
and allPairwise=1).
Rich
…On 8/19/2017 11:10 PM, Alex Lancaster wrote:
The dupes are I think, because it's outputting info to the XML twice:
once for the haplotypes and LD for the specified loci (which is "*"
which resolves to the pairs), then again in all-pairwise mode for LD.
This is what I was talking about in caveat 2 in #28 (comment)
<#28 (comment)> they
should be the same in principle, so I don't know why the stats are off
so much. I'd run it several times without the all pairwise option just
specifying the |lociToEst| for any specific pair.
If we actually need the iterations from the C code, we could add it to
the list of returned values, if you can find the right location they are
generated. Although it could be a big timesink to do that, so how
important is it to have that information for now? The found a better
likelihood was debugging I put in to check the loop, check the
|Haplo.py| code there's a print statement there. In general you should
have a look at the |Haplo.py| code yourself and see if it duplicates the
logic of the R equivalent and feel free to add further debug statements
to see if it is doing what you expect.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#28 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFdJcsxmBtAr8qMaArJf05W6co4YW18Xks5sZ6OVgaJpZM4OfvJ7>.
--
=======================================
Richard M. Single, Ph.D.
Dept. of Mathematics and Statistics
University of Vermont
Burlington, VT 05405
802-656-8631, Richard.Single@uvm.edu
=======================================
|
logic is: if calling in 'testMode', we fix random number seeds for initial as well as subsequent calls of haplo_em_pin to be fixed series of integers otherwise we use random_start=0 for the first set of tries (which uses the input iseed1,2,3) and then random_start=1 for the subsequent tries (this matches the R code, which doesn't have a 'testMode')
@rsingle OK, I've implement a You can use
In theory this means that all the output should be identical from run to run, if there are differences we should track them down to make sure we're running in a completely deterministic way. |
I just did a git pull in haplo-stats, ran |
OK, try again on
I just updated unit tests a bit to check LD and ALD. I'm about to do a pull request to merge this version into master for testing and send an e-mail out. There will still be a fairly rough version needing more testing. |
I made another small adjustment to fix the unit tests, you might to do a new Two major things of the new version:
|
I'm afraid that there is still a problem with this HaploStats implementation. The bug that I thought had been addressed yesterday is back in the new master branch. When I run The more critical error is illustrated below (from the out.txt):
First off, in the .pop file, there are only 10 possible haplotypes, because there are only 5 rows of genotype data.
However, there is only one DRB2 allele, so there should only be one DRB102 More importantly, the frequency of 2~44 is 5.0074985, which is obviously an error. Here are the
Here are the
The Also, while the out.xt file suggests that haplotype counts (# copies) are going to be presented, they aren't (because they don't seem to be present in the |
Should have reopened. |
Hey Steve,
Can you open this up as a new bug? I'm closing this one because it's now integrated into master, so as a feature it's now part of PyPop, so this should be a new issue. And it's easier to manage new issues as individual discrete issues, rather than in this thread.
Thanks
Alex
Closing.
…On August 22, 2017 2:02:14 PM EDT, sjmack ***@***.***> wrote:
Should have reopened.
--
You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub:
#28 (comment)
|
Hello, |
@rbpisupati, when missing data is defined in the .ini file (the default notation is "****"), it is automatically removed by the [Emhaplofreq] module. However, missing data is not currently removed when using the [Haplostats] module. Examples are provided in issue #41. |
As an alternative to
emhaplofreq
which has limitations of number of loci and number of individuals @rsingle and I are working on thehaplo-stats
branch: https://github.com/alexlancaster/pypop/tree/haplo-stats to wrap the C modulehaplo_em_pin
for estimating haplotypes (originally part of the R packagehaplo.stats
and licensed under the GPL). @sjmack has previously testedhaplo.stats
outside the PyPop context.The text was updated successfully, but these errors were encountered: