CHANGES


Bugs? Ideas? Need for improvement? 
soeding@mpibpc.mpg.de

_______________________________________________________________________________

HH-suite 2.1.0  March 2013

* Introduced a new format for hhblits databases: the prefilter flat files 
  *.cs219 containing the column-state sequences for prefiltering are replaced
  by a pair of ffindex files. This leads to a speed up for reading the prefilter 
  db by ~4 seconds, improved scaling of hhblits with the number of cores, and
  lower memory requirements. For compatibility with older versions of HH-suite, 
  the *.cs219 files will still be provided with new db versions.

* Improved sensitivity and alignment accuracy through introduction of a new,
  discriminative method to calculate pseudocounts, described in Angermuller C, 
  Biegert A, and Soding J, Bioinformatics 28: 3240-3247 (2012).

* Increased speed of hhsearch about 8-fold and of HHblits about 1.5-fold by 
  implementing the HMM-HMM Viterbi alignment in SSE2 SIMD (single instruction, 
  multiple data) instructions.

* Added option -macins to hhblits, hhsearch and hhalign that controls the
  costs of internal gap positions in the Maximum Accuracy Algorithm. 
  0: gap cost is half the mact value (default) 1: no gap cost, leads to very 
  gappy alignments.

* Reduced memory requirements (in Viterbi algorithm by a factor 3,
  in MAC algorithm by a factor 4, in total by a factor >5), improved -maxmem
  option to more accurately reflect total memory required 

* Added section to hhsuite user guide on 
  - how to efficiently run hhblits, in particular on a cluster
  - how to modify or extend existing databases
  - how to visually check multiple sequence alignments
  - troubleshooting

_______________________________________________________________________________

HH-suite 2.0.16  February 2013

* Added options to hhblitsdb.pl for adding (-a) and removing (-r) files 
  from HH-suite database. Updated help list and fixed stdin/stdout handling.

* Added options -i stdin and -o stdout to cstranslate to read from stdin and 
  write to stdout.

* hhblits: Fixed -i stdin, -oa3m stdout and -v 0.

* hhsearch/hhalign: Fixed small bug that removed all sequences from output 
  alignment when trying to use append.

* Set default number of threads to 2 for HHblits. (Before, prefiltering was 
  done with 2 threads, but HMM-HMM comparison with only 1.)

* Improved error messages for memory problems

* Removed bug in hhblits that would skip realignment of first three hits if 
  their E-value was above the one given by -e <Evalue>
  
* Removed bug in filtering with -neff option that led to a single sequence 
  if -neff target value was within 0.01 of input MSA Neff value; implemented
  speed up for -neff filtering. 

* Removed bug that led to occasional segmentation faults with the -atab option


_______________________________________________________________________________

HH-suite 2.0.15  June 2012

* Fixed a bug in the counting of the number of sequences found.

* Removed a regression in -nodiff option: sequences in query MSA were filtered by
  normal, active filters such as -diff and -id. Only db MSAs were not filtered.

* Fix hhalign regression introduced in 2.0.14.

* Removed a regression that read sequences from query HHM or <query>.a3m even 
  when a sequence file query.seq was given.

* Renamed option -nodiff to -all. The name -nodiff is very misleading as
  the diff option is not actually switched off on the query and db alignments.

* Improved the filtering section of the help lists

* Fixed problem for small -e <E-value> thresholds by declaring par.e as double

* Refactored some of the code in hhblits.


_______________________________________________________________________________

HH-suite 2.0.14  May 2012

* All error messages are now written to stderr (not stdout) and are of the form
  "Error in <program_name>: <error description>"

* Improved error handling in hhblitsdb.pl, addss.pl, multithread.pl

* Added more flexible format recognition in cstranslate binary which is called
  in hhblitsdb.pl

* Added -nodiff and -diff to standard help 

* In hhblitsdb.pl we force the MSA files to have extension a3m now. Same for 
  hhm (HHsuite model) and hmm (HMMER model) files. 

_______________________________________________________________________________

HH-suite 2.0.13  February 2012

* Added script splitfasta.pl to split a multiple-sequence FASTA file into 
  multiple single-sequence FASTA files. These are needed to run hhblits in 
  parallel using multithread.pl.

* Implemented -maxmem option specifying maximum memory in GB. This changes the 
  maximum length, up to which database sequences will be realigned using the
  memory-hungry maximum accuracy algorithm. This change will lead to more accurate
  results for longer HMMs (>5000 match states).

* More sensible memory size defaults in init_prefilter(); this fixes a segfault

* Improved error messages if databases are too big (e.g. >2GB on 32bit systems).

* Made code segfault-safe when name lengths and other strings exceed usual
  sizes. Now, all strings should be cut at maximum length. E.g., name lengths 
  of sequences and HMMs at >=512 characters (const int NAMELEN in util.C)

* Merged some Makefile changes for Debian from Laszlo Kajan

* Corrected calculation of "Identities" for pairwise query-template alignments. 
  Now, gaps '-' in aligned match states of query and template master sequences 
  are no longer counted as identical residues. (hhfullalignment.C, line 182)

* Added detailed description in user guide of how Identities, Similarity, 
  and Sum_probs are calculated

* Fixed bug in hhblitsdb.pl (printf on line 272) introduced in 2.0.12


_______________________________________________________________________________

HH-suite 2.0.12  February 2012

* Reworked SSE flags: Support compilation without SSE3 (make NO_SSE3=1), without 
  SSE2 (make NO_SSE2=1), with SSE4.1 (make WITH_SSE41=1), and with all other 
  instructions available on the compilation host (make WITH_NATIVE=1).

* hhblitsdb.pl: The requirement for identical file names and names of master 
  sequences is dropped. Violation of this requirement manifested itself as 
  file-not-found error during Viterbi-stage of hhblits. The code was streamlined.

* Added progress dots ... for prefiltering stage

* Cleaned up output for -v 1 and -v 0 options in hhblits and hhsearch: -v 1 
  should print out only warnings and errors

* Fixed segfault occurring with very long sequence names 

* Minor improvements of documentation on installation

_______________________________________________________________________________

HH-suite 2.0.11  February 2012

* Corrected wrong paths for calls of ffindex binaries in hhblitsdb.pl

* Eliminated rare segfault in realignment step that occurred in particular with
  repeat proteins as queries

* Minor improvements of documentation on installation, HHblits databases, FAQs 
  etc.

_______________________________________________________________________________

HH-suite 2.0.2  January 2012

* The iterative HMM-HMM search method HHblits has been added and the entire 
  package is now called HH-suite. HHblits brings the power of HMM-HMM comparison 
  to mainstream, general-purpose sequence searching and sequence analysis. 

* The speed of HHsearch was further increased through the use of SSE3 
  instructions.

* A local amino acid compositional bias correction was introduced. Improvements 
  are slight (<=1%) on a standard SCOP single domain benchmark. However, the 
  improvement for more realistic sequences containing multiple domains, repeats, 
  and regions of strong compositional bias is probably more pronounced. 
  The score shift parameter was changed from -0.01 to -0.03 bits and the mact 
  parameter from 0.5 to 0.35. 

* The memory requirement for HMM-HMM alignment in hhsearch, hhblits, and hhalign
  was reduced by a factor of ~2 by a better implementation of the Forward-
  Backward algorithm

* Removed hhalign's -sto option for generating stochastic alignments. Sorry!
  A better, deterministic replacement is planned for the future.

_______________________________________________________________________________


HHsearch 1.6.1 April 2011

* Increased speed by using SSE3 instructions in some functions

* Added new option "-atab" for writing alignment as a table (with posteriors) 
  to file

* HHsearch is now able to read HMMER3 profiles (but should not be used due to a 
  loss of sensitivity)

* Removed a bug related to "neutralize HIS-tag" option

* Removed a bug related to empty alignments (no match states)

_______________________________________________________________________________

HHsearch 1.6.0 November 2010

* A new procedure for estimation of P- and E-values has been implemented that
  circumvents the need to calibrate HMMs. Calibration can still be done if 
  desired. By default, however, HHsearch now estimates the lambda and mu 
  parameters of the extreme value distribution (EVD) for each pair of query 
  and database HMMs from the lengths of both HMMs and the diversities of their 
  underlying alignments. Apart from saving the time for calibration, this 
  procedure is more reliable and noise-resistant. This change only applies to 
  the default local search mode. For global searches, nothing has changed. Note
  that E-values in global search mode are unreliable and that sensitivity is 
  reduced. Old calibrations can still be used:
   -calm 0 : use empirical query HMM calibration (old default)
   -calm 1 : use empirical db HMM calibration
   -calm 2 : use both query and db HMM calibration
   -calm 3 : use neural network calibration (new default)

* Previous versions of HHsearch sometimes showed non-homologous hits with high 
  probabilities by matching long stretches of secondary structure states, 
  in particular long helices, in the absence of any similarity in the amino 
  acid profiles. Capping the SS score by a linear function of the profile score
  now effectively suppresses these spurious high-scoring false positives. 

* The output format for the query-template alignments has slightly changed.
  A 'Confidence' line at the bottom of each alignment block now reports the 
  posterior probabilities for each alignment column when the -realign option
  is active (which it is by default). These probabilities are calculated in the 
  Forward-Backward algorithm that is needed as input for the Maximum ACcuracy 
  alignment algorithm. Also, the lines 'ss_conf' with the confidence values 
  for the secondary structure prediction are omitted by default. (They can  
  be displayed with option '-showssconf'). To compensate, secondary structure 
  predictions with confidence values between 7 and 9 are given in capital 
  letters, while for the predictions with values between 0 and 6 lower-case 
  letters are used. 
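
  The capitalization rule above can be sketched as follows. This is only an
  illustration, not the HHsearch code: the function name format_ss_prediction
  and the sample strings are hypothetical, and only the 7-9 / 0-6 split comes 
  from the entry above.

```python
def format_ss_prediction(states, confidences):
    """Upper-case states with confidence 7-9, lower-case states with 0-6.

    states:      string of secondary structure states, e.g. "CHHE"
    confidences: string of digits 0-9, one per state
    """
    if len(states) != len(confidences):
        raise ValueError("states and confidences must have the same length")
    out = []
    for state, conf in zip(states, confidences):
        # Threshold taken from the changelog entry: 7..9 -> capital letter
        out.append(state.upper() if int(conf) >= 7 else state.lower())
    return "".join(out)


# Hypothetical input, for illustration only
print(format_ss_prediction("CHHHHEEC", "97865390"))
```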

* In the hhsearch output file in the header lines before each query-database 
  alignment, the substitution matrix score (without gap penalties) of the query 
  with the database sequence is now reported in bits per column. Also, the sum
  of probabilities for each pair of aligned residues from the MAC algorithm 
  is reported here (0 if no MAC alignment is performed).

* The buildali.pl script now uses context-specific iterative BLAST (CSI-BLAST) 
  instead of PSI-BLAST. This considerably increases the sensitivity of 
  buildali.pl/HHsearch. 

* Removed a bug which produced a segfault for input alignments with more than
  15000 match columns. Now, the HHsearch binaries will issue a warning and 
  will transform only the first 15000 match columns into an HMM.

* Removed a bug in the multi-threading code that could lead to occasional 
  hang-ups (race condition) in situations where slow file access was impeding 
  program execution and inter-thread signaling was unreliable.

* Removed a memory leak and optimized memory management.

* Removed a bug in hhalign that could lead to unreasonably significant E-values 
  and probabilities due to calibration problems.

* HHsearch now performs the MAC realignment only around the Viterbi hit


_______________________________________________________________________________


HHsearch 1.5.0 August 2007


* By default, HHsearch realigns all displayed alignments in a second stage 
  using the more accurate Maximum Accuracy (MAC) alignment algorithm 
  (Durbin, Eddy, Krogh, Mitchison: Biological sequence analysis, page 95;
  HMM-HMM version: J. Soeding, unpublished). As before, the Viterbi algorithm 
  is employed for searching and ranking the matches. The realignment step is 
  parallelized (`-cpu <int>`) and typically takes a few seconds only. You can 
  switch off the MAC realignment with the -norealign option. 
     The posterior probability threshold is controlled with the -mact [0,1[ 
  option. This parameter controls the alignment algorithm's greediness. More 
  precisely, the MAC algorithm finds the alignment that maximizes the sum of 
  posterior probabilities minus mact for each aligned pair. Global alignments 
  are generated with -mact 0, whereas -mact 0.5 will produce quite conservative
  local alignments. Default value is -mact 0.35, which produces alignments of
  roughly the same length as the Viterbi algorithm. 
     The -global and -local (default) option now refer to both the Viterbi 
  search stage as well as the MAC realignment stage. With -global (-local), the 
  posterior probability matrix will be calculated for global (local) 
  alignment. Note that '-local -mact 0' will produce global alignments from
  a local posterior probability matrix (which is not at all unreasonable).
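
  The role of mact can be illustrated with a small dynamic programming sketch.
  It is a simplified stand-in, not the HHsearch implementation: it assumes the 
  posterior probability matrix is already given, charges nothing for gaps, and 
  uses the hypothetical function name mac_align. It returns the set of aligned 
  pairs that maximizes the sum of (posterior - mact).

```python
def mac_align(post, mact=0.35):
    """Return (score, pairs) maximizing sum(post[i][j] - mact) over aligned pairs.

    post: matrix of posterior probabilities, post[i][j] for query column i
          and template column j (assumed to be precomputed elsewhere).
    Gaps are free in this sketch, which is an assumption of the example.
    """
    n, m = len(post), len(post[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + post[i - 1][j - 1] - mact, "diag"),
                (score[i - 1][j], "up"),      # skip query column i
                (score[i][j - 1], "left"),    # skip template column j
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to collect the aligned pairs
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical 3x3 posterior matrix, for illustration only
example = [[0.9, 0.1, 0.0],
           [0.1, 0.8, 0.1],
           [0.0, 0.1, 0.2]]
print(mac_align(example, mact=0.35))  # weak pair (2, 2) is left unaligned
print(mac_align(example, mact=0.0))   # every pair contributes >= 0: global-like
```

  With mact 0 no aligned pair can lower the score, so the path extends over the
  whole matrix; raising mact drops pairs whose posterior falls below it.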

* An amino acid compositional bias correction is now performed by default.
  This increases the sensitivity by 25% at 0.01 errors per query and by 5% at 
  0.1 errors per query. By recalibrating the Probabilities, the increased 
  selectivity of this new version allows higher probabilities to be given for 
  the same P-values. Also, the score offset could be increased from -0.1 bits 
  to 0 as a consequence. 

* The algorithm that filters the set of the most diverse sequences (option 
  -diff) has been improved. Before, it determined the set of the N most 
  diverse sequences. In the case of multi-domain alignments, this could lead 
  to severely underrepresented regions. E.g. when the first domain is only 
  covered by a few fairly similar sequences and the second by hundreds of very 
  diverse ones, most or all of the similar ones were removed. The '-diff N' 
  option now filters the most diverse set of sequences, keeping at least N 
  sequences in each block of >50 columns. This generally leads to a total 
  number of sequences that is larger than N. Speed is similar. The default
  is '-diff 100' for hhmake and hhsearch. Use -diff 0 to switch this filter 
  off.

* The sensitivity for the -global alignment option has been significantly 
  increased by a more robust statistical treatment. The sensitivity in -global
  mode is now only 0-10% lower than for the default -local option on a SCOP
  benchmark, i.e. when the query or the templates represent single structural 
  domains. The E-values are now more realistic, although still not as 
  reliable as for -local. The Probabilities were recalibrated.

* A new binary HHalign has been added. It is similar to hhsearch, but performs
  only pairwise comparisons. It can produce dot plots, tables of aligned 
  residues, and it can sample alternative alignments stochastically. It uses 
  the MAC algorithm by default. 

* HHsearch and hhalign can generate query-template multiple alignments in 
  FASTA, A2M, or A3M format with the -ofas, -oa2m, -oa3m options 

* Return values were changed to comply with convention that 0 means no errors:
   0: Finished successfully
   1: Format error in input files
   2: File access error
   3: Out of memory 
   4: Syntax error on command line
   6: Internal logic error (please report)
   7: Internal numeric error (please report)
   8: Other
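
  A wrapper script could translate these return values back into messages, for
  example as sketched below. The names HH_RETURN_CODES and describe_return_code 
  are hypothetical; the values and texts are copied from the list above (which 
  does not define a value 5).

```python
# Return values as listed in the changelog entry above
HH_RETURN_CODES = {
    0: "Finished successfully",
    1: "Format error in input files",
    2: "File access error",
    3: "Out of memory",
    4: "Syntax error on command line",
    6: "Internal logic error (please report)",
    7: "Internal numeric error (please report)",
    8: "Other",
}


def describe_return_code(code):
    """Map a return value to its description; unlisted values are flagged."""
    return HH_RETURN_CODES.get(code, "Unlisted return value %d" % code)


for code in (0, 3, 5):
    print(code, describe_return_code(code))
```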
	
* Added script buildali.pl <file> to automatically build PSI-BLAST multiple 
  sequence alignments, including predicted and DSSP secondary structure. 
  Buildali.pl prunes sequences individually from both ends to reduce corruption 
  by non-homologous fragments (J. Soeding, unpublished).

* Added script hhmakemodel.pl <file.hhr> that parses hhsearch results files 
  and can generate FASTA or PIR multiple alignments or build rough 3D models.

* Removed a bug due to which pseudocounts were added to HMMer HMMs (which 
  already have their own pseudocounts added). This bug severely reduced 
  sensitivity for HMMs read in HMMer format.

* Moved a lot of memory allocation from stack to heap to avoid segmentation 
  faults under some Windows systems.

* Removed a bug due to which the query-template alignments were not displayed
  on some platforms when output was directed to stdout

* Removed a bug that caused occasional segfaults under SunOS when reading HMMer 
  files

* Added multi-threading (-cpu <int>) for Windows x86 platform

* Cleaned up output formatting of summary list for Windows x86

* Stopped support for the Alpha/DEC platform 

* Anyone still interested in Mac OSX/PPC or SunOS support?


_______________________________________________________________________________


HHsearch 1.2.0 January 2006
 
* Parallelized hhsearch for SMPs: using option -cpu <int>, hhsearch can be run 
in parallel on several CPUs of a symmetric multiprocessor (SMP) machine. Speed 
with two CPUs is ~1.9x, with two Dual Core AMDs ~3.4x. 
The memory requirement for the dynamic programming matrix is now 
~100kb*(Query_length+2) * number of queue bins. 
Number of queue bins = floor(1.2*CPUs+1) (for cpu>1), 1 otherwise
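
The estimate above can be written out as a short calculation. This is a sketch
of the stated formula only; the function names queue_bins and dp_memory_kb are
hypothetical, and the unit is kept as "kb" exactly as written in the entry.

```python
def queue_bins(cpus):
    """Number of queue bins: floor(1.2*CPUs + 1) for cpus > 1, otherwise 1."""
    if cpus > 1:
        # Integer form of floor(1.2*cpus + 1), avoiding floating-point rounding
        return (12 * cpus + 10) // 10
    return 1


def dp_memory_kb(query_length, cpus):
    """Approximate memory: ~100 kb * (Query_length + 2) * number of queue bins."""
    return 100 * (query_length + 2) * queue_bins(cpus)


# Hypothetical query of 300 residues
for cpus in (1, 2, 4):
    print(cpus, queue_bins(cpus), dp_memory_kb(300, cpus))
```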

* HHsearch can now automatically detect and read HMMER format (ASCII). 
Predicted secondary structure (or consensus secondary structure) can be 
used in HMMER format in the same way as in HHsearch format. The script
addpsipred.pl is able to add the information for predicted secondary structure
to a HMMER file (as well as to a FASTA alignment). However, representative 
sequences for query-template alignments are not supported by HMMER's format. 
Performance is not as good with HMMER format as with hhm format, so use hhsearch's 
hhm format as far as possible.

* Multiple databases can now be searched with -d 'db1.hhm db2.hhm ...' 

* Added option -alt <int> for displaying up to <int> alternative alignments 
that have no residue pairs in common (e.g. when aligning repeat proteins)

* Added percent sequence identity to alignment information 

* Increased speed of hhsearch database searching by ~15% 

* Added possibility to read from standard input and write to standard output

* Removed the limit 30 000 on maximum number of HMMs in database.

* Increased maximum number of match states in HMM from 6000 to 15000.

* Checked code sanity with Valgrind (a great tool! http://valgrind.org)
and removed a bug that was responsible for rare segmentation faults.

* Removed a bug in the maximum sequence identity filtering routine. Before, 
when two sequences had no overlap, the shorter of the two was filtered out.
This could have affected HMMs for multidomain proteins quite badly.

* Changed options for minimum and maximum numbers of sequences in output file 
(options -p, -E, -B, -b, -Z, -z)

* Improved sensitivity somewhat (~10%) by using option '-diff 100' as default 
for hhmake and hhsearch (use '-diff 0' to turn this option off). Optimized 
internal pseudocount admixture parameter with -diff 100  setting. Results 
between HHsearch 1.2 and earlier versions may therefore differ slightly.

* Perl scripts: Added -Q option to alignblast.pl that allows the query sequence 
to be included as first sequence of extracted alignment. Script addpsipred.pl can 
now add PSIPRED-predicted secondary structure also to HMMER model files.

* Included binaries for Windows32 (a bit slow, no threads supported)

_______________________________________________________________________________


HHsearch 1.1.4 March 2005
 
* Removed small memory allocation bug in function that reads .hhdefaults file.
* Reserve dynamic programming memory dynamically. Memory requirement is now 
6*MAXRES*(Query_length+2) bytes (+ space for ALL sequences in SEQ records of 
HMM database).
* Changed maximum number of sequences in input alignments from 20 000 to 30 000
* Changed maximum length of sequences in input alignments from 20 000 to 30 000
* Changed maximum number of match states MAXRES in input alignments from 5000 
to 6000
* In output alignments, changed the name of the consensus sequences from
'Cons-<id>' to 'Consensus'.
* Added secondary structure-dependent gap penalties ('-ssgap' option).
Gap opening within SS elements costs X bits after 1st, 2X bits after 2nd and 
3X bits after third SS residue, where X is 1.0 bit by default and can be 
changed with the '-ssgapd X' option. 
  => A small benchmark on the CAFASP4 targets showed NO improvement! 
* Included binaries for MacOSX (Darwin) and OSF1/Tru64_UNIX (thanks to Paul 
Sarokas, GSK) 
_______________________________________________________________________________


HHsearch 1.1.3 November 2004

* Allowed up to 10000 sequences to be displayed in output alignments
* Added warning message when not enough memory available
* Corrected warning message when not enough different superfamilies are 
contained in the searched database to fit the score distribution.
* Removed a bug in the part that builds the output alignments that could cause
a segmentation fault on rare occasions.
* Remove carriage return (CR) symbols in names of input sequences that 
could mess up output alignments when sequences were read in via a web form. 
* Corrected a bug in the calculation of transition probabilities for insert 
states. (Improvement in sensitivity by ~2%.) Need to recalculate old HMMs.
* Changed format of hitlist in output. Now, the first 30 characters of the 
sequence name and description are shown, not only the ID and the family code: 
(No Hit Prob E-value...). This is more suitable for databases other than SCOP 
* Removed limit of 200 characters for sequence descriptions to be able to 
display the extensive annotation for Pfam and SMART alignments, for instance.
* In the output alignment the numbering of the consensus sequence did not 
correspond to the match columns anymore. This has been corrected.
_______________________________________________________________________________


HHsearch 1.1 August 2004

* Format of *.hhm files was changed. Transition pseudocounts are now 
calculated on the basis of the effective number of sequences that are in
M states, I states and D states, respectively. These numbers are listed
as Neff, Neff_I,  Neff_D in separate columns of the hhm files. (Improvement
in sensitivity by a few percent.)
* Reduced memory requirements by a factor of four. Memory requirement now is 
approx. 6*MAXRES^2 bytes ( + space for ALL sequences in SEQ records of HMM 
database). MAXRES is maximum number of query residues, 5000 at the moment

_______________________________________________________________________________


HHsearch 1.0 May 2004

* Version used for benchmark published in Soding J, Bioinformatics (2005).