Enable multi-source input in marian-server #505
Closed
Conversation
…st and all features that
This PR allows Marian to be built on Windows without Boost, since Boost 1.66 causes build errors under Flo. With this PR, if the preprocessor symbol `NO_BOOST` is defined, Boost is not included anywhere, and configurations that rely on it cannot be used. The long-term goal is to remove Boost completely; this `#define` makes it easy to locate all relevant locations. Boost is currently used for two things:
* the timer in the AutoTuner
* shuffling the corpus to a temp file (i.e. not --shuffle-in-ram)
With `NO_BOOST`, attempting to shuffle to file terminates the program, and the auto-tuner selects the first algorithm. Not tested on Linux.
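The guard described above can be sketched as follows. This is an illustrative example of gating a Boost-dependent feature behind `NO_BOOST`, not the actual Marian code; the function name `fileShufflingSupported` is hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>

#ifndef NO_BOOST
// Boost headers would only be pulled in when NO_BOOST is not defined, e.g.:
// #include <boost/timer/timer.hpp>
#endif

// Hypothetical helper: reports whether shuffling to a temp file is
// available in this build (it relies on Boost's filesystem/iostreams).
bool fileShufflingSupported() {
#ifdef NO_BOOST
  return false;
#else
  return true;
#endif
}

void shuffleCorpusToTempFile() {
  if (!fileShufflingSupported()) {
    // Matches the behavior described in the PR: abort instead of
    // silently misbehaving when the Boost-based path is compiled out.
    std::fprintf(stderr,
                 "Shuffling to a temp file requires Boost; "
                 "use --shuffle-in-ram instead\n");
    std::abort();
  }
  // ... Boost-based shuffling implementation ...
}
```

In a build without `-DNO_BOOST`, `fileShufflingSupported()` returns `true` and the Boost path is taken.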
… zstr::ifstream. Optional gzip filter
…in/max quantization to avoid overflow
1. Change the weight matrix quantization to use 7-bit min/max quantization. This resolves the overflow issues, because weights and activations are quantized by their min/max range.
2. Clip fp16 quantization to avoid overflow.
3. Fix Windows build errors (cmake options, vcproj file).
4. int8 pack model (encoder -> fp16).
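A minimal sketch of the idea in item 1: map each tensor onto a signed 7-bit range derived from its max absolute value, leaving headroom in int8 products. The struct and function names are hypothetical, not Marian's actual quantization API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for a quantized tensor.
struct Quantized {
  std::vector<int8_t> data;
  float scale;  // multiply an element by this to dequantize
};

// Quantize to a 7-bit signed range [-63, 63] based on the min/max
// (here: max-absolute) range of the tensor. Using 7 instead of 8 bits
// leaves one bit of headroom against overflow in int8 arithmetic.
Quantized quantize7bit(const std::vector<float>& w) {
  float maxAbs = 0.f;
  for (float v : w)
    maxAbs = std::max(maxAbs, std::fabs(v));
  const float range = 63.f;  // 7-bit signed magnitude
  Quantized q;
  q.scale = maxAbs > 0.f ? maxAbs / range : 1.f;
  q.data.reserve(w.size());
  for (float v : w)
    q.data.push_back(static_cast<int8_t>(std::lround(v / q.scale)));
  return q;
}
```

Dequantizing the largest element (`data[i] * scale`) recovers the original `maxAbs`, since that element maps exactly to 63.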
…encoders as well For the int8 quantized model, use int8 quantization for the encoders as well. The quality difference between an fp16 encoder and an int8 encoder is small, but there is a considerable speed difference.
* Add basic support for TSV inputs
* Fix mini-batch-fit for TSV inputs
* Abort if shuffling data from stdin
* Fix terminating training with data from STDIN
* Allow creating vocabs from TSV files
* Add comments; clean creation of vocabs from TSV files
* Guess --tsv-size based on the model type
* Add shortcut for STDIN inputs
* Rename --tsv-size to --tsv-fields
* Allow only one 'stdin' in --train-sets
* Properly create separate vocabularies from a TSV file
* Clearer logging message
* Add error message for wrong number of valid sets if --tsv is used
* Use --no-shuffle instead of --shuffle in the error message
* Fix continuing training from STDIN
* Update CHANGELOG
* Support both 'stdin' and '-'
* Guess --tsv-fields from dim-vocabs if special:model.yml available
* Update error messages
* Move variable outside the loop
* Refactorize utils::splitTsv; add unit tests
* Support '-' as stdin; refactorize; add comments
* Abort if excessive field(s) in the TSV input
* Add a TODO on passing one vocab with fully-tied embeddings
* Remove the unit test with excessive tab-separated fields
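The TSV handling above can be sketched as a small helper in the spirit of `utils::splitTsv`: split a line on tabs and reject input with more fields than expected. This is a hypothetical illustration, not Marian's actual implementation.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Split a TSV line into fields and abort (here: throw) if the line has
// more fields than the expected numFields, matching the "abort if
// excessive field(s) in the TSV input" behavior described above.
std::vector<std::string> splitTsv(const std::string& line, size_t numFields) {
  std::vector<std::string> fields;
  size_t begin = 0, end = 0;
  while ((end = line.find('\t', begin)) != std::string::npos) {
    fields.push_back(line.substr(begin, end - begin));
    begin = end + 1;
  }
  fields.push_back(line.substr(begin));  // last (or only) field
  if (fields.size() > numFields)
    throw std::runtime_error("Excessive field(s) in the TSV input");
  return fields;
}
```

For example, a three-field line such as `src1<TAB>src2<TAB>tgt` splits into three strings when called with `numFields = 3`.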
* Fix 0 * NaN behavior in concatenation
* Bump patch
* Change epsilon to margin
* Refactorize processPaths
* Fix relative paths for shortlist and sqlite options
* Rename InterpolateEnvVars to interpolateEnvVars
* Update CHANGELOG
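Environment-variable interpolation in paths, as in `interpolateEnvVars`, can be sketched like this. The implementation below is a simplified assumption (it only handles the `${VAR}` form), not the function's actual code.

```cpp
#include <cstdlib>
#include <string>

// Replace each ${VAR} occurrence in a path with the value of the
// environment variable VAR (empty string if the variable is unset).
std::string interpolateEnvVars(std::string path) {
  size_t pos;
  while ((pos = path.find("${")) != std::string::npos) {
    size_t close = path.find('}', pos);
    if (close == std::string::npos)
      break;  // unterminated ${...}: leave the rest untouched
    std::string name = path.substr(pos + 2, close - pos - 2);
    const char* val = std::getenv(name.c_str());
    path.replace(pos, close - pos + 1, val ? val : "");
  }
  return path;
}
```

So a config value like `${MODEL_DIR}/model.npz` expands against the current environment before being used as a file path.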
…anch Cherry pick a few improvements/fixes from Frank's branch
* Adds Frank's fix for label-based mini-batch sizing from his current experimental branch.
* Also copies minor improvements and a few comments.
I will revisit this soon after merging #617.
* Fix server build with current boost, move simple-websocket-server to submodule
* Change submodule to marian-nmt/Simple-WebSocket-Server
* Update submodule simple-websocket-server

Co-authored-by: Gleb Tv <glebtv@gmail.com>
It seems that GitHub got confused after merging with the current master. I will open a new pull request that replaces this one.
At the moment, marian-server cannot handle multi-source models. This pull request aims to fix that. The solution assumes that the source inputs are separated by `\t`. Additionally, I noticed that marian-server does not follow the `max-length` limit; I fixed that too.
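The two behaviors described in the PR can be sketched as follows: split a request line on `\t` into one segment per source input, then truncate each segment to at most `maxLength` whitespace-separated tokens. The function `parseMultiSource` and its signature are hypothetical illustrations, not marian-server's actual API.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a tab-separated multi-source request into individual source
// sentences, truncating each to at most maxLength tokens.
std::vector<std::string> parseMultiSource(const std::string& line,
                                          size_t maxLength) {
  std::vector<std::string> sources;
  std::stringstream ss(line);
  std::string segment;
  while (std::getline(ss, segment, '\t')) {   // one segment per source input
    std::istringstream tok(segment);
    std::string word, truncated;
    size_t n = 0;
    while (tok >> word && n++ < maxLength) {  // enforce max-length in tokens
      if (!truncated.empty())
        truncated += ' ';
      truncated += word;
    }
    sources.push_back(truncated);
  }
  return sources;
}
```

For example, with `maxLength = 2`, the request `"a b c\td e"` yields the two source inputs `"a b"` and `"d e"`: the first segment is truncated to two tokens.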