feat: Add VowpalWabbit ngram support #696

Merged
merged 10 commits into from
Oct 9, 2019

eisber
Collaborator

@eisber eisber commented Sep 23, 2019

Vowpal Wabbit supports on-the-fly ngram generation, so the featurized dataset does not need to contain the ngrams and will not blow up in memory. A minor complication is that the features are already hashed in the SparseVector and, unfortunately, the Spark SparseVector implementation always sorts them by index, which loses the original token order that ngram generation depends on.

To overcome this, the sparse indices are prefixed with a counter in the highest-order bits. The number of bits for the counter is configurable, but together with the hashing bits it needs to fit within 30 bits (yeah, signed integers). The prefix counter is stripped by the VW JNI layer, as the bit mask is always re-applied.
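
The prefixing idea can be sketched in a few lines of Python. This is only an illustration of the bit layout described above, not the code in this PR: the function names, the default bit widths (18 hash bits, 12 prefix bits), and the choice to raise an error when there are more features than the counter can hold are all assumptions made for the example.

```python
def prefix_indices(hashes, hash_bits=18, prefix_bits=12):
    """Tag each hashed index with its position in the high-order bits.

    Hypothetical helper: names and defaults are illustrative only.
    """
    if hash_bits + prefix_bits > 30:
        raise ValueError("hash_bits + prefix_bits must fit within 30 bits")
    if len(hashes) > (1 << prefix_bits):
        # Assumption for this sketch: refuse input the counter cannot number.
        raise ValueError("too many features for the configured prefix bits")
    mask = (1 << hash_bits) - 1
    # Position goes above the hash bits, so numeric order equals token order.
    return [(pos << hash_bits) | (h & mask) for pos, h in enumerate(hashes)]


def strip_prefix(indices, hash_bits=18):
    """Recover the hashed indices by re-applying the hash bit mask."""
    mask = (1 << hash_bits) - 1
    return [i & mask for i in indices]


# Hashed indices in token order; a plain sort would reorder them.
token_hashes = [200000, 5, 99]
prefixed = prefix_indices(token_hashes)

# Sorting (as a sparse vector would) now keeps the original token order.
assert sorted(prefixed) == prefixed
assert strip_prefix(sorted(prefixed)) == token_hashes
```

Because the counter occupies the bits above the hash, sorting by index becomes sorting by position, and masking with the hash bits afterwards discards the counter without touching the hash value.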

@eisber
Collaborator Author

eisber commented Sep 23, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@eisber eisber changed the title from "added VowpalWabbit ngram support" to "Add VowpalWabbit ngram support" Sep 23, 2019
@codecov

codecov bot commented Sep 23, 2019

Codecov Report

Merging #696 into master will increase coverage by 0.04%.
The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master     #696      +/-   ##
==========================================
+ Coverage   79.99%   80.03%   +0.04%     
==========================================
  Files         229      229              
  Lines        9118     9142      +24     
  Branches      469      470       +1     
==========================================
+ Hits         7294     7317      +23     
- Misses       1824     1825       +1

| Impacted Files | Coverage Δ |
|---|---|
| ...microsoft/ml/spark/vw/VowpalWabbitFeaturizer.scala | 87.5% <90.9%> (+1.56%) ⬆️ |
| ...osoft/ml/spark/io/http/PartitionConsolidator.scala | 95.55% <0%> (+2.22%) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a44dafd...fca470f. Read the comment docs.

@eisber eisber changed the title from "Add VowpalWabbit ngram support" to "feat: Add VowpalWabbit ngram support" Sep 23, 2019
@eisber
Collaborator Author

eisber commented Sep 30, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@eisber
Collaborator Author

eisber commented Oct 8, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@eisber
Collaborator Author

eisber commented Oct 8, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

mhamilton723
mhamilton723 previously approved these changes Oct 8, 2019
Collaborator

@mhamilton723 mhamilton723 left a comment

Approve with nits

@eisber
Collaborator Author

eisber commented Oct 8, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@eisber
Collaborator Author

eisber commented Oct 9, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@eisber
Collaborator Author

eisber commented Oct 9, 2019

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit da124d7 into microsoft:master Oct 9, 2019
2 participants