Modify API for FeaturizeText ? #2460

abgoswam · 2019-02-07T17:02:16Z

In the MLContext API for the text featurizer, the input column names are taken as an IEnumerable<string>:

machinelearning/src/Microsoft.ML.Transforms/Text/TextCatalog.cs

Lines 43 to 46 in 834e471

    
    public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
        string outputColumnName,
        IEnumerable<string> inputColumnNames,
        TextFeaturizingEstimator.Options options)

#2394 (comment) recommends making them params instead.

Should we modify this API ?

@sfilipi

sfilipi · 2019-02-07T23:10:09Z

The TextFeaturizingEstimator was modeled like the learners. IMO it should follow the same pattern of having Options.

abgoswam · 2019-02-08T00:01:16Z

@sfilipi, just to clarify: we do follow the pattern of passing Options for the advanced options of the algorithm.

This issue was created to discuss your comment on whether we should use IEnumerable<string> for the inputColumnNames. I believe your suggestion was to use params instead of IEnumerable?

justinormont · 2019-02-08T12:04:52Z

Last I looked, we also couldn't change the ngram/chargram lengths, which is quite odd as this is the main text transform hyperparameter which shows gains on text problems.

We should be encouraging users to modify this hyperparameter.

You can see a good chart of the sensitivity to ngram/chargram lengths in #2305:

Each line/color represents a certain ngram+chargram length with the pareto frontier highlighted; the connected line varies with a sweep across iter=N. The fastest results are to the right, and the best accuracy is at the top, hence points to the top right are best.

shauheen · 2019-02-22T00:51:01Z

#838 related?

abgoswam · 2019-02-23T22:15:02Z

@shauheen, this is not related to #838. Both #838 and Justin's comment above are slightly orthogonal to the original issue.

The FeaturizeText transform estimator can take a set of column names as input, represented as IEnumerable<string> inputColumnNames in the API above. Note that this API shape came from an earlier commit, e3830910, which dealt with the input/output ordering of column names.

In one of the recent PRs, @sfilipi was curious whether we should change the API signature to take the inputColumnNames as params string[] inputColumnNames instead of IEnumerable<string> inputColumnNames. This issue was created to track that comment: #2394 (comment)

I suggest we keep the API as-is.

A params parameter must be the last parameter in a formal parameter list. So, if we make it params string[], the API would look like:

    public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
        string outputColumnName,
        TextFeaturizingEstimator.Options options,
        params string[] inputColumnNames)

Note how this API breaks the convention we follow in the other transform estimators: the outputColumnName should be immediately followed by the inputColumnNames.
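To make the trade-off concrete, here is a minimal sketch of how the two call sites would differ. FeaturizeA and FeaturizeB are hypothetical local functions, not ML.NET APIs, and a plain string stands in for TextFeaturizingEstimator.Options; each function only reports how its arguments were bound.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the two candidate signature shapes (not ML.NET code).
// A plain string is used in place of TextFeaturizingEstimator.Options.

// Shape A: inputs as IEnumerable<string>, directly after the output column.
static string FeaturizeA(string outputColumnName, IEnumerable<string> inputColumnNames, string options)
    => $"{outputColumnName} <- [{string.Join(", ", inputColumnNames)}] ({options})";

// Shape B: a params array must be last, so options moves ahead of the inputs.
static string FeaturizeB(string outputColumnName, string options, params string[] inputColumnNames)
    => $"{outputColumnName} <- [{string.Join(", ", inputColumnNames)}] ({options})";

// Shape A keeps output and inputs adjacent but needs an explicit collection.
Console.WriteLine(FeaturizeA("Features", new[] { "Title", "Body" }, "opts"));

// Shape B allows a bare argument list, but options now separates output from inputs.
Console.WriteLine(FeaturizeB("Features", "opts", "Title", "Body"));
```

Both calls bind the same values; the difference is only in argument order and in whether the caller has to build the collection explicitly.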

@sfilipi, should I close this issue?

wschin · 2019-03-01T01:40:38Z

@abgoswam's suggestion sounds reasonable to me. Making it params means we can have only one variable-length argument, which in the original comment is the input column names. But why can't output names be a param?

abgoswam · 2019-03-01T01:47:24Z

@wschin, could you give an example? I'm not sure what you mean by output names being a param.

wschin · 2019-03-01T16:55:04Z

There are one-to-one, many-to-one, and many-to-many mappings. If we use params to denote input names, why shouldn't we do the same for output names? Using params also makes our signature less typed, so I don't like it.
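A small sketch of the many-to-many case may help here. MapColumns is a hypothetical helper, not an ML.NET API; it assumes output and input names are paired by position. Because C# allows only one params array per method, and only in the last position, a signature that needs two variable-length lists has to fall back to explicit arrays for both.

```csharp
using System;

// Hypothetical many-to-many helper (not an ML.NET API):
// pairs each output name with the input name at the same position.
static string[] MapColumns(string[] outputColumnNames, string[] inputColumnNames)
{
    if (outputColumnNames.Length != inputColumnNames.Length)
        throw new ArgumentException("Expected one output name per input name.");

    var pairs = new string[outputColumnNames.Length];
    for (int i = 0; i < pairs.Length; i++)
        pairs[i] = $"{inputColumnNames[i]} -> {outputColumnNames[i]}";
    return pairs;
}

// Not legal C#: a method may declare only one params array, and it must be last.
// static string[] Bad(params string[] outputs, params string[] inputs) { ... }

foreach (var pair in MapColumns(new[] { "TitleFeats", "BodyFeats" }, new[] { "Title", "Body" }))
    Console.WriteLine(pair);
```

With explicit arrays the caller always states which list is which, whereas a bare params argument list relies on position alone to separate the output name from the inputs.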

abgoswam mentioned this issue Feb 7, 2019

Creation of components through MLContext and cleanup (text transform) #2394

Merged

abgoswam added the API (Issues pertaining the friendly API) label Feb 7, 2019

Ivanidzo4ka mentioned this issue Feb 28, 2019

TextFeaturizer API is non-standard #2801

Closed

wschin self-assigned this Mar 1, 2019

wschin mentioned this issue Mar 1, 2019

[Tiny] Use string[] instead of IEnumerable<string> in column names #2815

Merged

shauheen added this to the 0319 milestone Mar 4, 2019

wschin closed this as completed in #2815 Mar 5, 2019

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
