K-Skip-N-Gram
K-skip-n-grams are a variation on n-grams. In addition to forming n-grams from sequences of adjacent words, the tokenizer skips over the next k words and forms n-grams from the resulting forward-looking sequences. Each token it outputs contains between `min` and `max` words.
Parameters
| # | Name | Default | Type | Description | 
|---|---|---|---|---|
| 1 | min | 2 | int | The minimum number of words in a single token. | 
| 2 | max | 2 | int | The maximum number of words in a single token. | 
| 3 | skip | 2 | int | The number of words to skip over to form new sequences. | 
Example
```php
use Rubix\ML\Tokenizers\KSkipNGram;

// Tokens of 2 to 3 words each, with a skip of 2 words.
$tokenizer = new KSkipNGram(2, 3, 2);
```
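To make the skipping described at the top of this page more concrete, here is a minimal standalone sketch of the idea. It does not use the library. `skipGrams()` is a hypothetical helper written for illustration, and it rests on one reading of the technique: anchor on each word, skip over 0 through k words, then take the next n − 1 adjacent words. The tokens that `KSkipNGram` actually produces may differ in order, separator, and sentence handling.

```php
/**
 * Hypothetical helper, not part of Rubix ML. Builds skip-grams of exactly
 * $n words (assumed to be 2 or more) from a zero-indexed list of words.
 *
 * @param string[] $words
 * @return string[]
 */
function skipGrams(array $words, int $n, int $k): array
{
    $tokens = [];
    $length = count($words);

    foreach ($words as $i => $first) {
        // A gap of 0 yields the ordinary adjacent n-gram.
        for ($gap = 0; $gap <= $k; ++$gap) {
            $start = $i + 1 + $gap;

            // Stop once the remaining words cannot fill the n-gram.
            if ($start + $n - 1 > $length) {
                break;
            }

            $rest = array_slice($words, $start, $n - 1);

            $tokens[] = implode(' ', array_merge([$first], $rest));
        }
    }

    return $tokens;
}

$words = ['the', 'quick', 'brown', 'fox'];

// 1-skip-bigrams under the assumption stated above:
// the quick, the brown, quick brown, quick fox, brown fox
print_r(skipGrams($words, 2, 1));
```

With a skip of 0 the helper reduces to plain n-grams. To cover a range of token sizes like the `min` and `max` parameters do, one could call it once for each n in that range and merge the results.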