Labeled#
A Labeled dataset pairs each sample with a label. It is used to train supervised learners and to test a model against the ground truth. In addition to the standard dataset API, a Labeled dataset supports label-based operations such as stratification and sorting by the label column.
Note
Since PHP silently converts integer strings (e.g. '1') to integers in some circumstances, you should not use integer strings as class labels. Instead, use an appropriate non-integer string class name such as 'class 1', '#1', or 'first'.
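One circumstance where this conversion is easy to observe is array keys. The following is a minimal sketch in plain PHP, independent of the library; the `countLabels` helper and the variable names are hypothetical and exist only for illustration.

```php
<?php

/**
 * Hypothetical helper: tally labels by using each label as an array key.
 */
function countLabels(array $labels): array
{
    $counts = [];

    foreach ($labels as $label) {
        $counts[$label] = ($counts[$label] ?? 0) + 1;
    }

    return $counts;
}

$integerStringCounts = countLabels(['1', '2', '1']);
$safeCounts = countLabels(['class 1', 'class 2', 'class 1']);

// PHP casts the keys '1' and '2' to the integers 1 and 2.
var_dump(array_keys($integerStringCounts) === [1, 2]); // bool(true)

// Non-integer strings keep their string type as keys.
var_dump(array_keys($safeCounts) === ['class 1', 'class 2']); // bool(true)
```

Once a label has passed through an array key in this way, a strict comparison against the original string `'1'` no longer matches, which is why non-integer class names are the safer choice.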
Parameters#
| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | samples | | array | A 2-dimensional array consisting of rows of samples and columns with feature values. |
| 2 | labels | | array | A 1-dimensional array of labels that correspond to each sample in the dataset. |
| 3 | verify | true | bool | Should we verify the data? |
Example#
use Rubix\ML\Datasets\Labeled;
$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [0.01, 5, 'furry'],
];
$labels = ['not monster', 'monster', 'not monster'];
$dataset = new Labeled($samples, $labels);
Additional Methods#
Selectors#
Return the labels of the dataset in an array:
public labels() : array
Return a single label at the given row offset:
public label(int $offset) : mixed
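As a brief usage sketch, the two selectors can be applied to the dataset from the example above. It assumes Rubix ML is installed through Composer and that the autoloader lives at `vendor/autoload.php`, which is an assumption about the project layout.

```php
<?php

// Assumed Composer autoloader path; adjust to your project layout.
require 'vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [0.01, 5, 'furry'],
];

$labels = ['not monster', 'monster', 'not monster'];

$dataset = new Labeled($samples, $labels);

// All labels, in row order.
$all = $dataset->labels();

// The label of the second row (offsets start at 0).
$second = $dataset->label(1);

echo $second, PHP_EOL;
```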
Return the data type of the label:
public labelType() : DataType
echo $dataset->labelType();
continuous
Return all possible outcomes, i.e. the unique labels, in an array:
public possibleOutcomes() : array
var_dump($dataset->possibleOutcomes());
array(2) {
    [0]=> string(6) "female"
    [1]=> string(4) "male"
}
Stratification#
Group samples by their class label and return each group as its own dataset:
public stratify() : array
$strata = $dataset->stratify();
Split the dataset into left and right subsets such that the proportions of class labels remain intact:
public stratifiedSplit($ratio = 0.5) : array
[$training, $testing] = $dataset->stratifiedSplit(0.8);
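To make the idea of preserving class proportions concrete, here is a minimal sketch of a stratified split over plain arrays. It is not the library's implementation; the function name `stratifiedSplitSketch` is hypothetical, and it returns row offsets rather than dataset objects.

```php
<?php

/**
 * Illustrative stratified split (hypothetical helper, not the library's code).
 *
 * @param array $labels One class label per row.
 * @return array Two lists of row offsets: [left, right].
 */
function stratifiedSplitSketch(array $labels, float $ratio = 0.5): array
{
    // Group the row offsets by class label.
    $strata = [];

    foreach ($labels as $offset => $label) {
        $strata[$label][] = $offset;
    }

    $left = $right = [];

    // Cut every stratum at the same ratio so class proportions carry over.
    foreach ($strata as $offsets) {
        $n = (int) floor($ratio * count($offsets));

        $left = array_merge($left, array_slice($offsets, 0, $n));
        $right = array_merge($right, array_slice($offsets, $n));
    }

    return [$left, $right];
}

$labels = ['cat', 'cat', 'cat', 'cat', 'dog', 'dog'];

[$left, $right] = stratifiedSplitSketch($labels, 0.5);

print_r($left);  // offsets 0, 1, 4
print_r($right); // offsets 2, 3, 5
```

Both subsets keep the original 2:1 ratio of 'cat' to 'dog', which a plain split at the midpoint of the array would not guarantee.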
Return k equal-size subsets of the dataset such that class proportions remain intact:
public stratifiedFold($k = 10) : array
$folds = $dataset->stratifiedFold(3);
Transform Labels#
Transform the labels in the dataset using a callback function and return self for method chaining:
public transformLabels(callable $fn) : self
Note
The callback function is called for each individual label and should return the transformed label as a continuous or categorical value.
$dataset->transformLabels('intval');
$dataset->transformLabels('floatval');
$dataset->transformLabels(function ($label) {
    switch ($label) {
        case 0:
            return 'disagree';
        case 1:
            return 'neutral';
        case 2:
            return 'agree';
        default:
            return '?';
    }
});
$dataset->transformLabels(function ($label) {
    return $label > 0.5 ? 'yes' : 'no';
});
Filter#
Filter the dataset by label:
public filterByLabel(callable $fn) : self
Note
The callback function is given a label as its only argument and should return true if the row should be kept or false if the row should be filtered out of the result.
$filtered = $dataset->filterByLabel(function ($label) {
    return $label <= 10000;
});
Sorting#
Sort the dataset by label and return self for method chaining:
public sortByLabel(bool $descending = false) : self
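A short usage sketch follows, based only on the signature above. It assumes Rubix ML is installed through Composer with the autoloader at `vendor/autoload.php`; the tiny dataset with continuous labels is made up for illustration.

```php
<?php

// Assumed Composer autoloader path; adjust to your project layout.
require 'vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;

// Illustrative data: one feature per sample, continuous labels.
$dataset = new Labeled([[3.0], [1.0], [2.0]], [30, 10, 20]);

// Ascending order by default.
$ascending = $dataset->sortByLabel()->labels();

// Pass true to sort in descending order instead.
$descending = $dataset->sortByLabel(true)->labels();

print_r($ascending);
print_r($descending);
```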
Describe by Label#
Describe the features of the dataset broken down by label:
public describeByLabel() : Report
echo $dataset->describeByLabel();
{
    "not monster": [
        {
            "type": "categorical",
            "num_categories": 2,
            "probabilities": {
                "friendly": 0.75,
                "loner": 0.25
            }
        },
        {
            "type": "continuous",
            "mean": 1.125,
            "variance": 12.776875,
            "std_dev": 3.574475485997911,
            "skewness": -1.0795676577113944,
            "kurtosis": -0.7175867765792474,
            "min": -5,
            "25%": 0.6999999999999993,
            "median": 2.75,
            "75%": 3.175,
            "max": 4
        }
    ],
    "monster": [
        {
            "type": "categorical",
            "num_categories": 2,
            "probabilities": {
                "loner": 0.5,
                "friendly": 0.5
            }
        },
        {
            "type": "continuous",
            "mean": -1.25,
            "variance": 0.0625,
            "std_dev": 0.25,
            "skewness": 0,
            "kurtosis": -2,
            "min": -1.5,
            "25%": -1.375,
            "median": -1.25,
            "75%": -1.125,
            "max": -1
        }
    ]
}
Describe the Labels#
Return a report of descriptive statistics about the labels in the dataset:
public describeLabels() : Report
echo $dataset->describeLabels();
{
    "type": "categorical",
    "num_categories": 2,
    "probabilities": {
        "not monster": 0.6666666666666666,
        "monster": 0.3333333333333333
    }
}
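The probabilities in the report above correspond to the relative frequency of each class label. As a rough sketch of how such proportions could be computed in plain PHP (an illustration, not the library's implementation), using the labels from the constructor example:

```php
<?php

$labels = ['not monster', 'monster', 'not monster'];

// Count how often each label occurs.
$counts = array_count_values($labels);

// Divide each count by the total number of labels.
$total = count($labels);

$probabilities = array_map(
    fn (int $count): float => $count / $total,
    $counts
);

echo json_encode($probabilities, JSON_PRETTY_PRINT), PHP_EOL;
```

Here 'not monster' accounts for two of the three labels and 'monster' for one, giving proportions of roughly 0.67 and 0.33.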