Labeled#

A Labeled dataset is used to train supervised learners and to test a model by providing the ground truth. In addition to the standard dataset API, a Labeled dataset can perform operations such as stratification and sorting by the label column.

Note

Since PHP silently converts integer strings (e.g. '1') to integers in some circumstances, you should not use integer strings as class labels. Instead, use an appropriate non-integer string class name such as 'class 1', '#1', or 'first'.
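One circumstance where this conversion happens is when a string is used as an array key. The snippet below is a minimal plain-PHP sketch of that behavior; the variable names are illustrative and it does not claim to show how the library stores labels internally.

```php
<?php

// Tally labels by using each label as an array key, a common PHP idiom.
$integerStrings = ['1', '2', '1'];

$counts = [];

foreach ($integerStrings as $label) {
    $counts[$label] = ($counts[$label] ?? 0) + 1;
}

// PHP casts decimal integer strings to int when they are used as array keys,
// so the keys no longer match the original string labels.
var_dump(array_keys($counts) === [1, 2]); // bool(true)

// Non-integer strings keep their type as keys.
$safeCounts = array_count_values(['class 1', 'class 2', 'class 1']);

var_dump(array_keys($safeCounts) === ['class 1', 'class 2']); // bool(true)
```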

Parameters#

| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | samples | | array | A 2-dimensional array consisting of rows of samples and columns with feature values. |
| 2 | labels | | array | A 1-dimensional array of labels that correspond to each sample in the dataset. |
| 3 | verify | true | bool | Should we verify the data? |

Example#

use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [0.01, 5, 'furry'],
];

$labels = ['not monster', 'monster', 'not monster'];

$dataset = new Labeled($samples, $labels);

Additional Methods#

Selectors#

Return the labels of the dataset in an array:

public labels() : array

Return a single label at the given row offset:

public label(int $offset) : mixed

Return the data type of the label:

public labelType() : DataType

echo $dataset->labelType();
continuous

Return all of the possible outcomes, i.e. the unique labels, in an array:

public possibleOutcomes() : array

var_dump($dataset->possibleOutcomes());
array(2) {
    [0]=> string(5) "female"
    [1]=> string(4) "male"
}

Stratification#

Group samples by their class label and return them in their own dataset:

public stratify() : array

$strata = $dataset->stratify();
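To make the grouping concrete, here is a plain-PHP sketch of the idea using the samples and labels from the example above. `groupByLabel()` is a hypothetical helper written for illustration; it returns plain arrays rather than dataset objects and is not the library's implementation.

```php
<?php

/**
 * Group samples by their label. Hypothetical helper for illustration only.
 *
 * @param array[] $samples
 * @param string[] $labels
 * @return array[] map of label => list of samples carrying that label
 */
function groupByLabel(array $samples, array $labels): array
{
    $strata = [];

    foreach ($labels as $offset => $label) {
        $strata[$label][] = $samples[$offset];
    }

    return $strata;
}

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [0.01, 5, 'furry'],
];

$labels = ['not monster', 'monster', 'not monster'];

$strata = groupByLabel($samples, $labels);

echo count($strata['not monster']), PHP_EOL; // 2
echo count($strata['monster']), PHP_EOL; // 1
```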

Split the dataset into left and right subsets such that the proportions of class labels remain intact:

public stratifiedSplit($ratio = 0.5) : array

[$training, $testing] = $dataset->stratifiedSplit(0.8);
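The following plain-PHP sketch shows one way a stratified split can work: bucket the row offsets by class, then cut every bucket at the same ratio. `stratifiedSplitIndices()` is a hypothetical helper, and cutting each stratum at `floor(ratio * size)` is an assumption made for the sketch, not a statement about the library's internals.

```php
<?php

/**
 * Split row offsets into left and right sets while keeping class proportions.
 * Hypothetical helper: rounding each stratum down is an assumption.
 *
 * @param string[] $labels
 * @return array[] [left offsets, right offsets]
 */
function stratifiedSplitIndices(array $labels, float $ratio = 0.5): array
{
    // Bucket row offsets by their class label.
    $strata = [];

    foreach ($labels as $offset => $label) {
        $strata[$label][] = $offset;
    }

    $left = $right = [];

    // Cut every stratum at the same ratio and merge the pieces.
    foreach ($strata as $offsets) {
        $n = (int) floor($ratio * count($offsets));

        $left = array_merge($left, array_slice($offsets, 0, $n));
        $right = array_merge($right, array_slice($offsets, $n));
    }

    return [$left, $right];
}

// 8 'not monster' rows and 2 'monster' rows, i.e. 20% monsters.
$labels = array_merge(
    array_fill(0, 8, 'not monster'),
    array_fill(0, 2, 'monster')
);

[$training, $testing] = stratifiedSplitIndices($labels, 0.5);

echo count($training), ' ', count($testing), PHP_EOL; // 5 5
```

Each half ends up with four 'not monster' rows and one 'monster' row, so both keep the 20% monster share of the full set.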

Return k equal size subsets of the dataset such that class proportions remain intact:

public stratifiedFold($k = 10) : array

$folds = $dataset->stratifiedFold(3);
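Stratified folding can be pictured as dealing each class into k piles in turn. The sketch below does this in plain PHP on row offsets. `stratifiedFoldIndices()` is a hypothetical helper, and dropping the remainder of each stratum so that the folds stay equal in size is an assumption of the sketch.

```php
<?php

/**
 * Deal row offsets into k folds one class at a time so that every fold
 * receives the same number of rows from each class.
 * Hypothetical helper: leftover rows that do not divide evenly are dropped.
 *
 * @param string[] $labels
 * @return array[] list of k folds, each a list of row offsets
 */
function stratifiedFoldIndices(array $labels, int $k = 10): array
{
    if ($k < 1) {
        throw new InvalidArgumentException('k must be at least 1.');
    }

    // Bucket row offsets by their class label.
    $strata = [];

    foreach ($labels as $offset => $label) {
        $strata[$label][] = $offset;
    }

    $folds = array_fill(0, $k, []);

    foreach ($strata as $offsets) {
        // Number of rows of this class that each fold receives.
        $n = intdiv(count($offsets), $k);

        for ($i = 0; $i < $k; ++$i) {
            $folds[$i] = array_merge($folds[$i], array_slice($offsets, $i * $n, $n));
        }
    }

    return $folds;
}

$labels = array_merge(
    array_fill(0, 6, 'not monster'),
    array_fill(0, 3, 'monster')
);

$folds = stratifiedFoldIndices($labels, 3);

foreach ($folds as $fold) {
    echo count($fold), PHP_EOL; // 3 on each line
}
```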

Transform Labels#

Transform the labels in the dataset using a callback function and return self for method chaining:

public transformLabels(callable $fn) : self

Note

The callback function is called for each individual label and should return the transformed label as a continuous or categorical value.

$dataset->transformLabels('intval');

$dataset->transformLabels('floatval');

$dataset->transformLabels(function ($label) {
    switch ($label) {
        case 0:
            return 'disagree';

        case 1:
            return 'neutral';

        case 2:
            return 'agree';

        default:
            return '?';
    }
});

$dataset->transformLabels(function ($label) {
    return $label > 0.5 ? 'yes' : 'no';
});

Filter#

Filter the dataset by label:

public filterByLabel(callable $fn) : self

Note

The callback function is given a label as its only argument and should return true if the row should be kept or false if the row should be filtered out of the result.

$filtered = $dataset->filterByLabel(function ($label) {
    return $label <= 10000;
});

Sorting#

Sort the dataset by label and return self for method chaining:

public sortByLabel(bool $descending = false) : self
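As an illustration of what sorting by label involves, the plain-PHP sketch below reorders samples and labels together. `sortRowsByLabel()` and the sample values are hypothetical; the library method itself is called as `$dataset->sortByLabel(true)` for descending order.

```php
<?php

/**
 * Reorder samples and labels together by label value.
 * Hypothetical helper for illustration, not the library's implementation.
 *
 * @param array[] $samples
 * @param float[] $labels
 * @return array[] [sorted samples, sorted labels]
 */
function sortRowsByLabel(array $samples, array $labels, bool $descending = false): array
{
    // array_multisort() sorts $labels and applies the same reordering to $samples.
    array_multisort($labels, $descending ? SORT_DESC : SORT_ASC, $samples);

    return [$samples, $labels];
}

[$samples, $labels] = sortRowsByLabel(
    [[4, 'rough'], [1, 'furry'], [3, 'furry']],
    [9000.0, 250.5, 4100.0],
    true
);

print_r($labels); // 9000, 4100, 250.5 in that order
```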

Describe by Label#

Describe the features of the dataset broken down by label:

public describeByLabel() : Report

echo $dataset->describeByLabel();
{
    "not monster": [
        {
            "type": "categorical",
            "num_categories": 2,
            "probabilities": {
                "friendly": 0.75,
                "loner": 0.25
            }
        },
        {
            "type": "continuous",
            "mean": 1.125,
            "variance": 12.776875,
            "std_dev": 3.574475485997911,
            "skewness": -1.0795676577113944,
            "kurtosis": -0.7175867765792474,
            "min": -5,
            "25%": 0.6999999999999993,
            "median": 2.75,
            "75%": 3.175,
            "max": 4
        }
    ],
    "monster": [
        {
            "type": "categorical",
            "num_categories": 2,
            "probabilities": {
                "loner": 0.5,
                "friendly": 0.5
            }
        },
        {
            "type": "continuous",
            "mean": -1.25,
            "variance": 0.0625,
            "std_dev": 0.25,
            "skewness": 0,
            "kurtosis": -2,
            "min": -1.5,
            "25%": -1.375,
            "median": -1.25,
            "75%": -1.125,
            "max": -1
        }
    ]
}

Describe the Labels#

Return a report of descriptive statistics about the labels in the dataset:

public describeLabels() : Report

echo $dataset->describeLabels();
{
    "type": "categorical",
    "num_categories": 2,
    "probabilities": {
        "not monster": 0.6666666666666666,
        "monster": 0.3333333333333333
    }
}
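For categorical labels, the probabilities in the report above match the relative frequency of each class. The plain-PHP sketch below reproduces those figures for the three example labels. It only illustrates the arithmetic and is not the library's implementation.

```php
<?php

$labels = ['not monster', 'monster', 'not monster'];

// Count how often each class occurs, keyed by label.
$counts = array_count_values($labels);

$n = count($labels);

$report = [
    'type' => 'categorical',
    'num_categories' => count($counts),
    // array_map() preserves string keys when it is given a single array.
    'probabilities' => array_map(
        function (int $count) use ($n): float {
            return $count / $n;
        },
        $counts
    ),
];

echo json_encode($report, JSON_PRETTY_PRINT), PHP_EOL;
```

Here 'not monster' occurs in 2 of 3 rows and 'monster' in 1 of 3, which gives the 0.666... and 0.333... values shown in the report.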

Last update: 2021-01-26