Controlling the sampling of the Classifier’s context
This script demonstrates how to use sampling context methods with NICL for classification.
NOTE
This illustrates an advanced use case. For a simpler classification example that does not demonstrate manual control of the sampled context, see this example.
Sampling Context Methods
When working with very large datasets, inference can become computationally expensive and time-consuming. To manage this, it is often advisable to apply row sampling: selecting a representative subset of the data to provide as context while preserving generalisation capability. In this example, we illustrate this using random sampling. As with other preprocessing steps, it is recommended to experiment with different sampling strategies and proportions to determine what best fits the characteristics of the data and the available computational resources.
WARNING
For this example to run, the environment variables NEURALK_USERNAME and
NEURALK_PASSWORD must be defined. They will be used to connect to the
Neuralk API.
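Before running the rest of the example, it can be convenient to fail early if the credentials are missing. The check below is a plain Python convenience sketch, not part of the Neuralk API:
import os
# Fail early with a clear message if the Neuralk credentials are not set.
for var in ("NEURALK_USERNAME", "NEURALK_PASSWORD"):
    if var not in os.environ:
        raise RuntimeError(f"Please define the {var} environment variable.")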
We start by generating an example dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a large synthetic classification task and hold out 10,000 rows for testing.
X, y = make_classification(
    n_samples=1_000_000, n_features=10, n_informative=8, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")
X_train.shape=(990000, 10) y_train.shape=(990000,) X_test.shape=(10000, 10) y_test.shape=(10000,)
As the dataset is quite large, it is not feasible to feed the whole training set as context to the Neuralk model when making a prediction. Therefore, if we send the whole dataset, only a sampled portion of it will be used as context.
Often, we have information about which rows are the most useful to keep in the context. For example, there may be natural groupings in our data (by date, geographical location, or other criteria), and we may want to keep examples that are similar to those for which we need a prediction, or to ensure some diversity across those groups.
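As an illustration of what such a group-aware scheme might look like, the sketch below draws a fixed number of rows per group. It uses the class labels as a stand-in grouping variable; in practice, the groups would come from domain knowledge, and the sample_per_group helper is ours, not part of the Neuralk API.
import numpy as np

def sample_per_group(groups, n_per_group, rng):
    # Draw up to `n_per_group` row indices from each group, without replacement.
    indices = []
    for group in np.unique(groups):
        rows = np.flatnonzero(groups == group)
        take = min(n_per_group, rows.shape[0])
        indices.append(rng.choice(rows, size=take, replace=False))
    return np.concatenate(indices)

# Illustrative only: stratify on the labels, keeping ~3,333 rows per class.
stratified_indices = sample_per_group(y_train, 3_333, np.random.default_rng(0))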
Here, to keep the example simple, we just sample the context uniformly at random.
import numpy as np

rng = np.random.default_rng()
# Draw 10,000 row indices uniformly at random, without replacement.
sample_indices = rng.choice(np.arange(X_train.shape[0]), size=10_000, replace=False)
sampled_X_train, sampled_y_train = X_train[sample_indices], y_train[sample_indices]
print(f"Sampling ratio: {sampled_X_train.shape[0] / X_train.shape[0]:.2%}")
Sampling ratio: 1.01%
Now we fit the classifier.
from neuralk import Classifier
classifier = Classifier()
# Fit classifier (nothing happens here as we are using a pre-trained model).
classifier.fit(sampled_X_train, sampled_y_train)
And we can make predictions for the test set.
predictions = classifier.predict(X_test)
Finally, we measure the accuracy.
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, predictions)
print(f"Accuracy: {acc:.3f}")
Accuracy: 0.947
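As mentioned earlier, it is worth experimenting with the sampling proportion. A minimal sweep over a few context sizes, reusing the objects defined above, could look like this (the sizes are arbitrary, and each fit/predict pass adds to the running time):
# Compare a few context sizes; larger contexts are more expensive per prediction.
for size in (1_000, 5_000, 10_000, 50_000):
    idx = rng.choice(X_train.shape[0], size=size, replace=False)
    clf = Classifier()
    clf.fit(X_train[idx], y_train[idx])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"Context size {size}: accuracy {acc:.3f}")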
Total running time of the script: (0 minutes 38.403 seconds)