Using the Neuralk Classifier

The Classifier is the simplest way to use Neuralk’s In-Context Learning model for classification. It exposes the usual scikit-learn classifier interface (fit and predict), so it can be dropped into an existing machine-learning pipeline.

WARNING

For this example to run, the environment variables NEURALK_USERNAME and NEURALK_PASSWORD must be defined. They are used to connect to the Neuralk API.
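A quick way to confirm that both variables are set before running the example is sketched below. The helper name missing_credentials is hypothetical and not part of the neuralk package; it only inspects the environment and never contacts the API.

```python
import os

# The two variables named in the warning above.
REQUIRED_VARS = ("NEURALK_USERNAME", "NEURALK_PASSWORD")


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of the neuralk package.
    """
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_credentials()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("Neuralk credentials found in the environment.")
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise in isolation.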

Simple example on toy data

We start by using the Classifier on simple data that needs no preprocessing.

Generate simple data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(f"{X_train.shape=} {y_train.shape=} {X_test.shape=} {y_test.shape=}")

X_train.shape=(75, 20) y_train.shape=(75,) X_test.shape=(25, 20) y_test.shape=(25,)

Now we apply Neuralk’s classifier.

from sklearn.metrics import accuracy_score
from neuralk import Classifier

# Note: nothing actually happens during fit() -- in-context learning models are
# pretrained but require no fitting on our specific dataset.
classifier = Classifier().fit(X_train, y_train)

predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.92

Working with non-numeric data

The Neuralk Classifier is a raw classifier: it does not perform any preprocessing. To handle more complex datasets, we need to encode non-numeric columns and possibly reduce the number of features. The example below shows a simple pipeline that is a reasonable starting point for most datasets.

The example dataset contains descriptions of houses and their sale prices. The prediction target is the sale price, binned into classes to turn the problem into a classification task.
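To make the binning step concrete, here is a minimal sketch of how a continuous price can be turned into class labels. It assumes equal-frequency (quantile) bins and three classes; the binning actually used by the dataset is not described here, and the function bin_prices and the toy prices are made up for illustration.

```python
import numpy as np


def bin_prices(prices, n_bins=3):
    """Map continuous prices to integer class labels 0 .. n_bins - 1.

    Illustrative only: uses quantile-based bin edges, which is an assumption
    and not necessarily how the housing dataset was prepared.
    """
    prices = np.asarray(prices, dtype=float)
    # Interior quantile edges, e.g. the 1/3 and 2/3 quantiles for 3 bins.
    edges = np.quantile(prices, np.linspace(0, 1, n_bins + 1)[1:-1])
    # np.digitize returns, for each price, the index of the bin it falls in.
    return np.digitize(prices, edges)


# Made-up prices, used only to show the mechanics.
prices = [105_000, 120_000, 150_000, 180_000, 210_000, 320_000]
labels = bin_prices(prices, n_bins=3)
print(labels)
```

With quantile edges, each class receives roughly the same number of houses, which keeps the resulting classification task balanced.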

import skrub
from neuralk import datasets

X, y = datasets.housing()

skrub.TableReport(X.assign(Sale_Price=y), max_plot_columns=100)
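The preprocessing described above (encode non-numeric columns, then optionally reduce the number of features) can be sketched with plain scikit-learn components. This is an illustration under assumptions, not the pipeline used in this example: the helper make_preprocessed_classifier and the toy table are hypothetical, and LogisticRegression stands in for the final estimator so that the snippet runs without Neuralk credentials. Since the Classifier follows the scikit-learn interface, it should be usable in the same position.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_preprocessed_classifier(final_estimator, n_components=2):
    """Encode columns, reduce dimension, then apply final_estimator.

    Hypothetical helper for illustration; not part of neuralk or skrub.
    """
    encode = ColumnTransformer(
        [
            # Scale numeric columns.
            ("numeric", StandardScaler(),
             make_column_selector(dtype_include="number")),
            # One-hot encode everything else (strings, categories).
            ("categorical", OneHotEncoder(handle_unknown="ignore"),
             make_column_selector(dtype_exclude="number")),
        ],
        sparse_threshold=0,  # always return a dense array, as PCA expects
    )
    return make_pipeline(encode, PCA(n_components=n_components), final_estimator)


# Made-up toy table with one numeric and one non-numeric column.
toy_X = pd.DataFrame(
    {
        "area": [50, 80, 120, 200, 60, 150, 90, 220],
        "neighborhood": ["north", "south", "north", "east",
                         "south", "east", "north", "east"],
    }
)
toy_y = [0, 0, 1, 1, 0, 1, 0, 1]

# LogisticRegression is a stand-in for the final classifier.
model = make_preprocessed_classifier(LogisticRegression())
model.fit(toy_X, toy_y)
print(model.predict(toy_X))
```

Keeping the preprocessing and the classifier in one pipeline object means the same transformations are applied consistently at fit and predict time.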
