Model tutorial: Classification
With the Neuralk API you can run our foundation model NICL on classification tasks across any tabular dataset containing textual, categorical, and numerical features.
Import required libraries
The below libraries are required to run the classification tutorial examples.
import os
import numpy as np
import polars as pl
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from skrub import TableVectorizer, TextEncoder, ApplyToCols
# Neuralk imports
from neuralk import Classifier
Load credentials (see Quickstart for more)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("python-dotenv not installed, skipping .env loading")
user = os.environ.get("NEURALK_USERNAME")
password = os.environ.get("NEURALK_PASSWORD")
assert (
    user is not None and password is not None
), "Missing NEURALK_USERNAME or NEURALK_PASSWORD. Set them in your environment or a .env file."
Load your dataset
For this example we use a synthetic dataset generated with scikit-learn's make_classification.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_classes=3,
    random_state=42
)
Classify your dataset
Type 1: General classification
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")
print(f"Number of classes: {len(np.unique(y))}")
# Train the classifier
classifier = Classifier()
classifier.fit(X_train, y_train)
print("✓ Classifier fitted successfully - In our case, this only means saving the X_train and y_train in the classifier object for the inference")
# Make predictions
predictions = classifier.predict(X_test)
# Evaluate performance
acc = accuracy_score(y_test, predictions)
print(f"✓ Accuracy: {acc:.3f}")
Type 2: Classification on textual data
Since NICL operates on numeric vectors, any textual data in your dataset must first be converted into numerical features compatible with NICL.
Each text sample should first be converted into a dense representation using a text vectorization method such as a pre-trained sentence embedding model.
These embeddings are then compressed with PCA to a smaller dimension to improve computational efficiency.
We recommend keeping the first 40 principal components. This encoding ensures that the inputs meet NICL’s size requirements in most cases.
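skrub's TextEncoder, used further below, bundles both steps (embedding and dimensionality reduction). As a rough sketch of the equivalent manual pipeline, assuming the sentence-transformers package is installed and at least 40 training texts are available:
# Minimal sketch only: embed text with a pre-trained model, then reduce with PCA
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

def embed_texts(texts, n_components=40):
    embedder = SentenceTransformer("intfloat/e5-small-v2")  # pre-trained sentence embedding model
    embeddings = embedder.encode(list(texts))                # dense vectors (384 dims for this model)
    # Keep the first 40 principal components (requires at least 40 texts)
    return PCA(n_components=n_components).fit_transform(embeddings)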
For this example we use a synthetically generated dataset.
text_templates = [
    "This is a product description for electronics category",
    "Customer review about home and garden items",
    "Technical specification for automotive parts",
    "Food and beverage product information"
]
# Generate random text data (only for this tutorial)
n_samples = X.shape[0]
texts = np.random.choice(text_templates, size=n_samples).reshape(-1, 1)
# Combine numeric features with text in a Polars DataFrame (keep the numeric columns as floats)
X_text = pl.DataFrame(X, schema=[f"f_{i}" for i in range(X.shape[1])]).with_columns(
    pl.Series("text", texts.ravel())
)
# Split
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {len(X_train_text)}")
print(f"Test set size: {len(X_test_text)}")
print(f"Number of classes: {len(np.unique(y_train_text))}")
enc = TextEncoder(model_name='intfloat/e5-small-v2', n_components=40)
# Apply the text encoder to the "text" column only; other columns pass through unchanged
preprocessor = ApplyToCols(
    enc,
    cols=["text"],
    allow_reject=True
)
# Fit-transform train, transform test
X_train_encoded = preprocessor.fit_transform(X_train_text, y_train_text)
X_test_encoded = preprocessor.transform(X_test_text)
print(f"✓ Text encoded to {X_train_encoded.shape[1]} features")
# Train classifier on the encoded data
print("\n🤖 Training classifier...")
nicl_classifier = Classifier()
nicl_classifier.fit(X_train_encoded, y_train_text)
print("✓ Text classifier fitted successfully")
# Make predictions
print("\n🔮 Making predictions...")
predictions = nicl_classifier.predict(X_test_encoded)
# Evaluate performance
text_acc = accuracy_score(y_test_text, predictions)
print(f"✓ Accuracy: {text_acc:.3f}")
Type 3: Classification on categorical data
For categorical features, we use an OrdinalEncoder instead of a one-hot encoder to maintain control over input dimensionality. While one-hot encoding creates a new binary column for each possible category, the ordinal approach assigns a unique integer to each category within a feature.
This representation ensures that NICL receives a fixed number of inputs.
The preprocessing pipeline presented here represents a minimal set of steps to ensure compatibility with the model’s input expectations, such as appropriate numerical ranges, feature types, and dimensional balance. It is designed to provide optimized inputs that allow the model to function correctly.
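As a quick illustration of the dimensionality difference, here is a minimal sketch on a toy column, using scikit-learn encoders (sparse_output requires a recent scikit-learn):
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)  # shape (4, 3): one binary column per category
ordinal = OrdinalEncoder().fit_transform(colors)                   # shape (4, 1): one integer-coded column
print(onehot.shape, ordinal.shape)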
# Create synthetic mixed dataset (numerical + categorical) - only for this tutorial
np.random.seed(42)
n_samples = X.shape[0]
# Generate random categorical features (only for this tutorial)
categories_a = np.random.choice(['A', 'B', 'C'], n_samples).reshape(-1, 1)
categories_b = np.random.choice(['X', 'Y', 'Z', 'W'], n_samples).reshape(-1, 1)
X_with_categories = pl.DataFrame(X, schema=[f"f_{i}" for i in range(X.shape[1])]).with_columns(
    pl.Series("cat_a", categories_a.ravel()),
    pl.Series("cat_b", categories_b.ravel()),
)
X_train_with_categories, X_test_with_categories, y_train_with_categories, y_test_with_categories = train_test_split(
    X_with_categories, y, test_size=0.2, random_state=42, stratify=y
)
# Setup preprocessing pipeline
print("\n🔧 Setting up preprocessing pipeline...")
vec = TableVectorizer(
    numeric=Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())
    ]),
    low_cardinality=Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
    ])
)
# Fit and transform the data
X_train_processed = vec.fit_transform(X_train_with_categories)
X_test_processed = vec.transform(X_test_with_categories)
print(f"✓ Transformed shape: {X_train_processed.shape}")
# Train classifier
print("\n🤖 Training classifier...")
nicl_classifier = Classifier()
nicl_classifier.fit(X_train_processed, y_train_with_categories)
print("✓ Mixed data classifier fitted successfully")
# Make predictions
print("\n🔮 Making predictions...")
predictions = nicl_classifier.predict(X_test_processed)
# Evaluate performance
categ_acc = accuracy_score(y_test_with_categories, predictions)
print(f"✓ Accuracy: {categ_acc:.3f}")
Advanced: Classification on large datasets with context sampling
When working with very large datasets, training can become computationally expensive and time-consuming.
To manage this, it is often advised to apply row sampling: selecting a subset of the data that is representative enough to train the model while preserving its generalisation capability.
In our example, we illustrate this with random sampling of what we refer to as the context (the portion of the dataset used as the model's training input).
This approach helps balance performance and computational feasibility. As with other preprocessing steps, it is recommended to experiment with different sampling strategies and proportions to determine what best fits the data characteristics and available computational resources.
# Create large synthetic dataset (only for this tutorial)
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=8,
    n_classes=3,
    random_state=42
)
# Sampling Context Methods Example
print("=" * 60)
print("SAMPLING CONTEXT METHODS")
print("=" * 60)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
classes = np.unique(y_train)
n_classes = len(classes)
n_samples = max(int(X_train.shape[0] * 0.2), n_classes * 2)
# We ensure at least one element per class
selected_indices = []
for c in classes:
    class_indices = np.where(y_train == c)[0]
    chosen = np.random.choice(class_indices, 1, replace=False)
    selected_indices.extend(chosen)
# We sample the remaining indices
remaining_indices = np.setdiff1d(np.arange(len(y_train)), selected_indices)
n_remaining = n_samples - n_classes
extra_indices = np.random.choice(remaining_indices, n_remaining, replace=False)
selected_indices.extend(extra_indices)
sampled_X_train, sampled_y_train = X_train[selected_indices], y_train[selected_indices]
print(f"Original training set size: {X_train.shape[0]}")
print(f"Sampled training set size: {sampled_X_train.shape[0]}")
print(f"Sampling ratio: {sampled_X_train.shape[0] / X_train.shape[0]:.2%}")
# Run classification
nicl_classifier = Classifier()
nicl_classifier.fit(sampled_X_train, sampled_y_train)
predictions = nicl_classifier.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(f'Accuracy: {acc:.3f}')
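An alternative to the manual per-class loop above is a stratified subsample, which keeps the class proportions of the full training set. A minimal sketch using scikit-learn's train_test_split (the 20% ratio is illustrative):
# Stratified context sampling: keep 20% of the training rows with the original class balance
sampled_X_train, _, sampled_y_train, _ = train_test_split(
    X_train, y_train, train_size=0.2, stratify=y_train, random_state=42
)
print(f"Stratified sample size: {sampled_X_train.shape[0]}")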
As with any machine learning workflow, these steps should be viewed as a starting point rather than a fixed recipe. It is recommended to experiment with different preprocessing strategies, such as alternative encoders, scaling methods, or feature selection techniques, to identify the configuration that yields the best performance for the specific data and task at hand.
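For instance, a variant of the Type 3 preprocessing pipeline might swap the scaler and add univariate feature selection. This is a sketch only; the component choices (median imputation, min-max scaling, k=8) are illustrative, and it reuses the mixed-data splits from the categorical example above:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative variant: median imputation and min-max scaling for numeric columns
alt_vec = TableVectorizer(
    numeric=Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", MinMaxScaler())
    ]),
    low_cardinality=OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
)
X_train_alt = alt_vec.fit_transform(X_train_with_categories)
# Optional univariate feature selection on the vectorized output
selector = SelectKBest(score_func=f_classif, k=8)
X_train_selected = selector.fit_transform(np.asarray(X_train_alt), y_train_with_categories)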