Built-in selection of the most informative context
When using in-context learning, the quality of the results is directly tied to the quality of the context the model is given at fit time. Group-Wise Processing is a mechanism for optimizing this context: it ensures the model sees the most representative examples without exceeding GPU memory capacity.
Overview
When training data exceeds GPU memory capacity, the system automatically splits data into groups using a “prompter” strategy. The system supports two modes:
- Automatic Mode (default): The system automatically calculates the optimal number of groups based on available memory
- Manual Mode: You specify the prompter strategy and configuration
Automatic Mode (Default)
By default, the system handles everything automatically:
import numpy as np

# Prepare your data
X_train = np.random.randn(1_000_000, 50).astype(np.float32)
y_train = np.random.randint(0, 2, 1_000_000)
X_test = np.random.randn(10_000, 50).astype(np.float32)

# Send request - the system automatically handles grouping
clf = OnPremiseClassifier(host=<...>)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The system will:
- Split oversized data into groups and handle memory constraints automatically
Manual Mode
You can override the automatic behavior by using the strategy parameter.
clf = OnPremiseClassifier(
    host=<...>,
    strategy="random",
    n_groups=10,
)
The n_groups parameter specifies the target number of groups to split your data into. Each group will be processed separately to fit within GPU memory capacity.
Available Strategies
Feature Strategy (Custom column selection)
Groups samples based on specific features/columns from your data. This strategy allows you to use domain-specific features that are most relevant for creating meaningful groups.
How it works:
The Feature Strategy uses the values from specified columns to determine group membership. Samples with similar values in the selected grouping columns will be assigned to the same group. This is useful when you have domain knowledge about which features are most important for creating coherent groups.
clf = OnPremiseClassifier(
    host=<...>,
    strategy="feature",
    column_names=["age", "income", "score", "rating"],  # All column names in order
    selected_features=["income", "score"],  # Columns to use for grouping
    n_groups=10,  # Optional
)
clf.fit(X_train, y_train)
Parameters:
- column_names: List of all column names, in order, used to map numpy array columns
- selected_features: List of column names to use for grouping
- n_groups: (Optional) Target number of groups. If not specified, the system will determine an optimal number.
Random Strategy (Best for guaranteed even splits)
Randomly assigns samples to groups:
clf = OnPremiseClassifier(
    host=<...>,
    strategy="random",
    n_groups=15,
)
Correlation Strategy (Data-driven but may create uneven groups)
Automatically selects features based on their correlation with the target variable and groups samples using quantile-based distribution.
How it works:
The Correlation Strategy uses correlation-based feature selection to identify the most informative features for grouping:
- Feature Selection: The system automatically identifies features that are most correlated with the target variable. This data-driven approach selects features that are most predictive of the outcome.
- Quantile-based Grouping: Samples are grouped based on quantile distribution of the selected correlated features. This means samples with similar values in the most predictive features are grouped together, creating clusters that respect the natural structure of your data.
- Automatic Process: Unlike the Feature Strategy where you manually specify which columns to use, the Correlation Strategy automatically determines which features are most relevant based on statistical correlation with the target.
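As an illustration of these steps, a rough client-side sketch follows. The function name, the sum-based scoring, and the fixed number of selected features are assumptions for the sake of the example, not the service's actual implementation:

```python
import numpy as np

def correlation_grouping(X, y, n_groups=10, n_features=2):
    # 1) Feature selection: rank columns by |correlation| with the target
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top = np.argsort(corr)[-n_features:]
    # 2) Quantile-based grouping: score each sample by its selected
    #    features, then cut the scores into equal-frequency bins
    score = X[:, top].sum(axis=1)
    edges = np.quantile(score, np.linspace(0, 1, n_groups + 1))
    return np.clip(np.searchsorted(edges, score, side="right") - 1, 0, n_groups - 1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
groups = correlation_grouping(X, y, n_groups=4)  # one group id (0..3) per sample
```

Because the bins are equal-frequency, samples with similar values in the most predictive features land in the same group.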
Usage:
clf = OnPremiseClassifier(
    host=<...>,
    strategy="correlation",
    n_groups=10,
)
How it differs from Feature Strategy:
- Feature Strategy: You manually specify which columns to use for grouping based on domain knowledge. You have full control but need to know which features are relevant.
- Correlation Strategy: The system automatically selects features based on their correlation with the target variable. No manual feature selection needed, but you have less control over which features drive grouping.
- Feature Strategy: Groups samples with similar values in your manually selected features.
- Correlation Strategy: Groups samples based on quantile distribution of automatically selected correlated features, which may create more natural clusters but can result in uneven group sizes.
When to use:
- Use Correlation Strategy when you want data-driven feature selection and don’t have strong domain knowledge about which features are most relevant.
- Use Feature Strategy when you have domain expertise and want explicit control over which features drive grouping.
Precomputed Groups Strategy (Use existing group assignments)
Uses pre-existing group IDs from your data. This strategy is ideal when you have already computed optimal group assignments externally and want to use them directly.
How it works:
The system reads group IDs from a specified column in your data and uses those precomputed assignments directly, without computing groups itself.
clf = OnPremiseClassifier(
    host=<...>,
    strategy="precomputed_groups",
    column_names=["feature1", "feature2", "group_id"],  # All column names in order
    selected_features=["group_id"],  # Must be exactly one feature: the group_id column
)
Requirements:
- strategy must be "precomputed_groups"
- selected_features must contain exactly one feature name: the group_id column
- column_names must include the group_id column name to map numpy array columns correctly
Validation:
The system validates that selected_features contains exactly one feature (the group_id), then builds a training dataframe to access the group_id column and uses those precomputed assignments directly.
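A minimal sketch of preparing such data, where the external grouping is stood in for by a simple quantile binning (the binning itself is only an illustration; any externally computed IDs work the same way):

```python
import numpy as np

# Group IDs computed externally - here a simple quantile binning on one
# column stands in for whatever external method produced your groups
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2)).astype(np.float32)
edges = np.quantile(X[:, 0], [0.25, 0.5, 0.75])
group_id = np.digitize(X[:, 0], edges)  # integer ids 0..3

# Append the ids as the last column so the layout matches
# column_names=["feature1", "feature2", "group_id"]
X_with_groups = np.column_stack([X, group_id]).astype(np.float32)
```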
Automatic Mitigation
If a group exceeds capacity, the system automatically applies stratified sampling to reduce it while preserving class distribution.
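A rough sketch of what such a mitigation can look like; the service's exact procedure is not documented here, this only shows the general idea of per-class downsampling:

```python
import numpy as np

def stratified_downsample(X, y, capacity, seed=0):
    # Keep the same fraction of every class so the reduced group
    # preserves the original class distribution
    rng = np.random.default_rng(seed)
    frac = capacity / len(y)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls),
                   size=max(1, int(round(np.sum(y == cls) * frac))),
                   replace=False)
        for cls in np.unique(y)
    ])
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 4))
y = np.repeat([0, 1], [900, 100])  # imbalanced 9:1
X_s, y_s = stratified_downsample(X, y, capacity=200)  # 180 + 20 rows, still 9:1
```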
Best Practices
When to Use Manual Mode
✅ Use Manual Mode when:
- You understand your data structure and want specific grouping
- You have domain knowledge about natural clusters in your data
- You need reproducible grouping for experiments
- You want to compare different prompter strategies
❌ Avoid Manual Mode (stick with Automatic) when:
- You’re unsure about optimal configuration
- You want the system to handle memory constraints automatically
- You prioritize simplicity over control
- Your data characteristics may change between requests
Setting n_groups
Too few groups: Groups may exceed capacity; the system will raise a warning.
Too many groups: Slower training, more overhead, and less data per model.
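Assuming a roughly even split (as with strategy="random"), a lower bound for n_groups can be estimated from the per-group capacity reported in the logs. The capacity figure below is just the example value from the capacity warning shown later on this page:

```python
import math

n_rows = 1_000_000   # training rows
capacity = 368_550   # per-group row capacity (illustrative value from a capacity warning)
n_groups = math.ceil(n_rows / capacity)  # smallest group count whose even split fits: 3
```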
Monitoring
Response Metadata
Important: Metadata is only returned for manual mode (when prompter_config is provided). Automatic mode does not include metadata in the response.
{
"processing_mode": "group_wise",
"strategy": "feature",
"mode": "manual",
"n_groups": 8,
"max_group_size": 12500,
"avg_group_size": 12500.0,
"group_sizes_train": {"0": 12500, "1": 12500, "2": 12500, ...},
"group_sizes_test": {"0": 1250, "1": 1250, "2": 1250, ...}
}
Manual Mode Response (Warning - Groups Exceed Capacity):
{
"processing_mode": "group_wise",
"strategy": "correlation",
"mode": "manual",
"n_groups": 5,
"max_group_size": 450000,
"avg_group_size": 200000,
"group_sizes_train": {"0": 50000, "1": 450000, "2": 30000, ...},
"group_sizes_test": {"0": 500, "1": 4500, "2": 300, ...},
"capacity_warning": "CAPACITY WARNING: Largest group (450000 rows) exceeds capacity (368550 rows). Your configuration (strategy=correlation, n_groups=5) created uneven groups. Stratified sampling will be applied. Consider: 1) Increase n_groups to 7, 2) Use another strategy such as 'random' for even splits."
}
Automatic Mode Response:
{}
Note: Automatic mode returns an empty metadata object. All grouping is handled internally.
Key Fields (Manual Mode Only):
- processing_mode: Always “group_wise” when data exceeded capacity
- strategy: Prompter strategy used (e.g., “feature”, “random”)
- mode: Always “manual” (automatic mode returns no metadata)
- n_groups: Actual number of groups created
- max_group_size: Size of the largest training group (in rows)
- avg_group_size: Average training group size (in rows)
- group_sizes_train: Dict mapping group_id → training sample count
- group_sizes_test: Dict mapping group_id → test sample count
- capacity_warning: (Optional) Warning message if groups exceed capacity
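In manual mode it is worth checking the metadata for imbalance before trusting the results. A sketch against the warning example above; how you obtain the metadata dict depends on your client, so the dict below is hand-written:

```python
# Manual-mode metadata, abbreviated from the warning example above
metadata = {
    "n_groups": 5,
    "max_group_size": 450000,
    "avg_group_size": 200000,
    "capacity_warning": "CAPACITY WARNING: ...",
}

if metadata:  # automatic mode returns {} and carries no information
    imbalance = metadata["max_group_size"] / metadata["avg_group_size"]
    if "capacity_warning" in metadata or imbalance > 2:
        print("Uneven groups: raise n_groups or switch to strategy='random'")
```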
Troubleshooting
Groups Exceed Capacity
Problem: Logs show Largest group (X rows) exceeds capacity (Y rows)
Solutions:
- Increase n_groups in manual config
- Switch to "random" for even splits
- Use automatic mode (omit prompter_config)
Unexpected Group Count
Problem: Requested 20 groups but only 3 were created
Cause: Data-driven strategies (correlation) may create fewer groups if data naturally clusters into fewer patterns
Solution: Use strategy="random" for exact group counts
Poor Prediction Quality
Problem: Predictions worse than expected
Possible causes:
- Groups too small (a high n_groups leaves less data per model; try lowering it)
- Groups ignore natural clusters (try strategy="correlation")
- Heavy stratified sampling (groups too large; data is being downsampled)
Solution: Experiment with different strategies and monitor group sizes in logs