Forge - Automated Feature Engineering Platform


Forge is an automated feature engineering platform that reduces the time data scientists spend on feature engineering by 10x. It provides automatic feature generation, intelligent selection, and seamless scikit-learn compatibility.

Features

  • Automatic Feature Generation: Generate features from numeric, categorical, temporal, and text data
  • Intelligent Feature Selection: Statistical, importance-based, and SHAP-powered selection
  • scikit-learn Compatible: Works seamlessly with sklearn pipelines
  • Data Analysis: Automatic type inference and data quality assessment
  • Missing Value Handling: Smart imputation strategies with missing indicators
  • Visualization: Built-in plotting for feature importance and correlations

Installation

pip install forge-features

With optional dependencies:

# Visualization support (extras are quoted so the brackets survive shells like zsh)
pip install "forge-features[viz]"

# SHAP-based feature selection
pip install "forge-features[shap]"

# XGBoost and LightGBM support
pip install "forge-features[boosting]"

# Everything
pip install "forge-features[all]"

Quick Start

Basic Usage

import pandas as pd
from forge import AutoFeatureTransformer

# Load your data
X = pd.read_csv("data.csv")
y = pd.read_csv("labels.csv")["target"]

# Automatic feature engineering
transformer = AutoFeatureTransformer(max_features=100)
X_engineered = transformer.fit_transform(X, y)

# View generated features
print(f"Generated {len(transformer.get_feature_names_out())} features")
print(transformer.get_feature_importance().head(10))

sklearn Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from forge import AutoFeatureTransformer

pipeline = Pipeline([
    ("features", AutoFeatureTransformer(max_features=50)),
    ("classifier", RandomForestClassifier(n_estimators=100)),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

Custom Feature Generation

from forge.generators.numeric import InteractionGenerator, PolynomialGenerator
from forge.generators.categorical import TargetEncoder
from forge.transformers import ForgePipeline

pipeline = ForgePipeline([
    ("interactions", InteractionGenerator(columns=["price", "quantity"])),
    ("polynomials", PolynomialGenerator(degree=2)),
    ("encoding", TargetEncoder(columns=["category", "region"])),
])

X_features = pipeline.fit_transform(X, y)

Data Analysis

from forge.analyzer import DataAnalyzer

analyzer = DataAnalyzer()
report = analyzer.analyze(X, y)

print(report.summary())
print(report.column_types)
print(report.quality_issues)

Feature Generators

Numeric Features

  • Aggregations: Group-by statistics (mean, sum, std, etc.)
  • Interactions: Feature multiplication, division, addition
  • Transformations: Log, sqrt, power, binning
  • Polynomials: Polynomial feature expansion
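The transformations above are standard numeric feature-engineering moves. As an illustration of what they compute (a minimal pandas sketch, not Forge's internal implementation):

```python
import numpy as np
import pandas as pd

def numeric_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative log transforms and pairwise interaction features."""
    out = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    for col in num_cols:
        # log1p handles zeros; negative values would need shifting first
        out[f"{col}_log"] = np.log1p(df[col].clip(lower=0))
    for i, a in enumerate(num_cols):
        for b in num_cols[i + 1:]:
            out[f"{a}_x_{b}"] = df[a] * df[b]
    return out

df = pd.DataFrame({"price": [10.0, 20.0], "quantity": [1, 2]})
feats = numeric_features(df)
```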

Categorical Features

  • Encoders: One-hot, target, frequency, ordinal encoding
  • Combinations: Category pair interactions
  • Statistics: Per-category target statistics
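Frequency encoding, one of the encoders listed above, maps each category to how often it occurs. A standalone sketch of the idea (illustrative, not Forge's encoder):

```python
import pandas as pd

def frequency_encode(s: pd.Series) -> pd.Series:
    """Map each category to its relative frequency in the column."""
    freq = s.value_counts(normalize=True)
    return s.map(freq)

s = pd.Series(["a", "a", "b", "c"])
enc = frequency_encode(s)  # a -> 0.5, b -> 0.25, c -> 0.25
```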

Temporal Features

  • Components: Year, month, day, hour, weekday extraction
  • Lags: Lagged feature values
  • Rolling: Rolling window statistics
  • Differences: Time deltas and intervals
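Component extraction is the simplest of these: each datetime becomes several calendar columns. A minimal pandas sketch of the same idea (not Forge's generator):

```python
import pandas as pd

def temporal_components(ts: pd.Series) -> pd.DataFrame:
    """Extract calendar components from a datetime-like column."""
    dt = pd.to_datetime(ts)
    return pd.DataFrame({
        "year": dt.dt.year,
        "month": dt.dt.month,
        "day": dt.dt.day,
        "weekday": dt.dt.weekday,  # Monday = 0
    })

comps = temporal_components(pd.Series(["2025-01-15", "2025-06-01"]))
```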

Text Features

  • Basic: Length, word count, character count
  • TF-IDF: Term frequency-inverse document frequency
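The basic text features are simple counts over each string. A quick sketch of what they look like in pandas (illustrative only):

```python
import pandas as pd

def text_features(s: pd.Series) -> pd.DataFrame:
    """Basic length-style text features: characters and whitespace-split words."""
    return pd.DataFrame({
        "char_count": s.str.len(),
        "word_count": s.str.split().str.len(),
    })

tf = text_features(pd.Series(["hello world", "one two three"]))
```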

Feature Selection

  • Statistical: Chi-square, ANOVA F-test, mutual information
  • Importance: Tree-based feature importance
  • Correlation: Remove highly correlated features
  • Variance: Remove low-variance features
  • SHAP: SHAP value-based selection
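Correlation-based selection keeps one column from each highly correlated pair. A self-contained sketch of that strategy (not Forge's selector, just the underlying idea):

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    to_drop = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and corr.loc[a, b] > threshold:
                to_drop.add(b)
    return df.drop(columns=sorted(to_drop))

df = pd.DataFrame({"x": [1, 2, 3, 4], "x2": [2, 4, 6, 8], "y": [1, 0, 1, 0]})
reduced = drop_correlated(df)  # "x2" is perfectly correlated with "x"
```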

Production Features

Drift Detection

Monitor feature distributions in production with PSI-based drift detection:

from forge import PSICalculator

# Fit on training data
drift_monitor = PSICalculator(threshold=0.2)
drift_monitor.fit(X_train)

# Check for drift in production data
drift_monitor.transform(X_prod)
summary = drift_monitor.get_drift_summary()

if summary["flagged_features"]:
    print(f"Drift detected in: {summary['flagged_features']}")
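PSI (Population Stability Index) compares the binned distribution of a feature at training time against production. A standalone sketch of the underlying calculation (not Forge's exact implementation; the 0.1/0.2 thresholds are a common rule of thumb):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    # Bin edges are fixed from the expected (training) distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)     # same distribution: PSI near 0
shifted = rng.normal(1, 1, 10_000)  # mean shift: PSI well above 0.2
```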

Memory Management

Handle large datasets with memory-aware utilities:

from forge.utils.memory import (
    estimate_feature_engineering_memory,
    process_in_chunks,
    check_memory_and_warn,
)

# Estimate memory before processing
estimate = estimate_feature_engineering_memory(X, max_features=100)
print(estimate)  # Shows peak memory, availability, recommendations

# Process large datasets in chunks
result = process_in_chunks(transformer, X, y, chunk_size=100_000)
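The idea behind chunked processing can also be hand-rolled: apply an already-fitted transformer to row slices and concatenate the results. A hypothetical helper, assuming the transformer operates row-wise (this is not part of Forge's API):

```python
import pandas as pd

def transform_in_chunks(transformer, X: pd.DataFrame, chunk_size: int = 100_000) -> pd.DataFrame:
    """Apply a fitted, row-wise transformer to X in memory-bounded chunks."""
    parts = []
    for start in range(0, len(X), chunk_size):
        chunk = X.iloc[start:start + chunk_size]
        parts.append(pd.DataFrame(transformer.transform(chunk)))
    return pd.concat(parts, ignore_index=True)

class _Doubler:
    """Stand-in for a fitted transformer."""
    def transform(self, df):
        return df * 2

X = pd.DataFrame({"a": range(10)})
out = transform_in_chunks(_Doubler(), X, chunk_size=3)
```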

Performance

Forge is designed for speed and efficiency:

| Dataset Size | Forge | sklearn (manual) | Speedup |
|--------------|-------|------------------|---------|
| 10K rows     | 0.45s | 0.52s            | 1.2x    |
| 100K rows    | 2.1s  | 8.4s             | 4x      |
| 1M rows      | 18.3s | 89.2s            | 4.9x    |

Comparison with Other Tools

| Aspect             | Forge  | Featuretools | Feature-engine | tsfresh |
|--------------------|--------|--------------|----------------|---------|
| Speed (100K rows)  | 2.1s   | 45.2s        | 3.8s           | 180s+   |
| Memory (100K rows) | 245 MB | 1,024 MB     | 289 MB         | 2+ GB   |
| sklearn native     | Yes    | No           | Yes            | No      |
| Built-in selection | Yes    | No           | No             | Yes     |

See benchmarks documentation for detailed performance analysis.

API Reference

AutoFeatureTransformer

AutoFeatureTransformer(
    max_features: int | float | None = None,           # Max features to keep
    numeric_transformations: list[str] | None = None,  # ["log", "sqrt", "bin"]
    categorical_encoding: str = "auto",                # Encoding strategy
    temporal_features: list[str] | None = None,        # Temporal components
    missing_strategy: str = "auto",                    # Imputation strategy
    selection_method: str = "importance",              # Selection method
    n_jobs: int = -1,                                  # Parallel jobs
    random_state: int | None = None,                   # Random seed
    verbose: int = 0,                                  # Verbosity level
)

Interactive Notebooks

Try Forge in your browser with our interactive notebooks:


| Notebook                     | Description                        |
|------------------------------|------------------------------------|
| 01_quickstart.ipynb          | Getting started with Forge         |
| 02_custom_generators.ipynb   | Creating custom feature generators |
| 03_feature_selection.ipynb   | Feature selection techniques       |
| 04_sklearn_integration.ipynb | sklearn pipeline integration       |
| 05_kaggle_workflow.ipynb     | Complete Kaggle workflow           |

Documentation

Full documentation is available at forge.dev/docs.

Community

We welcome contributions and feedback! Here's how to get involved:

  • GitHub Discussions: Questions, ideas, and general discussion
  • GitHub Issues: Bug reports and feature requests
  • Contributing: See CONTRIBUTING.md for guidelines

Looking for a way to contribute? Check out issues labeled good first issue.

Development

# Clone repository
git clone https://github.com/forge-features/forge.git
cd forge

# Install with dev dependencies
pip install -e ".[dev,all]"

# Run tests
make test

# Run linting
make lint

# Run type checking
make typecheck

# Run benchmarks
make benchmark

Citation

If you use Forge in your research, please cite it:

@software{forge2025,
  title = {Forge: Automated Feature Engineering for Machine Learning},
  author = {Forge Development Team},
  year = {2025},
  url = {https://github.com/forge-features/forge}
}

License

Apache License 2.0
