mRMR

package module
v1.0.0
Published: Dec 19, 2024 License: MIT Imports: 8 Imported by: 0

README

mRMR

Overview

Maximum relevance minimum redundancy (mRMR) is a filter-based feature selection method that maximizes the relevance of features to the target variable while explicitly penalizing redundancy among the selected features. Compared with commonly used wrapper-based methods such as Recursive Feature Elimination (RFE) and Boruta, it tends to select a more compact yet still informative set of features.

mRMR Variants

The core idea of mRMR is to select a feature subset S from the feature set F, such that:

$$\max_{S \subseteq F} \left[ \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c) \;-\; \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j) \right]$$

where $I(\cdot\,;\cdot)$ denotes mutual information and $c$ is the target (class) variable.

Classical mRMR employs mutual information for both relevance and redundancy calculations. However, this approach struggles with continuous data, as it requires estimating the probability density function (PDF), which is computationally expensive. As a workaround, one can

  • Discretize the data: Convert continuous features into discrete bins.
  • Use alternative metrics: the F-statistic for relevance and the Pearson correlation coefficient for redundancy. (Both workarounds are sketched below.)
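As a rough sketch of both workarounds, using helpers documented later on this page (Discretization, FStatistic, PearsonCorrelation) and assuming the package is imported as mRMR; the toy values are purely illustrative:

// Toy data: each row is an instance, each column a feature (as in DatamRMR).
data := [][]float64{
    {0.10, 1.2, 3.4},
    {0.30, 0.9, 2.8},
    {0.25, 1.1, 3.0},
    {0.80, 0.2, 1.5},
}
class := []int{0, 0, 1, 1}

// Workaround 1: discretize continuous features, then use mutual information
// on the binned values (the first return value is assumed here to be the binned data).
binned, _ := mRMR.Discretization(data, 5)
_ = binned

// Workaround 2: alternative metrics on the raw continuous values.
feature0 := []float64{0.10, 0.30, 0.25, 0.80} // column 0 of data
feature1 := []float64{1.2, 0.9, 1.1, 0.2}     // column 1 of data
rel := mRMR.FStatistic(feature0, class)            // relevance: F-statistic against the class
red := mRMR.PearsonCorrelation(feature0, feature1) // redundancy: |Pearson correlation|
_, _ = rel, red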
Normalization-Based Approach

A known drawback of mRMR is the imbalance between the two terms in the subtraction. To address this, Vinh et al. proposed normalizing each term:

$$\max_{S \subseteq F} \left[ \frac{1}{|S|} \sum_{f_i \in S} \frac{I(f_i; c)}{\log |C|} \;-\; \frac{1}{|S|^2} \sum_{f_i, f_j \in S} \frac{I(f_i; f_j)}{\log N} \right]$$

Where:

  • |C|: Number of classes.
  • N: Quantization level.

Quotient-Based Approach

Another variant of mRMR uses the quotient of relevance and redundancy instead of their difference:

$$\max_{S \subseteq F} \; \frac{\dfrac{1}{|S|} \sum_{f_i \in S} I(f_i; c)}{\dfrac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j)}$$
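To make the greedy selection concrete, the following standalone sketch (not the package's internal implementation) scores each remaining candidate under either the difference or the quotient criterion; it assumes relevance scores and a pairwise redundancy function are already available and that "math" is imported:

// nextFeature returns the index of the candidate that maximizes the mRMR
// criterion against the already-selected set.
func nextFeature(candidates, selected []int, relevance []float64,
    redundancy func(i, j int) float64, useQuotient bool) int {

    best, bestScore := -1, math.Inf(-1)
    for _, c := range candidates {
        // Average redundancy of candidate c against the selected features.
        red := 0.0
        for _, s := range selected {
            red += redundancy(c, s)
        }
        if len(selected) > 0 {
            red /= float64(len(selected))
        }

        var score float64
        if useQuotient {
            score = relevance[c] / (red + 1e-12) // quotient ("quo") criterion
        } else {
            score = relevance[c] - red // difference ("diff") criterion
        }
        if score > bestScore {
            best, bestScore = c, score
        }
    }
    return best
}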

Install

go get github.com/PQMark/mRMR

Usage

Read Data

Use the ReadCSV helper function to load and preprocess your data from a CSV file.

data, features, groups := mRMR.ReadCSV(
    "path/to/data.csv",
    []int{1, 4}, // (1-based) Irrelevant columns, e.g. columns 1 and 4
    []int{},     // (1-based) Irrelevant rows, e.g. none
    1,           // (1-based) Index for features info
    2,           // (1-based) Index for group info
    true,        // Each column is a feature, e.g. true
)

mRMR

mRMRData := mRMR.DatamRMR{X: data, Class: groups}
parasmRMR := mRMR.ParasmRMR{
    Data: mRMRData,
}
featureSelectedIndices, relevance, redundancyMap := parasmRMR.MRMR()
featureSelected := mRMR.GetFeatures(features, featureSelectedIndices)

Args for ParasmRMR (an example configuration follows this list):

  • Discretization (bool): Whether to discretize the data before feature selection.
  • BinSize (int): Number of bins used if discretization is enabled.
  • Method (string): Method for relevance/redundancy calculation.
    Options: "mi-mi", "fs-pearson", "nmi-nmi" (Default: "nmi-nmi").
  • Calculation (string): How to combine relevance and redundancy measures.
    Options: "diff", "quo".
  • MaxFeatures (int): Maximum number of features to select.
  • RedundancyMethod (string): Method for handling redundancy.
    Options: "avg", "max".
  • Threshold (float64): Controls the quantization error for normalized MI. (Default: 0.01)
  • Verbose (bool): If true, prints intermediate relevance, redundancy, and combined results.
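
For example, a fully specified configuration might look like the following; the field values are illustrative choices rather than recommendations, and data, groups, and features are reused from the Read Data step above:

paras := mRMR.ParasmRMR{
    Data:             mRMR.DatamRMR{X: data, Class: groups},
    Discretization:   true,         // bin continuous features before computing MI
    BinSize:          10,           // number of bins used for discretization
    Method:           "fs-pearson", // F-statistic relevance, Pearson redundancy
    Calculation:      "diff",       // relevance minus redundancy
    MaxFeatures:      20,           // stop after selecting 20 features
    RedundancyMethod: "avg",        // average redundancy against the selected set
    Threshold:        0.01,         // quantization-error threshold for normalized MI
    Verbose:          true,         // print intermediate scores
}
selectedIdx, relevanceAll, redundancyMap := paras.MRMR()
featureNames := mRMR.GetFeatures(features, selectedIdx)
_, _, _ = relevanceAll, redundancyMap, featureNames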

Example on MNIST

Both methods achieve a weighted F1 score above 95%. Remarkably, mRMR selects far fewer pixel features than Boruta while still maintaining comparable performance.
mRMR (mi-mi): [figure: mRMR feature importance]

Boruta: [figure: Boruta feature importance]

  • Note: The importance scores are derived from a Random Forest trained on a stratified sample of 1,000 MNIST instances (digits 0, 1, and 7) using only the selected pixels, with 20% held out for testing.

References

  1. Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform
    Zhenyu Zhao, Radhika Anand, and Mallory Wang
    arXiv preprint arXiv:1908.05376, 2019.
  2. A Novel Feature Selection Method Based on Normalized Mutual Information
    La The Vinh, Sungyoung Lee, Young-Tack Park, and Brian J. d'Auriol
    Applied Intelligence, 2011.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckIfAllNegative

func CheckIfAllNegative(data []float64) bool

func CheckIfAllSmallerOne

func CheckIfAllSmallerOne(data []float64) bool

func Delete

func Delete[T any](data []T, idx int) []T

func Discretization

func Discretization(data [][]float64, binSize int) ([][]float64, [][]float64)

func FStatistic

func FStatistic(feature []float64, class []int) float64

FStatistic returns the F-statistic of feature with respect to class.

func GetFeatures

func GetFeatures(features []string, indices []int) []string

func MinMaxNormalization

func MinMaxNormalization(data []float64) []float64

func MutualInfo

func MutualInfo[T1, T2 Numeric](data1 []T1, data2 []T2) float64

MutualInfo calculates the mutual information between two data slices.
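
A minimal usage sketch, assuming the package is imported as mRMR (the values are arbitrary):

x := []int{0, 0, 1, 1, 2, 2}
y := []int{1, 1, 0, 0, 1, 1}
fmt.Println(mRMR.MutualInfo(x, y)) // mutual information between the two discrete slices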

func PairwiseOperation

func PairwiseOperation(data1, data2 []float64, operation string) []float64

func PearsonCorrelation

func PearsonCorrelation(data1, data2 []float64) float64

PearsonCorrelation returns the absolute value of the Pearson correlation coefficient between data1 and data2.

func QuantizationError

func QuantizationError(quantizedData, originalData []float64) float64

QuantizationError returns the quantization error between quantizedData and originalData.

func QuantizationLevel

func QuantizationLevel(data [][]float64, threshold float64) int

QuantizationLevel returns the quantization level of data for the given error threshold.

func ReadCSV

func ReadCSV(filepath string, irrelevantCols, irrelevantRows []int, featureIndex, groupIndex int, colFeatures bool) ([][]float64, []string, []int)

ReadCSV reads a CSV file and returns the data, feature names, and class labels.

func RedundancyUpdate

func RedundancyUpdate(data [][]float64, featureToConsider []int, target int, redundancyMap map[[2]int]float64, redundancyFunc func([]float64, []float64) float64) map[[2]int]float64

RedundancyUpdate calculates the redundancy between each unselected feature with last selected feature and updates the redundancy map.
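
A hypothetical call, assuming feature 3 was the most recently selected feature, features 0–2 are still unselected, and data is a [][]float64 feature matrix as elsewhere on this page; PearsonCorrelation (documented above) matches the redundancyFunc signature:

redundancyMap := map[[2]int]float64{}
redundancyMap = mRMR.RedundancyUpdate(data, []int{0, 1, 2}, 3, redundancyMap, mRMR.PearsonCorrelation)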

func Relevance

func Relevance(data [][]float64, class []int, relevanceFunc func([]float64, []int) float64) []float64

Relevance computes the relevance of each feature with respect to the class and returns the scores as a slice.
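
For example, FStatistic matches the relevanceFunc signature, so per-feature relevance could be computed as follows (the data layout is assumed to follow DatamRMR, with rows as instances):

data := [][]float64{
    {0.1, 1.0},
    {0.2, 0.9},
    {0.8, 0.2},
    {0.9, 0.1},
}
class := []int{0, 0, 1, 1}
scores := mRMR.Relevance(data, class, mRMR.FStatistic)
fmt.Println(scores) // one relevance score per feature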

Types

type DatamRMR

type DatamRMR struct {
	X     [][]float64
	Class []int
}

DatamRMR holds the input dataset and its class labels. Each row of X is an instance.

type Numeric

type Numeric interface {
	int | int8 | int16 | int32 | int64 | float32 | float64
}

type ParasmRMR

type ParasmRMR struct {
	Data             DatamRMR
	Discretization   bool
	BinSize          int
	Method           string
	Calculation      string
	MaxFeatures      int
	RedundancyMethod string
	Threshold        float64
	Verbose          bool
	QLevel           int
	RelevanceFunc    func([]float64, []int) float64
	RedundancyFunc   func([]float64, []float64) float64
}

ParasmRMR holds parameters and functions needed to execute the mRMR algorithm.

func (*ParasmRMR) MRMR

func (paras *ParasmRMR) MRMR() ([]int, []float64, map[[2]int]float64)

MRMR executes the mRMR feature selection and returns:
  • selectedFeatures: the indices of the selected features
  • relevanceAll: the relevance scores of all features
  • redundancyMap: a map storing pairwise redundancy values
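
A minimal call, assuming data and groups were loaded as in the README usage above; fields left unset fall back to the package defaults:

paras := &mRMR.ParasmRMR{Data: mRMR.DatamRMR{X: data, Class: groups}}
selected, relevanceAll, redundancyMap := paras.MRMR()
fmt.Println(selected)           // indices of the selected features
fmt.Println(len(relevanceAll))  // one relevance score per original feature
fmt.Println(len(redundancyMap)) // number of pairwise redundancy values stored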
