API References
Preprocessing Pipelines
- analysis.PreProcess.convert_to_astropy_table(data)[source]¶
Convert input data to an Astropy Table.
- Parameters
data (str, np.recarray, pd.DataFrame, or Table) – Input data, which can be a file path (CSV, FITS, TXT), a NumPy recarray, a Pandas DataFrame, or already an Astropy Table.
- Returns
Converted Astropy Table.
- Return type
Table
- Raises
ValueError – If the file type is unsupported or cannot be read.
TypeError – If the input data type is unsupported.
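A minimal sketch of how the file-path branch might pick a reader and raise ValueError for unsupported files. The helper name guess_table_format, the suffix mapping, and the format strings are illustrative assumptions, not the package's actual code.

```python
from pathlib import Path

# Assumed suffix-to-format mapping; only the three file types named in the
# parameter description are covered.
SUFFIX_TO_FORMAT = {".csv": "csv", ".fits": "fits", ".txt": "ascii"}


def guess_table_format(path):
    """Return a reader format name for a file path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"Unsupported file type: {path!r}")
    return SUFFIX_TO_FORMAT[suffix]
```

In-memory inputs (recarray, DataFrame, Table) would be handled by type checks instead, with TypeError raised for anything else.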
- analysis.PreProcess.galah_filter(star_data_in, dynamics_data_in, gaia_data_in, save_path=None)[source]¶
Applies quality cuts to GALAH, Gaia, and dynamics datasets to produce a refined sample of metal-poor, high-eccentricity stars.
This function filters stars based on data quality, chemical abundances, orbital properties, and distance uncertainties.
- Parameters
star_data_in (str, Table, np.recarray, or pd.DataFrame) – GALAH stellar data, provided as a file path (CSV, FITS, TXT) or an Astropy Table, NumPy recarray, or Pandas DataFrame.
dynamics_data_in (str, Table, np.recarray, or pd.DataFrame) – Dynamics dataset containing orbital properties (e.g., energy, eccentricity, actions).
gaia_data_in (str, Table, np.recarray, or pd.DataFrame) – Gaia dataset providing distances and photogeometric uncertainties.
save_path (str, optional) – If provided, saves the filtered dataset as a FITS file at the specified path.
- Returns
Table – An Astropy Table containing the filtered stellar sample.
Filtering Criteria
——————
**1. Data Quality Cuts** –
flag_sp == 0 → Only include stars with reliable stellar parameters.
snr_c3_iraf > 30 → Ensure good signal-to-noise ratio (SNR).
logg < 3.0 → Select only giant stars.
**2. Element Abundance Filters** –
[Fe/H]: Only stars with flag_fe_h == 0 and e_fe_h < 0.2.
[alpha/Fe]: Only stars with flag_alpha_fe == 0 and e_alpha_fe < 0.2.
[Na/Fe]: Remove unreliable Na measurements (flag_Na_fe == 0 and e_Na_fe < 0.2).
[Al/Fe]: Remove unreliable Al measurements (flag_Al_fe == 0 and e_Al_fe < 0.2).
[Mn/Fe]: Remove unreliable Mn measurements (flag_Mn_fe == 0 and e_Mn_fe < 0.2).
[Y/Fe]: Remove unreliable Y measurements (flag_Y_fe == 0 and e_Y_fe < 0.2).
[Ba/Fe]: Remove unreliable Ba measurements (flag_Ba_fe == 0 and e_Ba_fe < 0.2).
[Eu/Fe]: Remove unreliable Eu measurements (flag_Eu_fe == 0 and e_Eu_fe < 0.2).
**3. Derived Element Ratio Filters** –
[Mg/Cu]: Exclude stars with unreliable values (flag_Mg_Cu == 0, e_Mg_Cu < 0.2).
[Mg/Mn]: Exclude stars with unreliable values (flag_Mg_Mn == 0, e_Mg_Mn < 0.2).
[Ba/Eu]: Exclude stars with unreliable values (flag_Ba_Eu == 0, e_Ba_Eu < 0.2).
**4. Orbital and Kinematic Cuts** –
Eccentricity > 0.85 → Select stars on highly radial orbits.
Energy < 0 → Remove stars with unbound or positive energy.
R_ap > 5 → Require an apocenter larger than 5 kpc to focus on outer halo structures.
**5. Distance Uncertainty Cut (Gaia)** –
(r_hi_photogeo - r_med_photogeo) < 1500 pc → Reject stars with large upper uncertainty.
(r_med_photogeo - r_lo_photogeo) < 1500 pc → Reject stars with large lower uncertainty.
**6. Ensuring Data Consistency** –
The GALAH, Gaia, and dynamics datasets are matched using sobject_id.
Datasets are ordered to maintain consistency.
Duplicate entries are removed.
Output
——
- The filtered dataset is returned as an Astropy Table.
If save_path is specified, the dataset is saved as a FITS file.
Notes
These cuts aim to select metal-poor stars on extreme orbits, relevant for studies of the Galactic halo and accretion history.
Stars that pass the filters will have high-quality chemical abundances, well-measured kinematics, and accurate distances.
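The cuts above amount to combining boolean masks. The sketch below applies the quality, orbital, and distance cuts to four synthetic stars using NumPy only. The values are made up, and "ecc" is a placeholder name for the eccentricity column; the abundance cuts are omitted for brevity.

```python
import numpy as np

# Synthetic values for four stars; column names follow the cuts listed above,
# except "ecc", which is a placeholder name for the eccentricity column.
cols = {
    "flag_sp": np.array([0, 0, 0, 0]),
    "snr_c3_iraf": np.array([45.0, 25.0, 60.0, 50.0]),
    "logg": np.array([2.1, 2.5, 1.8, 2.0]),
    "ecc": np.array([0.91, 0.95, 0.60, 0.90]),
    "Energy": np.array([-1.2e5, -1.0e5, -1.5e5, -0.8e5]),
    "R_ap": np.array([8.0, 9.0, 7.0, 12.0]),
    "r_med_photogeo": np.array([4000.0, 3000.0, 2000.0, 9000.0]),
    "r_lo_photogeo": np.array([3500.0, 2800.0, 1900.0, 7000.0]),
    "r_hi_photogeo": np.array([4600.0, 3300.0, 2100.0, 11000.0]),
}

# 1. Data quality cuts
quality = (cols["flag_sp"] == 0) & (cols["snr_c3_iraf"] > 30) & (cols["logg"] < 3.0)
# 4. Orbital and kinematic cuts
orbit = (cols["ecc"] > 0.85) & (cols["Energy"] < 0) & (cols["R_ap"] > 5)
# 5. Distance uncertainty cuts (both sides must be below 1500 pc)
upper_err = cols["r_hi_photogeo"] - cols["r_med_photogeo"]
lower_err = cols["r_med_photogeo"] - cols["r_lo_photogeo"]
distance = (upper_err < 1500) & (lower_err < 1500)

mask = quality & orbit & distance
selected = {name: values[mask] for name, values in cols.items()}
```

Only the first star survives: the second fails the SNR cut, the third the eccentricity cut, and the fourth the upper distance-uncertainty cut.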
- analysis.PreProcess.apogee_filter(star_data_in, SQL=False, save_path=None)[source]¶
Applies quality cuts to APOGEE stellar data to produce a refined sample of chemically selected stars with extreme kinematics.
This function filters stars based on data quality, chemical abundances, and orbital properties to isolate metal-poor stars with extreme orbits.
If SQL=True, Gaia DR3 distances are queried using the astroquery package, and an additional filtering step removes stars with large distance errors.
- Parameters
star_data_in (str, Table, np.recarray, or pd.DataFrame) – APOGEE stellar data, provided as a file path (CSV, FITS, TXT) or an Astropy Table, NumPy recarray, or Pandas DataFrame.
SQL (bool, optional) – If True, queries Gaia DR3 for distances using astroquery.gaia and applies additional filtering based on distance uncertainties. Defaults to False.
save_path (str, optional) – If provided, saves the filtered dataset as a FITS file at the specified path.
- Returns
Table – An Astropy Table containing the filtered stellar sample.
Filtering Criteria
——————
**1. Data Quality Cuts** –
extratarg == 0 → Select only Main Red Stars (MRS).
logg < 3.0 → Restrict to giant stars.
**2. Element Abundance Filters** –
[Fe/H]: Require fe_h_flag == 0 and fe_h_err < 0.1 for reliable iron abundance.
[Al/Fe]: Require al_fe_flag == 0 and al_fe_err < 0.1 for accurate aluminum measurement.
[Ce/Fe]: Require ce_fe_flag == 0 and ce_fe_err < 0.15 for precise cerium abundance.
**3. Derived Element Ratio Filters** –
[Mg/Mn]: If missing, computed as mg_fe - mn_fe. Require: - mg_mn_flag == 0 (or mg_fe_flag == 0 & mn_fe_flag == 0 if mg_mn_flag is missing). - mg_mn_err < 0.1 for reliable measurement.
[alpha/Fe]: Constructed if missing from individual elements. Require: - alpha_fe_flag == 0 and alpha_fe_err < 0.1.
**4. Orbital and Kinematic Cuts** –
Eccentricity (ecc_50) > 0.85 → Select stars on highly radial orbits.
Energy (E_50) < 0 → Remove unbound or high-energy stars.
**5. Additional Filters (If SQL=True)** –
Queries Gaia DR3 for distances (r_med_photogeo, r_lo_photogeo, r_hi_photogeo).
Distance Uncertainty Cut: Retains only stars with: - (r_hi_photogeo - r_med_photogeo) < 1500 pc (upper bound uncertainty) - (r_med_photogeo - r_lo_photogeo) < 1500 pc (lower bound uncertainty)
**6. Ensuring Data Consistency** –
Checks for required keys before filtering.
Drops stars with missing values in ecc_50 or E_50.
Orders dataset to maintain consistency.
Ensures Gaia ID (GAIAEDR3_SOURCE_ID or dr3_source_id) is present when SQL=True.
Output
——
- The filtered dataset is returned as an Astropy Table.
If save_path is specified, the dataset is saved as a FITS file.
Notes
This selection aims to isolate metal-poor stars with extreme orbits, relevant for Galactic archaeology and halo studies.
Stars that pass the filters have high-quality chemical abundances, well-measured kinematics, and a robust selection based on APOGEE data.
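As an illustration of the derived [Mg/Mn] ratio and the consistency steps, the sketch below computes mg_mn = mg_fe - mn_fe, falls back to the individual element flags, drops rows with missing ecc_50 or E_50, and applies the orbital cuts. The arrays are synthetic and the error cuts are left out; this is not the package's implementation.

```python
import numpy as np

# Synthetic values for four stars (made up for illustration).
mg_fe = np.array([0.30, 0.25, 0.10, 0.35])
mn_fe = np.array([-0.40, -0.35, -0.05, -0.30])
mg_fe_flag = np.array([0, 0, 1, 0])
mn_fe_flag = np.array([0, 0, 0, 0])
ecc_50 = np.array([0.92, np.nan, 0.95, 0.50])
E_50 = np.array([-1.1e5, -0.9e5, -1.3e5, -1.0e5])

# Derived ratio, used when the mg_mn column is missing.
mg_mn = mg_fe - mn_fe
# Fallback flag check when mg_mn_flag is missing: both elements must be clean.
flags_ok = (mg_fe_flag == 0) & (mn_fe_flag == 0)
# Drop stars with missing orbital values before applying the orbital cuts.
finite = np.isfinite(ecc_50) & np.isfinite(E_50)

keep = finite & flags_ok & (ecc_50 > 0.85) & (E_50 < 0)
mg_mn_selected = mg_mn[keep]
```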
Extreme Deconvolution Pipeline
- class analysis.XD.XDPipeline(star_data, data_keys, data_err_keys, scaling=True)[source]¶
Bases: object
A pipeline for performing Extreme Deconvolution (XD) using a Gaussian Mixture Model (GMM).
Aims at analysing and fitting multi-dimensional stellar datasets.
The pipeline follows these key steps:
Initialisation (__init__):
Takes in stellar data as an Astropy Table, NumPy recarray, or Pandas DataFrame.
Extracts relevant features defined by data_keys and their errors data_err_keys.
Extreme Deconvolution (XD) (run_XD):
Normalises the dataset for efficient convergence. (Optional: scaling)
Runs XD over a specified range of Gaussian components.
Iterates through multiple random initialisations to ensure robust fitting.
Uses BIC and AIC scores to evaluate model performance.
Optionally saves results to a file for later analysis.
Model Comparison & Selection (compare_XD):
Compares different XD runs using Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC).
Identifies the best-fit model based on BIC or AIC scores.
Supports filtering results by a specific number of components or repeat cycle.
Generates a summary of failed runs and visualises scores across different components.
Star Assignment to Gaussian Components (assigment_XD):
Computes each star’s probability of belonging to each Gaussian component (responsibilities).
Assigns each star to the most probable Gaussian component.
Accounts for measurement uncertainties by modifying covariance matrices (error-aware).
Results Table (table_results_XD):
Constructs a summary table showing the properties of each Gaussian component.
Displays and outputs estimated weights, assigned star counts, and mean ± standard deviation for each parameter.
Plotting Results (plot_XD):
Generates a 2D scatter plot with color-coded Gaussian assignments.
Overlays Gaussian components as confidence ellipses (scaled by a z-score for different confidence levels).
Displays marginal histograms and Kernel Density Estimation (KDE) plots for feature distributions.
Includes a bar chart representing the relative weight of each Gaussian component.
- Parameters
star_data (Table, np.recarray, pd.DataFrame) – Input dataset containing stellar observations and features.
data_keys (List[str]) – List of feature name keys to be used for fitting the Gaussian Mixture Model.
data_err_keys (List[str]) – List of measurement uncertainty keys corresponding to feature keys.
scaling (bool) – If True, standardise the features to have zero mean and unit variance. If False, no scaling is applied globally, but energy-related columns (‘E_50’ or ‘Energy’) are divided by 1e5 for consistency.
- star_data¶
The input dataset - converted to an Astropy Table upon import.
- Type
Table
- feature_data¶
Extracted feature values for model fitting.
- Type
np.ndarray
- errors_data¶
Measurement errors associated with each feature.
- Type
np.ndarray
- n_samples¶
Number of stars (data points) in the dataset.
- Type
int
- n_features¶
Number of features used in the XD model.
- Type
int
- results_XD¶
Dictionary storing the results of XD fitting across different initialisations.
- Type
dict or None
- best_params¶
Best-performing XD model parameters based on the chosen optimisation metric.
- Type
dict
- filtered_best_params¶
Best-performing parameters after applying user-defined filters (e.g., fixed number of components).
- Type
dict or None
- assignment_metric¶
Specifies whether the assignments were based on the “best” or “best filtered” model.
- Type
str or None
Notes
The pipeline optionally scales input data using StandardScaler before fitting, ensuring numerical stability.
Measurement uncertainties are incorporated into the covariance matrices during model fitting.
The pipeline supports saving/loading XD results for reproducibility.
- __init__(star_data, data_keys, data_err_keys, scaling=True)[source]¶
Initialise the XDPipeline with stellar data and the keys of interest for the Gaussian Mixture Model (GMM) Extreme Deconvolution (XD) process, which define the parameter space of interest.
- Parameters
star_data (Table, np.recarray, pd.DataFrame) – Dataset containing stellar information, which can be an Astropy Table, NumPy recarray, or Pandas DataFrame.
data_keys (List[str]) – List of column names representing features used in the GMM fitting.
data_err_keys (List[str]) – List of column names representing measurement uncertainties, corresponding to data_keys.
scaling (bool) – If True, standardise the features to have zero mean and unit variance. If False, no scaling is applied globally, but energy-related columns (‘E_50’ or ‘Energy’) are divided by 1e5 for consistency.
- Raises
TypeError – If the input dataset is not a supported type.
ValueError – If data_keys and data_err_keys have mismatched lengths or contain missing columns.
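The two scaling modes can be sketched with NumPy alone. The function name scale_features is hypothetical, and rescaling the errors by the same factor as the features is an assumption (it follows from the scaling being linear, but the pipeline's handling is not documented here).

```python
import numpy as np


def scale_features(features, errors, keys, scaling=True):
    """Return (scaled_features, scaled_errors) for the two documented modes."""
    features = np.asarray(features, dtype=float)
    errors = np.asarray(errors, dtype=float)
    if scaling:
        # Zero mean, unit variance per column (population std, as StandardScaler uses).
        mean = features.mean(axis=0)
        std = features.std(axis=0)
        # Assumption: errors shrink by the same factor as the features.
        return (features - mean) / std, errors / std
    scaled_features = features.copy()
    scaled_errors = errors.copy()
    for j, key in enumerate(keys):
        if key in ("E_50", "Energy"):
            # Energy columns are divided by 1e5 even when scaling is off.
            scaled_features[:, j] /= 1e5
            scaled_errors[:, j] /= 1e5
    return scaled_features, scaled_errors
```

A constant column would give a zero standard deviation; a real implementation would need to guard against that.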
- _BICScore(log_likelihood, num_params, num_data_points)[source]¶
Compute the Bayesian Information Criterion (BIC) score.
- Parameters
log_likelihood (float) – Log-likelihood of the model.
num_params (int) – Number of free parameters in the model.
num_data_points (int) – Number of data points used in the fit.
- Returns
Computed BIC score.
- Return type
float
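For reference, the standard BIC definition is k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of data points, and L the maximised likelihood. The sketch below assumes the method follows this convention (lower is better) and that log_likelihood is the total log-likelihood, not a per-star average.

```python
import math


def bic_score(log_likelihood, num_params, num_data_points):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values are preferred."""
    return num_params * math.log(num_data_points) - 2.0 * log_likelihood
```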
- _AICScore(log_likelihood, num_params)[source]¶
Compute the Akaike Information Criterion (AIC) score.
- Parameters
log_likelihood (float) – Log-likelihood of the model.
num_params (int) – Number of free parameters in the model.
- Returns
Computed AIC score.
- Return type
float
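The standard AIC definition is 2k - 2 ln(L). The sketch below pairs it with a parameter count for a mixture of full-covariance Gaussians; whether the pipeline counts parameters this way is an assumption.

```python
def aic_score(log_likelihood, num_params):
    """Standard AIC: 2 * k - 2 * ln(L). Lower values are preferred."""
    return 2.0 * num_params - 2.0 * log_likelihood


def gmm_free_params(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture."""
    weights = n_components - 1                 # weights sum to one
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2  # symmetric
    return weights + means + covariances
```

For example, three components in two dimensions give 2 + 6 + 9 = 17 free parameters.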
- run_XD(gauss_component_range=(1, 10), max_iterations=1000000000, n_repeats=3, n_init=100, save_path=None, timings=None)[source]¶
Run Extreme Deconvolution (XD) over a range of Gaussian component counts, repeating the fit from multiple random initialisations and scoring each model with BIC and AIC.
Results are stored in results_XD and, if save_path is given, saved to a file for later analysis.
- Parameters
gauss_component_range (Tuple[int, int], optional) – The range (min, max) of Gaussian component counts to try. Default is (1, 10).
max_iterations (int, optional) – Maximum number of iterations allowed for each XD fit. Default is 1000000000.
n_repeats (int, optional) – Number of repeat cycles. Default is 3.
n_init (int, optional) – Number of random initialisations. Default is 100.
save_path (str, optional) – If provided, saves the XD results to this path.
timings (bool, optional) – If True, returns a dictionary of the average time taken for each fit (i.e. per component count).
- Returns
If timings is True, returns a dict of average fit times per component count.
- Return type
Optional[Dict[int, float]]
- Raises
TypeError – If the input dataset is not a supported type.
ValueError – If data_keys and data_err_keys differ in length or contain missing columns not present in the input dataset.
- compare_XD(opt_metric='BIC', n_gauss_filter=None, repeat_no_filter=None, save_path=None, zoom_in=None, display_full=True)[source]¶
Analyse Extreme Deconvolution (XD) results using BIC or AIC. This method identifies the best-fit model, summarizes failed runs, and visualizes score distributions. If no filters are applied, the analysis is performed on all results. Otherwise, it is performed on filtered results.
- Parameters
opt_metric (str) – Optimization metric (‘BIC’ or ‘AIC’).
n_gauss_filter (Optional[int]) – Specific number of Gaussian components to filter results by.
repeat_no_filter (Optional[int]) – Specific repeat cycle to filter results by.
save_path (Optional[str]) – Path to load XD results if not already stored in the class.
- Raises
ValueError – If results are not available in the class and no valid save_path is given.
ValueError – If opt_metric is not ‘BIC’ or ‘AIC’.
ValueError – If filter values (n_gauss_filter, repeat_no_filter) are outside valid ranges.
- Return type
None
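Model selection reduces to taking the minimum score among successful runs, optionally after filtering. The sketch below assumes a hypothetical layout in which each run is a dict with "n_gauss", "repeat", "BIC", and "AIC" keys, and failed runs carry None scores; the real results_XD structure may differ.

```python
def select_best(results, opt_metric="BIC", n_gauss_filter=None, repeat_no_filter=None):
    """Return the run with the lowest score, after optional filtering."""
    if opt_metric not in ("BIC", "AIC"):
        raise ValueError("opt_metric must be 'BIC' or 'AIC'")
    # Failed runs (score is None) are excluded from the comparison.
    candidates = [run for run in results if run[opt_metric] is not None]
    if n_gauss_filter is not None:
        candidates = [run for run in candidates if run["n_gauss"] == n_gauss_filter]
    if repeat_no_filter is not None:
        candidates = [run for run in candidates if run["repeat"] == repeat_no_filter]
    if not candidates:
        raise ValueError("No successful runs match the requested filters")
    return min(candidates, key=lambda run: run[opt_metric])
```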
- assigment_XD(assignment_metric='best')[source]¶
Assign stars to Gaussian components based on the best-fit XD model. Computes the responsibility of each Gaussian for each star and assigns the star to the component with the highest responsibility.
This method performs assignment in scaled feature space using StandardScaler to reproduce the scaling used during XD fitting. Covariance matrices are adjusted to include measurement errors, and ill-conditioned matrices are regularized to ensure numerical stability.
- Parameters
assignment_metric (str) – Selection criteria for the best-fit model (‘best’ or ‘best filtered’).
- Raises
ValueError – If no XD results are available.
ValueError – If an invalid assignment_metric is specified.
- Returns
- Updates star_data in place to include probability assignments:
prob_gauss_{i}: Probability of belonging to the i-th Gaussian component.
max_gauss: Index of the component with the highest probability (1-based index).
- Return type
None
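A minimal sketch of error-aware responsibilities, assuming diagonal per-star error covariances built from the measurement errors: each star is evaluated against every component with covariance V_k + S_i, and the weighted densities are normalised per star. Function names are hypothetical, and the regularisation of ill-conditioned matrices is not shown.

```python
import numpy as np


def error_covariances(errors):
    """Build (n, d, d) diagonal covariance matrices from per-feature errors."""
    n, d = errors.shape
    S = np.zeros((n, d, d))
    idx = np.arange(d)
    S[:, idx, idx] = errors ** 2
    return S


def responsibilities(X, S, weights, means, covs):
    """Return an (n, K) array of membership probabilities."""
    n, d = X.shape
    log_r = np.empty((n, len(weights)))
    for k in range(len(weights)):
        T = covs[k][None, :, :] + S            # component plus per-star error covariance
        diff = X - means[k]
        sol = np.linalg.solve(T, diff[:, :, None])[:, :, 0]
        maha = np.einsum("ij,ij->i", diff, sol)
        _, logdet = np.linalg.slogdet(T)
        log_r[:, k] = np.log(weights[k]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```

The 1-based max_gauss column then corresponds to `r.argmax(axis=1) + 1`.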
- table_results_XD(component_name_dict=None, combine=None, labels_combined=None)[source]¶
Generate a summary table of the Extreme Deconvolution (XD) results showing the mean and error values of each Gaussian in high-dimensional space.
For each Gaussian the table includes:
Component Name (indexed numerically or custom if a mapping is provided)
XD assigned Weight (%)
Count of assigned stars
Count as a percentage of the total assigned stars
Mean values and standard deviations for each feature parameter
- Parameters
component_name_dict (dict, optional) – A dictionary mapping component indices (0-based) to custom names. The table will be ordered according to the order of keys in this dictionary if provided.
combine (list of list of int, optional) – List of lists, where each inner list contains indices of components to be combined.
labels_combined (list of str, optional) – Labels for the combined components. Must match the number of entries in combine.
- Returns
A formatted summary of the Gaussian components fitted by XD.
- Return type
pd.DataFrame
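The columns listed above can be assembled roughly as follows. This sketch takes each component's mean from the fitted means and its standard deviation from the square root of the covariance diagonal, which is an assumption about how the table is derived. Plain dicts stand in for the DataFrame rows, and summarise_components is a hypothetical name.

```python
import numpy as np


def summarise_components(weights, means, covs, max_gauss, keys):
    """Return one summary dict per Gaussian component."""
    max_gauss = np.asarray(max_gauss)
    n_total = len(max_gauss)
    rows = []
    for k, weight in enumerate(weights):
        count = int(np.sum(max_gauss == k + 1))   # max_gauss is 1-based
        row = {
            "component": k + 1,
            "weight_pct": 100.0 * float(weight),
            "count": count,
            "count_pct": 100.0 * count / n_total,
        }
        std = np.sqrt(np.diag(covs[k]))            # per-feature standard deviation
        for j, key in enumerate(keys):
            row[key] = (float(means[k][j]), float(std[j]))
        rows.append(row)
    return rows
```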
- plot_XD(x_key, y_key, z_score=2.0, full_survey_file=None, color_palette=None, xlim=None, ylim=None, legend=None)[source]¶
Creates a 2D plot of the Extreme Deconvolution (XD) results, displaying: - Individual stars colored by their assigned Gaussian component - Gaussian mixture model (GMM) components as confidence ellipses - Marginal histograms and KDE distributions for each axis - A bar chart representing the relative weight of each Gaussian component - Optional 2D histogram of full survey sample as grayscale background
The confidence ellipses are scaled according to a given z-score, providing a visual representation of the spread of each Gaussian component.
- Parameters
x_key (str) – The column name corresponding to the x-axis variable.
y_key (str) – The column name corresponding to the y-axis variable.
z_score (float, optional) – The z-score that scales the Gaussian ellipses. Defaults to 2.0, i.e. 2σ contours; 2σ spans about 95% of a one-dimensional Gaussian, while a 2σ ellipse encloses about 86% of a two-dimensional Gaussian.
full_survey_file (str, optional) – Path to FITS file of the full survey sample for reference background.
color_palette (list, optional) – List of colors to use for each Gaussian component.
xlim (tuple, optional) – Tuple (min, max) to manually set x-axis limits on the main plot.
ylim (tuple, optional) – Tuple (min, max) to manually set y-axis limits on the main plot.
legend (tuple, optional) – Tuple (x, y) to manually set legend position on the main plot.
- Raises
ValueError – If the XD analysis has not been performed before plotting. If the provided x_key or y_key is not found in the dataset.
- Return type
None
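A z-score-scaled ellipse can be derived from a 2x2 covariance matrix through its eigen-decomposition. The sketch below returns the full width, full height, and rotation angle in degrees, which is the form matplotlib's Ellipse patch expects. The helper names are hypothetical and the plotting itself is omitted.

```python
import numpy as np


def ellipse_params(cov, z_score=2.0):
    """Return (width, height, angle_deg) of the z-sigma ellipse for a 2x2 covariance."""
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    major_axis = eigvecs[:, 1]                 # direction of largest variance
    width = 2.0 * z_score * np.sqrt(eigvals[1])
    height = 2.0 * z_score * np.sqrt(eigvals[0])
    angle = np.degrees(np.arctan2(major_axis[1], major_axis[0]))
    return width, height, angle


def enclosed_fraction_2d(z_score):
    """Probability mass of a 2D Gaussian inside its z-sigma ellipse."""
    return 1.0 - np.exp(-0.5 * z_score ** 2)
```

For z_score = 2.0 the enclosed fraction of a 2D Gaussian is about 0.865.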
- plot_nonXD(x_key, y_key, z_score=2.0, full_survey_file=None, color_palette=None, xlim=None, ylim=None)[source]¶
Creates a 2D diagnostic plot of the clustering results from Extreme Deconvolution (XD) assignments, without relying on the original XD model parameters (means/covariances).
Notes
This method:
Displays individual stars colored by their assigned Gaussian component (from XD).
Fits new 2D Gaussians (empirically) to each component in the projection space (x_key vs y_key).
Overlays 2σ confidence ellipses from these fitted Gaussians.
Adds marginal histograms and overlaid Gaussian projections for each axis.
Plots a bar chart summarizing the relative weight of each component.
Optionally includes a background density reference (e.g. full APOGEE–Gaia sample) via a 2D histogram.
- Parameters
x_key (str) – The column name for the x-axis variable.
y_key (str) – The column name for the y-axis variable.
z_score (float, optional) – Scaling factor for the ellipses in units of σ (default is 2.0; about 95% per axis in one dimension, about 86% of a two-dimensional Gaussian).
full_survey_file (str, optional) – Path to a FITS file containing a reference survey sample (e.g. total APOGEE) to be plotted as a grayscale 2D histogram in the background.
color_palette (list, optional) – Custom list of colors to assign to each Gaussian component.
xlim (tuple, optional) – Limits for the x-axis, e.g. (-2, 0.5).
ylim (tuple, optional) – Limits for the y-axis, e.g. (-0.5, 0.5).
- Raises
ValueError – If the XD assignment has not been performed before plotting. If the provided x_key or y_key is not present in the dataset.
- Return type
None
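Fitting empirical 2D Gaussians to the assigned stars comes down to a sample mean and sample covariance per component in the chosen projection. The sketch below uses a hypothetical helper name and skips components with fewer than two stars, which is an assumption made to keep the covariance defined.

```python
import numpy as np


def fit_projected_gaussians(x, y, labels):
    """Return {component: (mean, covariance)} fitted in the (x, y) projection."""
    x, y, labels = np.asarray(x), np.asarray(y), np.asarray(labels)
    fits = {}
    for k in np.unique(labels):
        points = np.column_stack([x[labels == k], y[labels == k]])
        if len(points) < 2:
            continue                           # covariance undefined for one star
        fits[int(k)] = (points.mean(axis=0), np.cov(points, rowvar=False))
    return fits
```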
- analysis.XD.compare_assignments(results_table, target_label, label_map, fractional_threshold=50.0)[source]¶
For stars primarily assigned to target_label, this function:
Finds their second-best Gaussian component assignments
Summarizes how often each second-best component occurs
- Calculates the mean, median, and standard deviation of:
The percent of the first-best probability that the second-best achieved
The absolute difference in probability between the first and second-best
- Parameters
results_table (Astropy Table) – Contains Gaussian Mixture Model results.
target_label (str) – Name of the component you are analysing (e.g., “Aurora”); it is converted to its integer index via label_map.
label_map (dict) – Maps integer indices (starting from 1) to the astrophysical names of the components.
fractional_threshold (float, optional) – Percentage threshold to filter second-best components.
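The second-best analysis can be sketched on a plain probability matrix (one row per star, one column per component). The function name is hypothetical, and the threshold is assumed to keep stars whose second-best probability reaches at least fractional_threshold percent of the best one; rows with tied probabilities are not treated specially.

```python
import numpy as np


def second_best_summary(probs, target_index, fractional_threshold=50.0):
    """Summarise second-best components for stars whose best is target_index (1-based)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs, axis=1)
    best, second = order[:, -1], order[:, -2]
    rows = np.flatnonzero(best + 1 == target_index)
    p_best = probs[rows, best[rows]]
    p_second = probs[rows, second[rows]]
    percent = 100.0 * p_second / p_best        # second-best as a percent of best
    abs_diff = p_best - p_second
    keep = percent >= fractional_threshold     # assumed meaning of the threshold
    labels, counts = np.unique(second[rows][keep] + 1, return_counts=True)
    return {
        "counts": {int(k): int(c) for k, c in zip(labels, counts)},
        "percent": percent[keep],
        "abs_diff": abs_diff[keep],
    }
```

The mean, median, and standard deviation follow from np.mean, np.median, and np.std applied to the "percent" and "abs_diff" arrays.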
GMM on UMAP Projections Pipeline
- class analysis.Reduced_GMM.ReducedGMMPipeline(star_data, data_keys, error_data_keys, umap_dimensions=2, umap_n_neighbors=15, umap_min_dist=0.1)[source]¶
Bases: object
A pipeline for clustering stellar data using Gaussian Mixture Models (GMM) applied to UMAP-reduced feature space.
Designed to identify and analyse structure in high-dimensional stellar datasets by first projecting them into a lower-dimensional manifold using UMAP, and then clustering in this reduced space. Subsequent interpretation is performed both in the low-dimensional space and in the original parameter space.
The pipeline follows these key steps:
Initialisation (__init__):
Accepts stellar data in various formats: Astropy Table, NumPy recarray, or Pandas DataFrame.
Extracts the input features defined in data_keys.
Standardises the data and performs dimensionality reduction using UMAP (typically to 2D for visualisation and clustering).
Gaussian Mixture Model Fitting (run_GMM):
Applies GMM clustering in the UMAP-reduced space.
Runs GMM for a user-defined range of component numbers and initialisations.
Stores log-likelihood, BIC, AIC scores, model weights, means, covariances, and predicted labels.
Model Comparison & Selection (compare_GMM):
Compares fitted GMM models using BIC or AIC to select the optimal number of Gaussian components.
Allows filtering to evaluate a specific component count manually.
Assigns each star to a component based on the best (or filtered best) model.
Cluster Visualisation (plot_GMM_umap):
Generates a 2D scatter plot in UMAP space, coloured by GMM cluster assignments.
Overlays confidence ellipses around each Gaussian component.
Displays marginal histograms and Gaussian curves for UMAP axes.
Includes a bar chart summarising the weight of each component.
High-Dimensional Interpretation (table_results_GMM):
Computes and tabulates the mean and standard deviation of each original input feature per GMM cluster.
Supports custom cluster names and grouped/combined cluster analysis.
Helps relate low-dimensional clusters to their physical meaning in the original feature space.
- Parameters
star_data (Table, np.recarray, pd.DataFrame) – Input dataset containing stellar observations and features.
data_keys (List[str]) – List of feature names (column keys) to be used for UMAP projection and back-analysis.
error_data_keys (List[str]) – List of error feature names (column keys) corresponding to the data keys. Must be same length and order as data_keys.
umap_dimensions (int) – Number of dimensions to project the data into using UMAP (default is 2).
umap_n_neighbors (int) – Number of UMAP neighbors used for local structure preservation (default is 15).
umap_min_dist (float) – Minimum distance between points in UMAP space; controls clustering tightness (default is 0.1).
- star_data¶
The input dataset converted to an Astropy Table.
- Type
Table
- feature_data¶
Extracted original feature values used for scaling and reference.
- Type
np.ndarray
- feature_data_scaled¶
Standardised version of the original features used for UMAP.
- Type
np.ndarray
- umap_data¶
Lower-dimensional representation of the data after UMAP projection.
- Type
np.ndarray
- results_GMM¶
Stores GMM fitting results (log-likelihood, BIC/AIC scores, weights, means, covariances, labels).
- Type
dict or None
- best_params¶
Parameters of the best GMM model selected using the chosen metric.
- Type
dict
- filtered_best_params¶
Parameters from a user-filtered GMM model (e.g., fixed number of components).
- Type
dict or None
- assignment_metric¶
Indicates whether clustering assignments are from the “best” or “best filtered” model.
- Type
str or None
Notes
All clustering is performed in the UMAP-reduced space.
Final cluster properties are summarised in both reduced and full feature spaces.
Supports flexible visualisation, label customisation, and component combination for interpretation.
- __init__(star_data, data_keys, error_data_keys, umap_dimensions=2, umap_n_neighbors=15, umap_min_dist=0.1)[source]¶
Initialise the ReducedGMMPipeline for UMAP-based dimensionality reduction and GMM clustering.
This method sets up the pipeline by validating the input stellar dataset, extracting specified features, applying standard scaling, and reducing the feature space to a lower-dimensional UMAP representation. Subsequent GMM clustering and interpretation can then be performed in this reduced space and mapped back to the original feature space.
- Parameters
star_data (Table, np.recarray, or pd.DataFrame) – Stellar dataset containing the features of interest. Can be an Astropy Table, NumPy recarray, or Pandas DataFrame.
data_keys (List[str]) – List of column names specifying the features to use for dimensionality reduction and clustering.
error_data_keys (List[str]) – List of column names specifying the errors corresponding to the features in data_keys. Must be the same length and order as data_keys.
umap_dimensions (int, optional) – Target number of UMAP dimensions (default is 2).
umap_n_neighbors (int, optional) – Number of neighbors considered by UMAP for local structure (default is 15).
umap_min_dist (float, optional) – Minimum distance between points in UMAP space (default is 0.1).
- Raises
TypeError – If the input dataset is not a supported type.
ValueError – If any of the keys in data_keys or ‘max_gauss’ are missing from the dataset.
- display_umap(label_dict=None, colour_dict=None)[source]¶
Visualize the 2D UMAP projection of the data colored by cluster labels.
- Parameters
label_dict (dict, optional) – A mapping from numeric GMM cluster labels (e.g., 1 to 7) to string names (e.g., ‘GS/E’). If not provided, numeric labels are used directly.
colour_dict (dict, optional) – A mapping from string names to matplotlib-compatible colors. Only used if label_dict is provided and names are available.
- Return type
None
- run_GMM(gauss_component_range=(1, 10), n_init=10, save_path=None, timings=None)[source]¶
Fit Gaussian Mixture Models (GMMs) to the UMAP-reduced stellar data across a range of component numbers.
This method fits GMMs with varying numbers of Gaussian components to the UMAP-reduced feature space. For each model, it computes key metrics (log-likelihood, BIC, AIC), stores the GMM parameters (weights, means, covariances), and predicts the cluster labels. Results are saved to disk if a path is provided.
- Parameters
gauss_component_range (Tuple[int, int], optional) – The range (min, max) of Gaussian components to try. Default is (1, 10).
n_init (int, optional) – Number of random initialisations for each GMM. Default is 10.
save_path (str, optional) – If provided, saves the dictionary of GMM results to this path as a pickle file.
timings (bool, optional) – If True, returns a dictionary of the average time taken per GMM fit for each component count.
- Returns
If timings is True, a dict mapping each component count to its average fit time; otherwise None.
- Return type
Optional[Dict[int, float]]
- Raises
ValueError – If gauss_component_range is not a valid tuple of two integers, or min > max.
TypeError – If n_init is not a positive integer.
Notes
GMM is applied to the UMAP-reduced data stored in self.umap_data.
For each component count, sklearn’s GaussianMixture runs n_init initialisations and keeps the best-scoring fit.
The results are stored in self.results_GMM and optionally written to disk.
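The fitting loop described above might look roughly like the sketch below. It uses scikit-learn's `GaussianMixture` on synthetic 2D data standing in for self.umap_data. The layout of the `results` dictionary and the inclusive treatment of the component range are assumptions, since the source does not show them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Stand-in for self.umap_data: two well-separated 2D blobs.
umap_like = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
    rng.normal([5.0, 5.0], 0.5, size=(150, 2)),
])

gauss_component_range = (1, 4)
results = {}
# Assumption: both ends of the range are included.
for k in range(gauss_component_range[0], gauss_component_range[1] + 1):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(umap_like)
    results[k] = {
        # score() returns the mean per-sample log-likelihood, so rescale to a total.
        "log_likelihood": gmm.score(umap_like) * len(umap_like),
        "BIC": gmm.bic(umap_like),
        "AIC": gmm.aic(umap_like),
        "weights": gmm.weights_,
        "means": gmm.means_,
        "covariances": gmm.covariances_,
        "labels": gmm.predict(umap_like),
    }

for k, res in results.items():
    print(k, round(res["BIC"], 1), round(res["AIC"], 1))
```

Lower BIC or AIC indicates a better trade-off between fit quality and model complexity, which is the quantity compare_GMM selects on.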
- compare_GMM(opt_metric='BIC', n_gauss_filter=None, save_path=None, display_full=True, zoom_in=None)[source]¶
Compare Gaussian Mixture Model (GMM) fits using a selected metric and assign stars to clusters.
The method assigns stars to Gaussian components, stores the best-fit parameters, and optionally visualizes the model scores.
- Parameters
opt_metric (str, optional) – The metric used for model selection. Must be either ‘BIC’ or ‘AIC’. Default is ‘BIC’.
n_gauss_filter (int, optional) – If provided, only results with this number of components will be used for selection and assignment.
save_path (str, optional) – Path to load previously saved GMM results if self.results_GMM is not already populated.
display_full (bool, optional) – If True, prints a model comparison table and plots BIC/AIC scores. Default is True.
zoom_in (List[int], optional) – A list of component numbers to zoom in on in the plot (used for inset view of BIC/AIC curves).
- Raises
ValueError – If the GMM results are not available and no save path is provided, if opt_metric is not ‘BIC’ or ‘AIC’, or if n_gauss_filter is outside the range of fitted components.
- Return type
None
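The selection step can be pictured with the small sketch below. The function `select_model` and the score values are invented for illustration. The only behaviour taken from the description above is that the lowest BIC or AIC wins and that n_gauss_filter restricts the choice to one component count.

```python
def select_model(results, opt_metric="BIC", n_gauss_filter=None):
    """Hypothetical helper: return the component count with the lowest score."""
    if opt_metric not in ("BIC", "AIC"):
        raise ValueError("opt_metric must be 'BIC' or 'AIC'")
    candidates = results
    if n_gauss_filter is not None:
        if n_gauss_filter not in results:
            raise ValueError("n_gauss_filter is outside the range of fitted components")
        candidates = {n_gauss_filter: results[n_gauss_filter]}
    return min(candidates, key=lambda k: candidates[k][opt_metric])

# Toy scores (not real results) where BIC and AIC disagree.
scores = {
    1: {"BIC": 2510.0, "AIC": 2490.0},
    2: {"BIC": 1830.0, "AIC": 1790.0},
    3: {"BIC": 1845.0, "AIC": 1785.0},
}
best_bic = select_model(scores)                        # 2
best_aic = select_model(scores, opt_metric="AIC")      # 3
forced = select_model(scores, n_gauss_filter=3)        # 3
print(best_bic, best_aic, forced)
```

BIC penalises extra components more heavily than AIC once the sample is larger than a handful of stars, so the two metrics can prefer different component counts, as in the toy scores here.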
- plot_GMM_umap(z_score=2.0, color_palette=None, xlim=None, ylim=None)[source]¶
Visualize GMM clustering results in the 2D UMAP-reduced space with component ellipses and marginal histograms.
This method creates a comprehensive figure showing:
A scatter plot of stars in the 2D UMAP space colored by their GMM-assigned components.
Ellipses representing 2D confidence intervals (z-score-scaled) for each Gaussian component.
Marginal histograms for UMAP-1 and UMAP-2 projections overlaid with individual and total Gaussian fits.
A bar chart showing the relative weights of each Gaussian component.
- Parameters
z_score (float, optional) – Z-score used to scale the confidence ellipses. Default is 2.0 (a 2σ ellipse; ~95% holds per axis in 1D, while a 2D Gaussian has about 86% of its probability mass inside its 2σ ellipse).
color_palette (list, optional) – List of colors to use for the components. If None, defaults to seaborn “tab10” palette.
xlim (tuple, optional) – Limits for the x-axis (UMAP-1). If None, inferred from data.
ylim (tuple, optional) – Limits for the y-axis (UMAP-2). If None, inferred from data.
- Raises
ValueError – If no assignment metric is found. The GMM comparison must be run first.
- Return type
None
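One common way to turn a component covariance into a z-score-scaled ellipse is an eigendecomposition, sketched below. Whether plot_GMM_umap builds its ellipses this way is an assumption, and both helper names are hypothetical. The second helper gives the probability mass a 2D Gaussian places inside such an ellipse.

```python
import numpy as np

def ellipse_params(cov, z_score=2.0):
    """Full width, full height and rotation (degrees) of a z-sigma ellipse."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    major = eigvecs[:, -1]                  # direction of largest variance
    width = 2.0 * z_score * np.sqrt(eigvals[-1])
    height = 2.0 * z_score * np.sqrt(eigvals[0])
    angle = np.degrees(np.arctan2(major[1], major[0]))
    return width, height, angle

def enclosed_fraction_2d(z_score):
    """Mass of a 2D Gaussian inside its z-sigma ellipse: 1 - exp(-z^2 / 2)."""
    return 1.0 - np.exp(-0.5 * z_score**2)

cov = np.array([[4.0, 0.0], [0.0, 1.0]])
width, height, angle = ellipse_params(cov, z_score=2.0)
fraction = enclosed_fraction_2d(2.0)
print(width, height, fraction)
```

The three values map directly onto the `width`, `height` and `angle` arguments of `matplotlib.patches.Ellipse`. The sign of an eigenvector is arbitrary, so the angle is only defined modulo 180 degrees.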
- table_results_GMM(component_name_dict=None, combine=None, labels_combined=None, deconvolve=False)[source]¶
Generate a summary table of the GMM components, projecting labels from UMAP space back to the original feature space.
For each Gaussian component, the table reports:
GMM weight (%), assigned star count, and count fraction
Mean ± standard deviation for each feature in self.data_keys
Optional:
Rename and reorder components using component_name_dict
Combine selected components with combine and labels_combined
Deconvolve observational uncertainties from the feature standard deviations using deconvolve=True
- Parameters
component_name_dict (dict, optional) – Maps component indices to custom names and defines display order.
combine (list of list of int, optional) – List of component index groups to aggregate.
labels_combined (list of str, optional) – Labels for each group in combine.
deconvolve (bool, optional) – If True, subtracts the mean squared observational error (from self.data_err_keys) from the variance of each feature before computing the standard deviation. Assumes independent Gaussian errors. Requires self.data_err_keys to be defined and aligned with self.data_keys.
- Returns
Table summarising component statistics in original feature space.
- Return type
pd.DataFrame
- Raises
ValueError – If assignments haven’t been computed or combine/labels_combined lengths mismatch.
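The deconvolve option amounts to subtracting the mean squared error from the observed variance before taking the square root. A minimal sketch on synthetic data follows. The helper name and the clipping of negative variances to zero are assumptions, since the source does not say how that case is handled.

```python
import numpy as np

def deconvolved_std(values, errors):
    """Intrinsic spread: sqrt(observed variance - mean squared error)."""
    observed_var = np.var(values, ddof=1)
    noise_var = np.mean(np.square(errors))
    # Assumption: clip at zero when the errors exceed the observed scatter.
    return np.sqrt(max(observed_var - noise_var, 0.0))

rng = np.random.default_rng(3)
n = 20000
true_sigma, error_sigma = 0.20, 0.15
errors = np.full(n, error_sigma)
# Observed values = intrinsic scatter + independent Gaussian measurement noise.
values = rng.normal(0.0, true_sigma, n) + rng.normal(0.0, errors)

observed = np.std(values, ddof=1)
intrinsic = deconvolved_std(values, errors)
print(round(observed, 3), round(intrinsic, 3))
```

With independent Gaussian errors the variances add, so the observed spread here is close to sqrt(0.20² + 0.15²) = 0.25 and the deconvolved value comes back near the intrinsic 0.20.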
- plot_highdim_gaussian(x_key, y_key, z_score=2.0, full_survey_file=None, color_palette=None, xlim=None, ylim=None, deconvolve=False, legend=None)[source]¶
Visualize GMM component assignments in high-dimensional space for two selected features.
Generates a 2D scatter plot of stars colored by GMM component, with:
Confidence ellipses estimated from the empirical mean and covariance of each component
Marginal histograms with overlaid Gaussian fits
Optional background 2D histogram from a reference survey
Top-right bar chart showing GMM component weights
- Parameters
x_key (str) – Name of the column to plot on the x-axis.
y_key (str) – Name of the column to plot on the y-axis.
z_score (float, optional) – Controls the size of the Gaussian confidence ellipses. Default is 2.0 (a 2σ ellipse, enclosing about 86% of a 2D Gaussian’s probability mass).
full_survey_file (str, optional) – Path to FITS file for background sample (e.g., full survey for density plot).
color_palette (list, optional) – List of custom colors for components. If None, uses Seaborn’s ‘husl’ palette.
xlim (tuple, optional) – x-axis limits (min, max).
ylim (tuple, optional) – y-axis limits (min, max).
deconvolve (bool, optional) – If True, subtracts the mean squared observational errors (from self.data_err_keys) from the empirical covariance matrix of each component before plotting the ellipses. This reveals the intrinsic spread of each GMM component, assuming independent Gaussian measurement errors in x and y.
legend (tuple, optional) – If provided, a tuple of (x, y) coordinates for the legend position.
- Raises
ValueError – If the assignment method hasn’t been run or if input keys aren’t found in the data table.
- Return type
None
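For the two plotted features, the deconvolve option can be pictured as subtracting a diagonal noise matrix from the empirical covariance. The sketch below is one possible reading, with a hypothetical helper name and an added validity check that the source does not mention.

```python
import numpy as np

def deconvolve_covariance(xy, x_err, y_err):
    """Subtract mean squared errors from the diagonal of the empirical covariance."""
    cov = np.cov(xy, rowvar=False)
    # Independent errors in x and y only inflate the diagonal terms.
    noise = np.diag([np.mean(np.square(x_err)), np.mean(np.square(y_err))])
    intrinsic_cov = cov - noise
    # Assumption: reject results that are no longer a valid covariance matrix.
    if np.any(np.linalg.eigvalsh(intrinsic_cov) <= 0):
        raise ValueError("measurement errors exceed the observed scatter")
    return intrinsic_cov

rng = np.random.default_rng(7)
n = 50000
true_cov = np.array([[1.0, 0.6], [0.6, 0.8]])
x_err = np.full(n, 0.5)
y_err = np.full(n, 0.3)
clean = rng.multivariate_normal([0.0, 0.0], true_cov, size=n)
noisy = clean + np.column_stack([rng.normal(0.0, x_err), rng.normal(0.0, y_err)])

recovered = deconvolve_covariance(noisy, x_err, y_err)
print(recovered.round(2))
```

The off-diagonal term is untouched by the subtraction, so the correction makes the ellipse smaller and slightly changes its axis ratio and tilt.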
Exploring Data Reduction and Clustering Functions¶
- analysis.Dimreduce.investigate_umap(table_path, data_keys, label_column, labels_name, labels_color_map, n_neighbors_list=[15, 15, 15], min_dist_list=[0.1, 0.3, 1], cluster_method=None, n_components_gmm=5, min_cluster_size_hdbscan=30, min_samples_hdbscan=1, axis_label_fontsize=18, tick_fontsize=18, title_fontsize=19, legend_fontsize=15)[source]¶
Visualizes UMAP dimensionality reduction results for a high-dimensional dataset and optionally applies clustering (GMM or HDBSCAN) in the reduced space.
- Parameters
table_path (str) – Path to the FITS file containing the dataset.
data_keys (list of str) – List of column names to use as input features for UMAP.
label_column (str) – Column name containing original cluster assignments for coloring true label plots.
labels_name (dict) – Dictionary mapping numerical cluster IDs to string labels (e.g. {1: ‘GS/E’, 2: ‘Splash’}).
labels_color_map (dict) – Dictionary mapping string labels to matplotlib-compatible color codes.
n_neighbors_list (list of int, optional) – List of UMAP n_neighbors values, one per column of the plot grid.
min_dist_list (list of float, optional) – List of UMAP min_dist values, one per column of the plot grid.
cluster_method (str or None, optional) – If specified, applies unsupervised clustering in UMAP space. Options: ‘GMM’ (Gaussian Mixture Model clustering; requires n_components_gmm), ‘HDBSCAN’ (HDBSCAN clustering; requires min_cluster_size_hdbscan and min_samples_hdbscan), or None (disables clustering and shows only the UMAP projection colored by the original labels).
n_components_gmm (int, optional) – Number of clusters to fit for GMM if cluster_method=’GMM’. Default is 5.
min_cluster_size_hdbscan (int, optional) – Minimum cluster size for HDBSCAN. Only used if cluster_method=’HDBSCAN’.
min_samples_hdbscan (int, optional) – Minimum samples for HDBSCAN. Only used if cluster_method=’HDBSCAN’.
axis_label_fontsize (int, optional) – Font size for axis labels.
tick_fontsize (int, optional) – Font size for axis tick labels.
title_fontsize (int, optional) – Font size for row titles.
legend_fontsize (int, optional) – Font size for legend and text annotations.
- Returns
Displays matplotlib figures with UMAP projections and clustering overlays if enabled.
- Return type
None
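Since n_neighbors_list and min_dist_list each hold one value per plot column, the sketch below assumes they are paired element-wise, so column i uses the i-th entry of both lists. That pairing and the equal-length check are assumptions drawn from the parameter descriptions, not from the function's source.

```python
n_neighbors_list = [15, 15, 15]
min_dist_list = [0.1, 0.3, 1]

# Assumption: the two lists must line up one-to-one with the plot columns.
if len(n_neighbors_list) != len(min_dist_list):
    raise ValueError("n_neighbors_list and min_dist_list must have equal length")

column_configs = [
    {"column": i, "n_neighbors": n, "min_dist": d}
    for i, (n, d) in enumerate(zip(n_neighbors_list, min_dist_list))
]
for cfg in column_configs:
    print(cfg)
```

With the default lists this yields three columns that share n_neighbors=15 and differ only in min_dist, which isolates the effect of min_dist on how tightly points are packed.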
- analysis.Dimreduce.investigate_tsne(table_path, data_keys, perplexities, learning_rates, label_column='max_gauss', labels_name=None, labels_color_map=None, axis_label_fontsize=14, tick_fontsize=12, legend_fontsize=10, title_fontsize=14)[source]¶
Visualizes t-SNE dimensionality reduction results across multiple configurations.
- Parameters
table_path (str) – Path to the FITS file containing the data table.
data_keys (list of str) – Column names to use as input features for dimensionality reduction.
perplexities (list of int) – List of perplexity values for each t-SNE configuration.
learning_rates (list of float) – List of learning rate values for each t-SNE configuration.
label_column (str, optional) – Column name representing true GMM cluster labels. Default is ‘max_gauss’.
labels_name (dict, optional) – Mapping from numeric GMM component indices to descriptive cluster names.
labels_color_map (dict, optional) – Mapping from descriptive cluster names to color codes.
axis_label_fontsize (int, optional) – Font size for axis labels.
tick_fontsize (int, optional) – Font size for axis ticks.
legend_fontsize (int, optional) – Font size for the legend.
title_fontsize (int, optional) – Font size for plot titles.
- Return type
None
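A configuration sweep of this kind can be sketched with scikit-learn's `TSNE`, as below. The element-wise pairing of perplexities and learning_rates is an assumption (a full grid via `itertools.product` is the other plausible reading), and the random features stand in for the columns named in data_keys.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Stand-in feature matrix: 60 stars, 4 features.
features = rng.normal(size=(60, 4))

perplexities = [5, 20]          # each must be smaller than the number of samples
learning_rates = [100.0, 200.0]

embeddings = {}
# Assumption: configuration i uses perplexities[i] with learning_rates[i].
for perplexity, learning_rate in zip(perplexities, learning_rates):
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=learning_rate,
        random_state=0,
    )
    embeddings[(perplexity, learning_rate)] = tsne.fit_transform(features)

for config, embedding in embeddings.items():
    print(config, embedding.shape)
```

Each embedding is an (n_stars, 2) array that can be scattered and coloured by label_column to compare configurations side by side.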