clustering Package

clustering Module
SH: 2May2013; bdb: lots of mods. Testing: pyfusion/examples/clusterDA.py. Lots of duplication - needs rationalisation.
pyfusion.clustering.clustering.EM_GMM_calc_best_fit(instance_array, weights)
    Calculates approximate MLE parameters for the mean and kappa of the von Mises distribution. Can use a lookup table for the two Bessel functions, or a scipy optimiser if lookup=None.
    SH: 23May2013
pyfusion.clustering.clustering.EM_GMM_clustering(instance_array, n_clusters=9, sin_cos=0, number_of_starts=10, show_covariances=0, clim=None, covariance_type='diag')
class pyfusion.clustering.clustering.EM_GMM_clustering_class(instance_array, n_clusters=9, n_iterations=20, n_cpus=1, start='random', kappa_calc='approx', hard_assignments=0, kappa_converged=0.1, mu_converged=0.01, min_iterations=10, LL_converged=0.0001, verbose=0, seed=None)
    Expectation maximisation using von Mises distributions with soft cluster assignments.
    instance_array : the input phases
    n_clusters : number of clusters to aim for
    n_iterations : number of iterations before giving up
    n_cpus : currently not implemented
    start : how to start the clusters off - 'k_means' is recommended
    kappa_calc : how to calculate kappa - 'approx', 'lookup_table' or 'optimize' (lookup table or optimiser for the Bessel functions)
    SH: 23May2013
pyfusion.clustering.clustering.EM_GMM_clustering_wrapper(instance_array, n_clusters=9, n_iterations=20, n_cpus=1, start='random', kappa_calc='approx', hard_assignments=0, kappa_converged=0.1, mu_converged=0.01, min_iterations=10, LL_converged=0.0001, verbose=0, number_of_starts=1)
class pyfusion.clustering.clustering.EM_VMM_GMM_clustering_class(instance_array, instance_array_amps, n_clusters=9, n_iterations=20, n_cpus=1, start='random', kappa_calc='approx', hard_assignments=0, kappa_converged=0.1, mu_converged=0.01, min_iterations=10, LL_converged=0.0001, verbose=0, seed=None)
    Bases: pyfusion.clustering.clustering.clustering_object
    This model combines a mixture of Gaussian and von Mises distributions, to allow datamining of data that essentially consists of complex numbers (amplitude and phase), such as most Fourier-based measurements. It is intended as an improvement over using only the phases between channels - useful for complex modes such as HAE, and for data that is more amplitude-based, such as line-of-sight chords through the plasma from interferometers and imaging diagnostics.
    Note the amplitude data is included in misc_data_dict['mirnov_data'] from the stft-clustering extraction technique.
    TODO: need to figure out a way to normalise the amplitudes so that shapes of different amplitudes look the same, and plumb this in.
    SH: 15June2013
pyfusion.clustering.clustering.EM_VMM_GMM_clustering_wrapper(instance_array, instance_array_amps, n_clusters=9, n_iterations=20, n_cpus=1, start='random', kappa_calc='approx', hard_assignments=0, kappa_converged=0.1, mu_converged=0.01, min_iterations=10, LL_converged=0.0001, verbose=0, number_of_starts=1)
pyfusion.clustering.clustering.EM_VMM_calc_best_fit(z, N=None, lookup=None)
    Calculates approximate MLE parameters for the mean and kappa of the von Mises distribution. Can use a lookup table for the two Bessel functions, or a scipy optimiser if lookup=None.
    SH: 23May2013
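The approximate MLE here is standard for the von Mises distribution: the mean direction is the angle of the resultant vector, and kappa is recovered by inverting A(kappa) = I1(kappa)/I0(kappa). A minimal sketch of that calculation, using the Best & Fisher piecewise approximation (an illustration of the technique, not necessarily this function's exact code; the helper name is hypothetical):

    import numpy as np

    def vmm_mle_sketch(z):
        """Approximate von Mises MLE from complex phasors z = exp(1j*theta)."""
        rbar = np.abs(np.mean(z))   # mean resultant length
        mu = np.angle(np.mean(z))   # mean direction
        # Best & Fisher (1981) piecewise approximation for kappa
        if rbar < 0.53:
            kappa = 2 * rbar + rbar**3 + 5 * rbar**5 / 6.0
        elif rbar < 0.85:
            kappa = -0.4 + 1.39 * rbar + 0.43 / (1.0 - rbar)
        else:
            kappa = 1.0 / (rbar**3 - 4 * rbar**2 + 3 * rbar)
        return mu, kappa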
pyfusion.clustering.clustering.EM_VMM_calc_best_fit_optimise(z, lookup=None, N=None)
    Calculates approximate MLE parameters for the mean and kappa of the von Mises distribution. Can use a lookup table for the two Bessel functions, or a scipy optimiser if lookup=None.
    SH: 23May2013
pyfusion.clustering.clustering.EM_VMM_clustering(instance_array, n_clusters=9, n_iterations=20, n_cpus=1, start='random', comment='')
class pyfusion.clustering.clustering.EM_VMM_clustering_class(instance_array, n_clusters=9, n_iterations=20, n_cpus=1, start='random', kappa_calc='approx', hard_assignments=0, kappa_converged=0.1, mu_converged=0.01, min_iterations=10, LL_converged=0.0001, verbose=0, seed=None)
    Expectation maximisation using von Mises distributions with soft cluster assignments.
    instance_array : the input phases
    n_clusters : number of clusters to aim for
    n_iterations : number of iterations before giving up
    n_cpus : currently not implemented
    start : how to start the clusters off - 'k_means' is recommended
    kappa_calc : how to calculate kappa - 'approx', 'lookup_table' or 'optimize' (lookup table or optimiser for the Bessel functions)
    SH: 23May2013
pyfusion.clustering.clustering.EM_VMM_clustering_soft(instance_array, n_clusters=9, n_iterations=20, n_cpus=1, start='random', bessel_lookup_table=True)
    Expectation maximisation using von Mises distributions with soft cluster assignments.
    instance_array : the input phases
    n_clusters : number of clusters to aim for
    n_iterations : number of iterations before giving up
    n_cpus : currently not implemented
    start : how to start the clusters off - 'k_means' is recommended
    bessel_lookup_table : how to calculate kappa - can use a lookup table or optimiser
    SH: 23May2013
pyfusion.clustering.clustering.EM_VMM_clustering_wrapper(instance_array, n_clusters=9, n_iterations=20, n_cpus=1, start='random', kappa_calc='approx', hard_assignments=0, kappa_converged=0.1, mu_converged=0.01, min_iterations=10, LL_converged=0.0001, verbose=0, number_of_starts=1)
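A usage sketch for the wrapper, assuming instance_array is an (n_instances, n_dimensions) array of phase differences in radians (the shape is an assumption based on the docstrings above, and the data here are random placeholders):

    import numpy as np
    from pyfusion.clustering.clustering import EM_VMM_clustering_wrapper

    # hypothetical phase data: 1000 instances, 6 phase-difference dimensions
    instance_array = np.random.uniform(-np.pi, np.pi, (1000, 6))

    result = EM_VMM_clustering_wrapper(instance_array, n_clusters=9,
                                       n_iterations=20, start='k_means',
                                       kappa_calc='approx', number_of_starts=3)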
class pyfusion.clustering.clustering.clusterer_wrapper(feature_obj, method='k-means', **kwargs)
    Bases: pyfusion.clustering.clustering.clustering_object
    Wrapper around the EM_GMM_clustering function. A wrapper is used so that the function can be used outside of this architecture if needed.
    method : 'k-means', 'EM_GMM', 'k_means_periodic' or 'EM_VMM'
    Pass settings as kwargs; these are the default settings:
        'k_means': {'n_clusters': 9, 'sin_cos': 1, 'number_of_starts': 30, 'seed': 1, 'use_scikit': 1},
        'EM_GMM': {'n_clusters': 9, 'sin_cos': 1, 'number_of_starts': 30},
        'k_means_periodic': {'n_clusters': 9, 'number_of_starts': 10, 'n_cpus': 1, 'distance_calc': 'euclidean', 'convergence_diff_cutoff': 0.2, 'iterations': 40, 'decimal_roundoff': 2},
        'EM_VMM': {'n_clusters': 9, 'n_iterations': 20, 'n_cpus': 1, 'start': 'k_means', 'kappa_calc': 'approx', 'hard_assignments': 0, 'kappa_converged': 0.2, 'mu_converged': 0.01, 'LL_converged': 1.e-4, 'min_iterations': 10, 'verbose': 1},
        'EM_GMM2': {'n_clusters': 9, 'n_iterations': 20, 'n_cpus': 1, 'start': 'k_means', 'kappa_calc': 'approx', 'hard_assignments': 0, 'kappa_converged': 0.2, 'mu_converged': 0.01, 'LL_converged': 1.e-4, 'min_iterations': 10, 'verbose': 1}
    SH: 6May2013
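A minimal usage sketch, given those defaults (the feature_object data here are random placeholders; kwargs override the per-method defaults):

    import numpy as np
    from pyfusion.clustering.clustering import clusterer_wrapper, feature_object

    # hypothetical phase-difference data
    feature_obj = feature_object(instance_array=np.random.uniform(-np.pi, np.pi, (500, 6)),
                                 misc_data_dict={})

    clustered = clusterer_wrapper(feature_obj, method='EM_VMM',
                                  n_clusters=12, n_iterations=20)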
class pyfusion.clustering.clustering.clustering_object
    Generic clustering object, with the following attributes:
    instance_array : array of phase differences
    SH: 6May2013
    make_mode_list(min_kappa=4, plot=False)
        Return a mode_list (such as used in new_mode_identify_script) from a cluster_object. PS: this is another reason to make new_mode_identify_script a class.
    plot_VM_distributions()
        Plot the von Mises distributions for each dimension for each cluster. Also plot the histograms - these are overlaid with dashed lines.
        SH: 9May2013
    plot_clusters_amp_lines(decimation=1, linewidth=0.05)
        Plot all the phase lines for the clusters. Good clusters show up as dense areas of lines.
        SH: 9May2013
    plot_clusters_phase_lines(decimation=4000, linewidth=0.05, colours=['r', 'k', 'b', 'y', 'm', 'g', 'c', 'orange', 'purple', 'lightgreen', 'lightgray'], xlabel_loc=0.5, ylabel_loc=3.2, yline=0, xlabel='')
        Plot all the phase lines for the clusters. Good clusters show up as dense areas of lines. If decimation > 2000, it is interpreted as the number of points desired. Set yline to draw a constant-y reference line (or None).
        SH: 9May2013
    plot_dimension_histograms(pub_fig=0, filename='plot_dim_hist.pdf', specific_dimensions=None, extra_txt_labels='', label_loc=[-2, 1.5], ylim=None)
        For each dimension in the data set, plot the histogram of the phase differences. Overlay the von Mises mixture model along with the individual von Mises distributions from each cluster.
        SH: 9May2013
    plot_dimension_histograms_amps(pub_fig=0, filename='plot_dim_hist.pdf', specific_dimensions=None)
        For each dimension in the data set, plot the histogram of the phase differences. Overlay the von Mises mixture model along with the individual von Mises distributions from each cluster.
        SH: 9May2013
    plot_fft_amp_lines(decimation=1)
        Plot all the phase lines for the clusters. Good clusters show up as dense areas of lines.
        SH: 9May2013
    plot_interferometer_channels(interferometer_spacing=0.025, interferometer_start=0, include_both_sides=1, plot_phases=0)
        Plot kh vs frequency for each cluster - i.e. looking for whale tails. The colouring of the points is based on the total phase along the array, i.e. a 1D indication of the clusters.
        SH: 9May2013
    plot_kh_freq_all_clusters(color_by_cumul_phase=1)
        Plot kh vs frequency for each cluster - i.e. looking for whale tails. The colouring of the points is based on the total phase along the array, i.e. a 1D indication of the clusters.
        SH: 9May2013
    plot_phase_vs_phase(pub_fig=0, filename='phase_vs_phase.pdf', compare_dimensions=None, kappa_ave_cutoff=0, plot_means=0, alpha=0.05, decimation=1, limit=None, xlabel_loc=3, ylabel_loc=3, colours=None)
        SH: 9May2013
    plot_single_kh(cluster_list=None, kappa_cutoff=None, color_by_cumul_phase=1, sqrtne=None, plot_alfven_lines=1, xlim=None, ylim=None, pub_fig=0, filename=None, marker_size=100)
        Plot kh vs frequency for each cluster - i.e. looking for whale tails. The colouring of the points is based on the total phase along the array, i.e. a 1D indication of the clusters.
        A cluster_list can be provided to select which clusters are plotted, or a kappa_cutoff can be given (which takes precedence); otherwise, all are plotted.
        SH: 9May2013
pyfusion.clustering.clustering.compare_several_clusters(clusters, pub_fig=0, alpha=0.05, decimation=10, labels=None, filename='hello.pdf', kappa_ref_cutoff=0, plot_indices=[0, 1], colours=None, markers=None)
    clusters is a list of cluster objects. Prints a comparison between sets of clusters, to compare datamining methods.
    SH: 7May2013
pyfusion.clustering.clustering.compare_several_kappa_values(clusters, pub_fig=0, alpha=0.05, decimation=10, labels=None, plot_style_list=None, filename='extraction_comparison.pdf', xaxis='sigma_eq', max_cutoff_value=35, vline=None)
    xaxis can be 'sigma_eq', 'kappa_bar' or 'sigma_bar'.
pyfusion.clustering.clustering.compare_two_cluster_results(cluster1, cluster2)
    Prints a comparison between two sets of clusters, to compare datamining methods.
    SH: 7May2013
pyfusion.clustering.clustering.convert_DA_file(filename, correspondence='indx, serial t_mid, time amp, RMS freq, freq p, p a12, a12, shot, shot k_h, kh, ne_1, ne1, ne_2, ne2 ne_3, ne3 ne_4, ne4 ne_5, ne5 ne_6, ne6 ne_7, ne7 b_0, b_0 p_rf, p_rf', debug=1, limit=None, Kilohertz=1, load_all=False, keysel=None)
    Converts a DA_datamining file to a form compatible with this package. Returns (instance_array, misc_data) with names converted according to correspondence, input as pairs separated by spaces.
    limit=10000 selects ~10000 samples randomly (-10000 for a repeatable selection). load_all=True will load all misc data, even entries not in the correspondence list.
    Reverse conversion (Shaun's format to DA) features are built into DA_datamining, e.g.:
        fo = co23.feature_obj
        dd = dict(phases=fo.instance_array)
        dd.update(fo.misc_data_dict)  # should check that the dimensions agree
        DA23 = DA(dd)
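A usage sketch for the forward conversion (the filename is hypothetical; the correspondence string is the documented default):

    from pyfusion.clustering.clustering import convert_DA_file

    # hypothetical DA_datamining file
    instance_array, misc_data = convert_DA_file('DA_demo.npz', limit=10000)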
pyfusion.clustering.clustering.convert_kappa_std(kappa, deg=True)
    Converts kappa from the von Mises distribution into a standard deviation that can be used to generate a similar normal distribution.
    SH: 14June2013
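The usual route for this conversion is through the mean resultant length R = I1(kappa)/I0(kappa) and the circular standard deviation sqrt(-2 ln R). A sketch of that standard formula (this is the textbook relation, not necessarily this function's exact code; the helper name is hypothetical):

    import numpy as np
    from scipy.special import iv  # modified Bessel functions of the first kind

    def kappa_to_std_sketch(kappa, deg=True):
        R = iv(1, kappa) / iv(0, kappa)   # mean resultant length A(kappa)
        std = np.sqrt(-2.0 * np.log(R))   # circular standard deviation, radians
        return np.degrees(std) if deg else std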
-
class
pyfusion.clustering.clustering.
feature_object
(instance_array=None, misc_data_dict=None, filename=None)[source]¶ This is suposed to be the feature object SH : 6May2013
feature_object... this contains all of the raw data that is a result of feature extraction. It can be initialised by passing an instance_array and misc_data_dict dictionary, or alternatively, the filename of a pickle file that was saved with feature_object.dump_data()
    cluster(**kwargs)
        This method performs clustering using one of the following methods: k-means (scikit-learn), expectation maximisation using a Gaussian mixture model, EM_GMM (scikit-learn), k_means_periodic (SH implementation), or expectation maximisation using a von Mises mixture model, EM_VMM (SH implementation).
        **kwargs:
        method : determines which clustering algorithm to use - can be 'k-means', 'EM_GMM', 'k_means_periodic' or 'EM_VMM'
        Other kwargs override the following default settings for each clustering algorithm:
            'k_means': {'n_clusters': 9, 'sin_cos': 1, 'number_of_starts': 30},
            'EM_GMM': {'n_clusters': 9, 'sin_cos': 1, 'number_of_starts': 30},
            'k_means_periodic': {'n_clusters': 9, 'number_of_starts': 10, 'n_cpus': 1, 'distance_calc': 'euclidean', 'convergence_diff_cutoff': 0.2, 'iterations': 40, 'decimal_roundoff': 2},
            'EM_VMM': {'n_clusters': 9, 'n_iterations': 20, 'n_cpus': 1}
        Returns a cluster object, which is also appended to the self.clustered_objects list.
        SH: 6May2013
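Putting the pieces together, an end-to-end sketch (the array shape and the misc_data_dict contents are placeholder assumptions):

    import numpy as np
    from pyfusion.clustering.clustering import feature_object

    phases = np.random.uniform(-np.pi, np.pi, (5000, 6))  # hypothetical phase differences
    misc = {'shot': np.arange(5000)}                      # hypothetical misc data

    fo = feature_object(instance_array=phases, misc_data_dict=misc)
    co = fo.cluster(method='EM_VMM', n_clusters=9, n_iterations=20)
    # co is also appended to fo.clustered_objects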
pyfusion.clustering.clustering.generate_artificial_data(n_clusters, n_dimensions, n_instances, prob=None, method='vonMises', means=None, variances=None, random_means_bounds=[-3.141592653589793, 3.141592653589793], random_var_bounds=[0.05, 5])
    Generate a dummy data set.
    n_clusters : number of separate clusters
    n_dimensions : how many separate phase signals per instance
    n_instances : number of instances - note this might be changed slightly depending on the probabilities
    kwargs:
    prob : 1D array, length n_clusters, probability of each cluster (note: must add to one). If None, all clusters have equal probability.
    method : distribution to draw points from - 'vonMises' or 'Gaussian'
    means, variances : arrays (n_clusters x n_dimensions) of the means and variances (kappa for vonMises) for the distributions. If these are given, n_clusters and n_dimensions are ignored. If only means or only variances are given, the missing one is given random values. Note for vonMises, 1/var is used, as this is approximately kappa.
    SH: 14May2013
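A usage sketch for building a test set (the return value is not documented above, so it is left unpacked here):

    from pyfusion.clustering.clustering import generate_artificial_data

    # 4 von Mises clusters in 6 phase dimensions, ~2000 instances
    data = generate_artificial_data(4, 6, 2000, method='vonMises')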
pyfusion.clustering.clustering.k_means_clustering(instance_array, n_clusters=9, sin_cos=1, number_of_starts=30, seed=None, use_scikit=1, **kwargs)
    Runs the k-means clustering algorithm as implemented in scipy - change to scikit-learn?
    SH: 7May2013
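The sin_cos option presumably works around the 2*pi periodicity of phase data: each angle is mapped to its (cos, sin) pair so that ordinary Euclidean k-means can be applied. A minimal sketch of that embedding (an assumption based on the parameter name, not confirmed from the source):

    import numpy as np

    def sin_cos_embed(phases):
        """Map an (n, d) array of angles to (n, 2d) Euclidean features."""
        return np.hstack((np.cos(phases), np.sin(phases)))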
pyfusion.clustering.clustering.k_means_periodic(instance_array, n_clusters=9, number_of_starts=10, n_cpus=1, distance_calc='euclidean', convergence_diff_cutoff=0.2, n_iterations=40, decimal_roundoff=2, seed_list=None, **kwargs)
pyfusion.clustering.clustering.make_grid_subplots(n_subplots, sharex='all', sharey='all')
    Helper function that generates many subplots on a regular grid.
    SH: 23May2013
pyfusion.clustering.clustering.modtwopi(x, offset=3.141592653589793)
    Return an angle in the range offset +- pi:
        >>> print("{0:.3f}".format(modtwopi(7, offset=3.14)))
        0.717
    This simple strategy works when the number is near zero +- 2*N*pi, which is true for calculating the deviation from the cluster centre. Does not attempt to make jumps small (use fix2pi_skips for that).
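A minimal sketch consistent with that docstring and doctest (7 - 2*pi is approximately 0.717); this may differ from the actual implementation:

    import numpy as np

    def modtwopi_sketch(x, offset=np.pi):
        """Map x into the 2*pi-wide window centred on offset."""
        lo = offset - np.pi
        return np.remainder(np.asarray(x) - lo, 2 * np.pi) + lo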
extract_features_scans Module
pyfusion.clustering.extract_features_scans.extract_data_by_picking_peaks(current_shot, array_names, NFFT=1024, hop=256, n_pts=20, lower_freq=1500, ax=None, time_window=[0.004, 0.09])
pyfusion.clustering.extract_features_scans.filter_by_kappa_cutoff(z, ave_kappa_cutoff=25, ax=None, prob_cutoff=None, cutoff_by='sigma_eq')
pyfusion.clustering.extract_features_scans.find_peaks(data_fft, n_pts=20, lower_freq=1500, by_average=True, moving_ave=5, peak_cutoff=0)
pyfusion.clustering.extract_features_scans.get_array_data(current_shot, array_name, time_window=None, new_timebase=None)
pyfusion.clustering.extract_features_scans.multi_stft(shot_selection, array_names, n_cpus=1, NFFT=2048, perform_datamining=1, overlap=4, n_pts=20, lower_freq=1500, filter_cutoff=20, cutoff_by='sigma_eq')
pyfusion.clustering.extract_features_scans.multi_svd(shot_selection, array_name, other_arrays=None, other_array_labels=None, meta_data=None, n_cpus=8, NFFT=2048, power_cutoff=0.05, min_svs=2, overlap=4)
    Runs through all the shots in shot_selection. other_arrays is a list of the other arrays you want to get information from.
pyfusion.clustering.extract_features_scans.perform_data_datamining(mirnov_angles, misc_data_dict, n_clusters=16, n_iterations=60)
pyfusion.clustering.extract_features_scans.return_non_freq_dependent(tmp_array, good_indices, force_index=None)
pyfusion.clustering.extract_features_scans.return_values(tmp_array, good_indices, force_index=None)
pyfusion.clustering.extract_features_scans.single_shot(current_shot, array_names, NFFT, hop, n_pts, lower_freq, ax, start_time, end_time, perform_datamining, ave_kappa_cutoff, cutoff_by)
pyfusion.clustering.extract_features_scans.single_shot_fluc_strucs(shot=None, array=None, other_arrays=None, other_array_labels=None, start_time=0.001, end_time=0.08, samples=1024, power_cutoff=0.1, n_svs=2, overlap=4, meta_data=None)
    Extracts all the important information from a flucstruc and puts it into the form that is useful for clustering with the clustering module.
    SH: 8June2013
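A usage sketch (the shot number and array name are hypothetical placeholders):

    from pyfusion.clustering.extract_features_scans import single_shot_fluc_strucs

    res = single_shot_fluc_strucs(shot=58123, array='H1_mirnov_array',
                                  start_time=0.001, end_time=0.08,
                                  samples=1024, power_cutoff=0.1, n_svs=2)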
modes Module
class pyfusion.clustering.modes.Mode(name, N, NN, cc, csd, id=None, threshold=None, shot_list=[], MP2010_trick=False)
    hist(phases, first_std, NDim=None, n_bins=20, n_iters=10, histtype='bar', linewidth=None, equal_bins=False)
        This histogram can bin non-uniformly, so that a uniform random distribution will have a uniform number of counts (equal_bins=False).
        histtype='stepfilled' doesn't work; use linewidth=0 instead (now the default for non-equal bins).
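One common way to build such non-uniform, equal-count bins is to place the edges at the quantiles of a reference sample; a sketch of that idea (an illustration of the technique, not this method's exact code):

    import numpy as np

    def equal_count_bin_edges(data, n_bins=20):
        """Bin edges at the sample quantiles -> roughly equal counts per bin."""
        return np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))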
    old_std(phase_array, mask=None)
        Return the standard deviation normalised to the cluster sds; a point right on the edge of each sd would return 1. Need to include a mask to allow dead probes to be ignored.
        The following should return [1., 0.5]:
            ml[0].std(array([ml[0].cc + ml[0].csd, ml[0].cc + 0.5*ml[0].csd]))
        mask selects the channels, but only for the modes - the phases data is already selected.
    one_rms(phases)
        Return the standard deviation normalised to the cluster sds; a point right on the edge of each sd would return 1.
    plot(axes=None, label=None, csel=None, color=None, suptitle=None, **kwargs)
        Plot a mode, showing its SD as error bars.
    std(phase_array, csel=None, mask=None)
        Return the standard deviation normalised to the cluster sds; a point right on the edge of each sd would return 1. mask is replaced by the mask matrix to allow dead probes to be ignored or combined with others.
        The following should return [1., 0.5]:
            ml[0].std(array([ml[0].cc + ml[0].csd, ml[0].cc + 0.5*ml[0].csd]))
        mask selects the channels, but only for the modes - the phases data is already selected.
    store(phases, dd, inds, csel=None, mask=None, threshold=None, Nval=None, NNval=None, shot_list=None, quiet=0)
        Store coarse and fine mode (N, NN) numbers according to a threshold, a selection of channels (mask) for std, and an optional shot_list. If shot_list is None, the internal shot_list is used, which would have been set or defaulted to [] at __init__.
        mask selects the probes - if None, select all.
modes2 Module

modes_old Module
class pyfusion.clustering.modes_old.Mode(name, N, NN, cc, csd, id=None, threshold=None, shot_list=[], MP2010_trick=False)
    hist(phases, first_std, NDim=None, n_bins=20, n_iters=10, histtype='bar', linewidth=None, equal_bins=False)
        This histogram can bin non-uniformly, so that a uniform random distribution will have a uniform number of counts (equal_bins=False).
        histtype='stepfilled' doesn't work; use linewidth=0 instead (now the default for non-equal bins).
    one_rms(phases)
        Return the standard deviation normalised to the cluster sds; a point right on the edge of each sd would return 1.
    std(phase_array)
        Return the standard deviation normalised to the cluster sds; a point right on the edge of each sd would return 1. Need to include a mask to allow dead probes to be ignored.
        The following should return [1., 0.5]:
            ml[0].std(array([ml[0].cc + ml[0].csd, ml[0].cc + 0.5*ml[0].csd]))
    std_masked(phase_array, mask=None)
        Return the standard deviation normalised to the cluster sds; a point right on the edge of each sd would return 1. Includes a mask to allow dead probes to be ignored.
    store(dd, sel=None, threshold=None, Nval=None, NNval=None, shot_list=None, quiet=0, mask=None)
        Store coarse and fine mode (N, NN) numbers according to a threshold, a selection of channels for std, and an optional shot_list. If shot_list is None, the internal shot_list is used, which would have been set or defaulted to [] at __init__.
        sel selects the probes - if None, select all.