darfix.core.dataset.Dataset#

class darfix.core.dataset.Dataset(_dir, data=None, raw_folder=None, first_filename=None, filenames=None, dims=None, transformation=None, in_memory=True, copy_files=False, isH5=False, title=None, metadata_url=None)[source]#

Bases: object

Class to define a dataset from a series of data files. The aim of this class is to make life easier for the darfix user: most darfix operations need all the data to be computed, and a wrapper like Dataset makes it possible to execute the operations on that same data without having to reload it at each step. Operations on the source data (e.g. blind source separation, hot pixel removal, shift correction) return a new Dataset with the modified images instead of altering the current one, so operations can be chained together to define a computational workflow. It is nevertheless a complex class, for the following reasons:

• The data can be loaded in memory or kept on disk. When the data is on disk, the core operations are called on chunks of the data to reduce the memory footprint.

• A Dataset allows the user to use only part of the data for certain operations, which makes their processing much faster. This is done using the indices and bg_indices attributes.

• For certain operations, and in certain cases, a running operation can be stopped. This is done with the attributes running_data and state_of_operations, which contain the Data object being modified and a list of the operations to stop.

• The dims attribute is a dictionary containing the dimensions that shape the data. Several operations can be applied to only part of the data, depending on the shape of the dimensions.

Parameters:
  • _dir (str) – Global directory to use and save all the data in the different operations.

  • data (Data, optional) – If not None, sets the Data array with the data of the dataset to use, default None

  • raw_folder (Union[None,str], optional) – Path to the folder that contains the data, defaults to None

  • filenames (Union[Generator,Iterator,List], optional) – Ordered list of filenames, defaults to None (expected for EDF files).

  • dims (AcquisitionDims, optional) – Dimensions dictionary

  • transformation (ndarray, optional) – Axes to use when displaying the images

  • in_memory (bool, optional) – If True, data is loaded into memory; else, data is read in chunks depending on the algorithm to apply, defaults to True.

  • copy_files (bool, optional) – If True, creates a new treated data folder and doesn’t replace the directory files.

  • isH5 (bool) – True if the data is contained in an HDF5 file

  • metadata_url (Union[str, DataUrl, None]) – Optional url to the metadata (in case of non-EDF acquisition)

  • title (Optional[str]) –
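
Example: a minimal construction sketch. The directory and file names below are hypothetical; with in_memory=True the whole stack is loaded at once.

    from darfix.core.dataset import Dataset

    # Hypothetical EDF series and working directory.
    filenames = ["/data/scan/frame_{:04d}.edf".format(i) for i in range(100)]

    dataset = Dataset(
        _dir="/data/scan/treated",  # directory where operations save their results
        filenames=filenames,        # ordered list of data files
        in_memory=True,             # load the data into memory
        copy_files=True,            # keep the original directory files untouched
    )
    print(dataset.nframes)          # number of frames in the stack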

add_dim(axis, dim)[source]#

Adds a dimension to the dimension’s dictionary.

Parameters:
  • axis (int) – axis of the dimension.

  • dim (Dimension) – dimension to be added.

apply_background_subtraction(background=None, method='median', indices=None, step=None, chunk_shape=[100, 100], _dir=None)[source]#

Applies background subtraction to the data and saves the new data into disk.

Parameters:
  • background (Union[None, array_like, Dataset]) – Data to be used as background. If None, the data at indices is used. If a Dataset, the data of that dataset is used. If an array, the data at the indices in the array is used.

  • method (Method) – Method to use to compute the background.

  • indices (Union[None, array_like]) – Indices of the images to apply background subtraction. If None, the background subtraction is applied to all the data.

  • step (int) – Distance between images used when computing the median. Only used when the in_memory flag is False and method is Method.median. If step is not None, every step-th image, starting at 0, is loaded into memory to compute the median; if loading the data raises a MemoryError, the median computation is retried with step += 1.

  • chunk_shape (array_like) – Shape of the chunk image to use per iteration. Parameter used only when flag in_memory is False and method is Method.median.

Returns:

dataset with data of same size as self.data but with the modified images. The urls of the modified images are replaced with the new urls.

Return type:

Dataset
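
Example: since the operation returns a new Dataset, operations can be chained. A sketch, using the dataset from the construction example above and assuming (hypothetically) that the first ten frames are dark frames:

    import numpy

    # Use frames 0..9 as background (hypothetical dark frames).
    corrected = dataset.apply_background_subtraction(
        background=numpy.arange(10),
        method="median",
    )
    # Chain another operation on the returned Dataset.
    cleaned = corrected.apply_hot_pixel_removal(kernel=3)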

apply_binning(scale, _dir=None)[source]#
apply_fit(indices=None, int_thresh=None, chunk_shape=[100, 100], _dir=None)[source]#

Fits the data around axis 0 and saves the new data into disk.

Parameters:
  • indices (Union[None, array_like]) – Indices of the images to fit. If None, the fit is done to all the data.

  • int_thresh (Union[None, float]) – see mapping.fit_pixel

  • chunk_shape (array_like) – Shape of the chunk image to use per iteration. Parameter used only when flag in_memory is False.

Returns:

dataset with data of same size as self.data but with the modified images. The urls of the modified images are replaced with the new urls.

Return type:

Dataset

apply_hot_pixel_removal(kernel=3, indices=None, _dir=None)[source]#

Applies hot pixel removal to Data, and saves the new data into disk.

Parameters:
  • kernel (int) – size of the kernel used to find the hot pixels.

  • indices (Union[None, array_like]) – Indices of the images to apply hot pixel removal to. If None, the hot pixel removal is applied to all the data.

Returns:

dataset with data of same size as self.data but with the modified images. The urls of the modified images are replaced with the new urls.

Return type:

Dataset

apply_mask_removal(mask, indices=None, _dir=None)[source]#

Applies mask to Data, and saves the new data into disk.

Parameters:
  • mask (ndarray) – Mask to apply, with 0’s at the pixels to be masked.

  • indices (Union[None, array_like]) – Indices of the images to apply the mask to. If None, the mask is applied to all the data.

Returns:

dataset with data of same size as self.data but with the modified images. The urls of the modified images are replaced with the new urls.

Return type:

Dataset

apply_moments(indices=None, chunk_shape=[500, 500])[source]#

Compute the COM, FWHM, skewness and kurtosis of the data for every dimension.

Parameters:
  • indices (Union[None, array_like], optional) – If not None, apply method only to indices of data, defaults to None

  • chunk_shape (array_like, optional) – Shape of the chunk image to use per iteration. Parameter used only when flag in_memory is False.

apply_roi(origin=None, size=None, center=None, indices=None, roi_dir=None)[source]#

Applies a region of interest to the data.

Parameters:
  • origin (Union[2d vector, None]) – Origin of the roi

  • size (Union[2d vector, None]) – [Height, Width] of the roi.

  • center (Union[2d vector, None]) – Center of the roi

  • indices (Union[None, array_like]) – Indices of the images to apply the roi to. If None, the roi is applied to all the data.

  • roi_dir (str) – Directory path for the new dataset

Returns:

dataset with the roi applied to its data. Note: to preserve consistency of shape between images, if indices is not None, only the modified data is returned.

Return type:

Dataset
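
Example: a cropping sketch with hypothetical pixel coordinates.

    # Crop a [height, width] = [100, 200] region with top-left corner at (50, 60).
    roi_dataset = dataset.apply_roi(origin=[50, 60], size=[100, 200])

    # The same region can instead be described by its center.
    centered = dataset.apply_roi(center=[100, 160], size=[100, 200])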

apply_shift(shift, dimension=None, shift_approach='fft', indices=None, callback=None, _dir=None)[source]#

Apply shift of the data or part of it and save new data into disk.

Parameters:
  • shift (array_like) – Shift per frame.

  • dimension (Union[None, tuple, array_like]) – Parameters with the position of the data in the reshaped array.

  • shift_approach (Union['fft', 'linear']) – Method to use to apply the shift.

  • indices (Union[None, array_like]) – Boolean index list with True for the images to apply the shift to. If None, the shift is applied to all the data.

  • callback (Union[function, None]) – Callback

Returns:

dataset with data of same size as self.data but with the modified images. The urls of the modified images are replaced with the new urls.

apply_shift_along_dimension(shift, dimension, shift_approach='fft', indices=None, callback=None, _dir=None)[source]#
apply_threshold_removal(bottom=None, top=None, indices=None, _dir=None)[source]#

Applies bottom and/or top thresholds to Data, and saves the new data into disk.

Parameters:
  • bottom (int) – bottom threshold to apply.

  • top (int) – top threshold to apply.

  • indices (Union[None, array_like]) – Indices of the images to apply the threshold to. If None, the threshold is applied to all the data.

Returns:

dataset with data of same size as self.data but with the modified images. The urls of the modified images are replaced with the new urls.

Return type:

Dataset

clear_dims()[source]#
compute_frames_intensity(kernel=(3, 3), sigma=20)[source]#

Returns for every image a number representing its intensity. This number is obtained by first blurring the image and then computing its variance.

compute_mosaicity_colorkey(dimensions=[0, 1], scale=100, indices=None, third_motor=None)[source]#

Computes a mosaicity colorkey from the dimensions, and returns also the orientation distribution image.

compute_rsm(Q, a, map_range, pixel_size, units, n, map_shape, energy=17, transformation=None)[source]#
compute_transformation(d, kind='magnification', rotate=False, topography=[False, 0], center=True)[source]#

Computes transformation matrix. Depending on the kind of transformation, computes either RSM or magnification axes to be used on future widgets.

Parameters:
  • d (float) – Size of the pixel

  • kind (str) – Transformation to apply, either ‘magnification’ or ‘rsm’

  • rotate (bool) – To be used only with kind=’rsm’, if True the images with transformation are rotated 90 degrees.

  • topography (array_like) – To be used only with kind=’magnification’; if the first element is True, obpitch values are divided by their sine.
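
Example: a sketch with a hypothetical pixel size, assuming the computed axes are then exposed through the transformation property.

    # Hypothetical pixel size d, in the units expected by the rest of the pipeline.
    dataset.compute_transformation(d=0.62, kind="magnification")
    axes = dataset.transformation  # axes to use when displaying the images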

property data#
property dims#
property dir#
find_and_apply_shift(dimension=None, steps=100, shift_approach='fft', indices=None, callback=None)[source]#

Find the shift of the data or part of it and apply it.

Parameters:
  • dimension (Union[None, tuple, array_like]) – Parameters with the position of the data in the reshaped array.

  • steps (int) – See core.imageRegistration.shift_detection

  • shift_approach (Union['fft', 'linear']) – Method to use to apply the shift.

  • indices (Union[None, array_like]) – Indices of the images to find and apply the shift to. If None, the shift is found and applied to all the data.

  • callback (Union[function, None]) – Callback

Returns:

Dataset with the new data.

find_dimensions(kind, tolerance=1e-09)[source]#

Goes over all the headers from a given kind and finds the dimensions that move (have more than one value) along the data.

Note: previously, only the dimensions that could fit were shown; now all the dimensions are shown and the user chooses the valid ones.

Parameters:
  • kind (int) – Type of metadata in which to find the dimensions.

  • tolerance (float) – Tolerance that will be used to compute the unique values.

Return type:

None
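
Example: a sketch of detecting the moving dimensions from the metadata. The metadata-kind constant below is a placeholder; use the integer code matching your acquisition metadata.

    POSITIONER_KIND = 0  # assumption: integer code of the positioner metadata kind

    dataset.find_dimensions(kind=POSITIONER_KIND, tolerance=1e-9)
    print(dataset.dims)  # dictionary of the dimensions found to move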

find_shift(dimension=None, steps=50, indices=None)[source]#

Find shift of the data or part of it.

Parameters:
  • dimension (Union[None, tuple, array_like]) – Parameters with the position of the data in the reshaped array.

  • steps (int) – See core.imageRegistration.shift_detection

  • indices (Union[None, array_like]) – Boolean index list with True for the images to find the shift of. If None, the shift is found for all the data.

Returns:

Array with shift per frame.
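
Example: find_shift and apply_shift are typically used as a pair; a sketch, using the dataset from the construction example above:

    shift = dataset.find_shift(steps=50)  # array with the shift per frame
    shifted = dataset.apply_shift(shift, shift_approach="fft")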

find_shift_along_dimension(dimension, steps=50, indices=None)[source]#
get_data(indices=None, dimension=None, return_indices=False)[source]#

Returns the data corresponding to certain indices and, optionally, to given dimension values. The data is always flattened to be a stack of images.

Parameters:
  • indices (array_like) – If not None, data is filtered using this array.

  • dimension (array_like) – If not None, return only the data corresponding to the given dimension. Dimension is a 2d vector: the first component is a list of axes and the second a list of the indices of the values to extract. The lists can have a length of up to the number of dimensions - 1. The call get_data(dimension=[[1, 2], [2, 3]]) is equivalent to data[:, 2, 3] when the data is in memory. Axes are ordered so that the lower the axis, the faster the dimension changes.

Returns:

Array with the new data.
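
Example: a worked sketch of the dimension argument, assuming a dataset reshaped to more than two dimensions.

    # The whole stack, flattened to a stack of images.
    all_frames = dataset.get_data()

    # Fix the dimension on axis 1 at its value of index 2 and the dimension on
    # axis 2 at its value of index 3; with the data in memory this is
    # equivalent to data[:, 2, 3].
    subset = dataset.get_data(dimension=[[1, 2], [2, 3]])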

get_dimensions_values(indices=None)[source]#

Returns all the metadata values of the dimensions. The values are assumed to be numbers.

Returns:

array_like

get_metadata_values(kind, key, indices=None, dimension=None)[source]#
Return type:

ndarray

property in_memory#
property is_h5#
property nframes#

Returns the number of frames.

nica(num_components, chunksize=None, num_iter=500, error_step=None, indices=None)[source]#

Compute Non-negative Independent Component Analysis on the data. If not already done, the method first converts the data into an HDF5 file object with the images flattened along the rows.

Parameters:
  • num_components (Union[None, int]) – Number of components to find

  • chunksize (Union[None, int], optional) – Number of chunks for which the whitening must be computed, incrementally, defaults to None

  • num_iter (int, optional) – Number of iterations, defaults to 500

  • error_step (Union[None, int], optional) – If not None, computes the error every error_step iterations and compares it to check for convergence. TODO: not feasible for huge datasets.

  • indices (Union[None, array_like], optional) – If not None, apply method only to indices of data, defaults to None

Returns:

(H, W): The components matrix and the mixing matrix.

nica_nmf(num_components, chunksize=None, num_iter=500, waterfall=None, error_step=None, vstep=100, hstep=1000, indices=None)[source]#

Applies both NICA and NMF to the data. The init H and W for NMF are the result of NICA.

nmf(num_components, num_iter=100, error_step=None, waterfall=None, H=None, W=None, vstep=100, hstep=1000, indices=None, init=None)[source]#

Compute Non-negative Matrix Factorization on the data. If not already done, the method first converts the data into an HDF5 file object with the images flattened along the rows.

Parameters:
  • num_components (Union[None, int]) – Number of components to find

  • num_iter (int, optional) – Number of iterations, defaults to 100

  • error_step (Union[None, int], optional) – If not None, computes the error every error_step iterations and compares it to check for convergence, defaults to None. TODO: not feasible for huge datasets.

  • waterfall (Union[None, array_like], optional) – If not None, NMF is computed using the waterfall method. The parameter should be an array with the number of iterations per sub-computation, defaults to None

  • H (Union[None, array_like], optional) – Init matrix for H of shape (n_components, n_samples), defaults to None

  • W (Union[None, array_like], optional) – Init matrix for W of shape (n_features, n_components), defaults to None

  • indices (Union[None, array_like], optional) – If not None, apply method only to indices of data, defaults to None

Returns:

(H, W): The components matrix and the mixing matrix.
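
Example: a decomposition sketch with a hypothetical number of components.

    # Factor the stack into 3 non-negative components.
    H, W = dataset.nmf(num_components=3, num_iter=100)
    # H: components matrix, W: mixing matrix.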

partition_by_intensity(bins=None, bottom_bin=None, top_bin=None)[source]#

Partitions the frames according to their intensity:

– First, computes the intensity of each frame as the variance of the image after passing a Gaussian filter.

– Second, computes the histogram of the intensities.

– Finally, keeps the frames whose intensity is bigger than a threshold, set by default to the second bin of the histogram.

Parameters:

bins (int) – Number of bins to use for the intensity histogram.

pca(num_components=None, chunk_size=500, indices=None, return_vals=False)[source]#

Compute Principal Component Analysis on the data. If not already done, the method first converts the data into an HDF5 file object with the images flattened along the rows.

Parameters:
  • num_components (Union[None, int]) – Number of components to find. If None, it uses the minimum between the number of images and the number of pixels.

  • chunk_size (int, optional) – Number of chunks for which the whitening must be computed, incrementally; defaults to 500.

  • indices (Union[None, array_like], optional) – If not None, apply method only to indices of data, defaults to None

  • return_vals (bool, optional) – If True, returns only the singular values of PCA, else returns the components and the mixing matrix, defaults to False

Returns:

(H, W): The components matrix and the mixing matrix.
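
Example: a sketch; the number of components is hypothetical.

    # Inspect the singular values to choose the number of components.
    vals = dataset.pca(return_vals=True)

    # Full decomposition: components matrix H and mixing matrix W.
    H, W = dataset.pca(num_components=3)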

project_data(dimension, indices=None, _dir=None)[source]#

Applies a projection to the data. The new Dataset will have the same size as the chosen dimension, onto which the data is projected.

Parameters:
  • dimension (array_like) – Dimensions to project the data onto

  • indices (Union[None, array_like]) – Indices of the images to use for the projection. If None, the projection is done using all data.

  • _dir (str) – Directory filename to save the new data

recover_weak_beam(n, indices=None)[source]#

Sets to zero all pixels higher than n times the standard deviation across the stack dimension.

Parameters:
  • n (float) – Multiple of the standard deviation used as the top threshold.

  • indices (Union[None, array_like]) – Indices of the images to use for the filtering. If None, the filtering is done using all data.

remove_dim(axis)[source]#

Removes a dimension from the dimension’s dictionary.

Parameters:

axis (int) – axis of the dimension.

reshape_data()[source]#

Function that reshapes the data to fit the dimensions.

stop_operation(operation)[source]#

Method used when threads are created to apply functions to the dataset. When called, the stop flag of the given operation is set to 0 so that, if the operation is running in another thread, it knows it has to stop.

Parameters:

operation (int) – operation to stop

property title#
to_memory(indices)[source]#

Method to load only part of the data into memory. Returns a new dataset with the data corresponding to the given indices loaded into memory. A new indices array has to be given; if all the data has to be loaded into memory, set in_memory to True instead, so that no new dataset is created.

Parameters:

indices (array_like) – Indices of the new dataset.
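
Example: a sketch of loading part of a disk-based dataset into memory; the indices are hypothetical.

    import numpy

    # First 50 frames of a dataset created with in_memory=False.
    subset = dataset.to_memory(numpy.arange(50))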

property transformation#
zsum(indices=None, dimension=None)[source]#