ioutils

cryocat.utils.ioutils.ctffind4_read(file_path)

Reads a ctffind4 file (typically .txt) and returns a pandas DataFrame with the following columns: defocus1, defocus2, astigmatism, phase_shift, and defocus_mean. All defocus values are in micrometers.

Parameters:

file_path: str: Path to the ctffind4 file (typically in .txt format) that contains values for all tilts in the tilt series. The defocus values are converted to micrometers.

Returns:

pandas.DataFrame: A DataFrame with defocus1, defocus2, astigmatism, phase_shift, and defocus_mean columns. All defocus values are in micrometers.

cryocat.utils.ioutils.defocus_load(input_data, file_type='gctf')

Load defocus data from various file types or a pandas DataFrame.

Parameters:

input_data: pd.DataFrame or str or numpy.ndarray: The input data to load. If a pandas DataFrame is provided, it is assumed to already contain the defocus data. If a string is provided, it is assumed to be the path to a file containing the defocus data. If a numpy ndarray is provided, it is assumed to be a 2D array of shape Nx5, where N is the number of tilts.
file_type: str, default="gctf": The type of file to load if input_data is a string. Supported file types are "gctf", "ctffind4", and "warp".

Returns:

defocus_df: pd.DataFrame: A pandas DataFrame containing the loaded defocus data with columns "defocus1", "defocus2", "astigmatism", "phase_shift", and "defocus_mean". All defocus values are in micrometers. The phase shift is in radians.

Raises:

ValueError: If the provided file_type is not supported.

cryocat.utils.ioutils.defocus_remove_file_entries(input_file, entries_to_remove, file_type='gctf', numbered_from_1=True, output_file=None)

Remove specified entries from a file and optionally update a specification file.

Parameters:

input_file: str: The path to the input file from which entries will be removed.
entries_to_remove: str, list, or numpy.ndarray: The entries to remove, given as a path to a CSV file, a path to a text file containing indices (one per line), or a list/array of indices. If a CSV file is provided, it is expected to have a column named "ToBeRemoved".
file_type: str, default='gctf': The type of the input file. Can be 'gctf' or 'ctffind4'.
numbered_from_1: bool, default=True: Indicates whether the entries in entries_to_remove are numbered from 1.
output_file: str, optional: The path to the output file where the modified content will be saved. If None, the input_file will be overwritten. Defaults to None.

Returns:

None: The function modifies the input file and/or creates an output file as specified.

Notes

The function handles two file types: ‘gctf’ and ‘ctffind4’, applying different methods for removing lines based on the file type. The indices_load and indices_reset functions are used to manage the indices of entries to be removed and to reset them if necessary.

cryocat.utils.ioutils.df_load(input_data, header=None)

Load data into a pandas DataFrame from various input types.

Parameters:

input_data: pandas.DataFrame, str, or numpy.ndarray

The data to load. Can be:

pandas.DataFrame: returned as-is.
str: path to a CSV file that will be read with pandas.read_csv().
numpy.ndarray: converted to a DataFrame using the optional header.

header: list of str, optional

Column names to assign when input_data is a numpy.ndarray. The length must match the number of columns (or the length of a 1-D array). If None, the DataFrame is returned without column names (integer index columns). Ignored for DataFrame and CSV inputs. Defaults to None.

Returns:

pandas.DataFrame: DataFrame representation of the input data.

Raises:

ValueError: If input_data is not a DataFrame, a string path, or a NumPy array.
ValueError: If header is provided and its length does not match the number of columns in the array.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> df_load(pd.DataFrame({"a": [1, 2]}))
   a
0  1
1  2

>>> df_load("data.csv")  
...

>>> df_load(np.array([[1, 2], [3, 4]]), header=["x", "y"])
   x  y
0  1  2
1  3  4

cryocat.utils.ioutils.dict_load(input_data)

Load a dictionary from a JSON string or copy an existing dictionary.

Parameters:

input_data: str or dict: The input data to load. This can be a JSON string or an existing dictionary.

Returns:

dict: A dictionary loaded from the JSON string or a deep copy of the provided dictionary.

Raises:

ValueError: If input_data is neither a string nor a dictionary.

Notes

If input_data is a JSON string and cannot be decoded, an empty dictionary is returned and an error message is printed.

Examples

>>> json_str = '{"key": "value"}'
>>> dict_load(json_str)
{'key': 'value'}

>>> original_dict = {'key': 'value'}
>>> new_dict = dict_load(original_dict)
>>> new_dict is original_dict
False

cryocat.utils.ioutils.dict_write(dict_data, file_name)

Write the given dictionary to a file in JSON format.

Parameters:

dict_data: dict: Dictionary containing the data to write to the file.
file_name: str: The name of the file where the dictionary will be written.

Returns:

None

cryocat.utils.ioutils.dimensions_load(input_dims, tomo_idx=None)

Load and process tomogram dimensions from various input formats.

Parameters:

input_dims: pd.DataFrame, str, list, or np.ndarray: Either a path to a file with the dimensions, an array-like input, or a pandas.DataFrame. The shape of the input should be 1x3 (x y z) for a single tomogram or Nx4 (tomo_id x y z) for multiple tomograms. If a file is given, the dimensions can be read from a .com file (typically tilt.com) using the parameters FULLIMAGE (x, y) and THICKNESS (z), from a .star file (RELION 5 and later), or from a general file with either 1x3 values on a single line or Nx4 values on N lines (separated by one or more spaces).
tomo_idx: str or array-like, optional: Path to a file containing tomogram indices or a 1D array with the indices. It is used only if input_dims does not contain 4 columns (i.e., does not have tomo_id). If provided, the function replicates the 1x3 dimensions to the length of the tomo_idx array and adds a "tomo_id" column. Defaults to None.

Returns:

pd.DataFrame: A DataFrame containing the dimensions with columns adjusted based on the input shape. Columns will be named as [“x”, “y”, “z”] or [“tomo_id”, “x”, “y”, “z”].

Raises:

ValueError: If the dimensions do not conform to the expected shapes of 1x3 or Nx4, or if the file does not exist.

Notes

If input_dims is a string ending with “.com”, it is assumed to be a path to a .com file and will be processed accordingly.
If input_dims is a string not ending with “.com”, it is treated as a path to a CSV file.
The function can handle reshaping of input dimensions if they are provided as a list or a one-dimensional numpy array.

cryocat.utils.ioutils.extract_defocus_data(df, u_col, v_col, angle_col, phase_col=None)

Extracts and standardizes defocus-related data from a DataFrame.

Parameters:

df: pandas.DataFrame: The raw DataFrame parsed from a STAR file.
u_col: str: Column name for Defocus U (rlnDefocusU).
v_col: str: Column name for Defocus V (rlnDefocusV).
angle_col: str: Column name for astigmatism angle (rlnDefocusAngle).
phase_col: str or None, optional: Column name for phase shift (rlnPhaseShift). If None or missing, phase_shift is set to 0.

Returns:

pandas.DataFrame: A DataFrame with columns defocus1, defocus2, astigmatism, phase_shift, and defocus_mean. Defocus values are in micrometers.
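
The standardization described above can be sketched for a single row in plain Python. This is only an illustration, not the library's implementation: the helper name standardize_defocus_row is hypothetical, and the sketch assumes that the STAR defocus values are in Angstroms (as in RELION), that the astigmatism column carries the defocus angle unchanged, and that defocus_mean is the arithmetic mean of the two converted defocus values.

```python
def standardize_defocus_row(defocus_u, defocus_v, angle, phase_shift=None):
    """Hypothetical helper: map one STAR row to the standardized columns."""
    angstrom_per_micrometer = 1.0e4
    defocus1 = defocus_u / angstrom_per_micrometer  # Angstrom -> micrometer
    defocus2 = defocus_v / angstrom_per_micrometer
    return {
        "defocus1": defocus1,
        "defocus2": defocus2,
        "astigmatism": angle,  # assumption: angle is passed through as-is
        "phase_shift": 0.0 if phase_shift is None else phase_shift,
        "defocus_mean": (defocus1 + defocus2) / 2.0,  # assumption: plain mean
    }


# A row without a phase-shift column falls back to phase_shift = 0.
row = standardize_defocus_row(25000.0, 27000.0, 45.0)
```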

cryocat.utils.ioutils.fileformat_replace_pattern(filename_format, input_number, test_letter, raise_error=True)

Replace a pattern in a filename format string with a given number. If the pattern is longer than the number of digits in the input number, the number is padded with leading zeros.

Parameters:

filename_format: str: The filename format string containing the pattern to be replaced. The pattern has to start with $ followed by an arbitrarily long sequence of test_letter. For instance, some_text_$AAA_rest with test_letter "A" and input number 79 results in some_text_079_rest.
input_number: int: The number to be inserted into the pattern.
test_letter: str: The letter used in the pattern to identify the sequence to be replaced.
raise_error: bool, default=True: Whether to raise a ValueError if the pattern is not found in the filename format string.

Returns:

str: The filename format string with the pattern replaced by the input number, padded with zeros if the input number has fewer digits than the pattern.

Raises:

ValueError: If the pattern is not found in the filename format string and raise_error is True, or if the input number has more digits than the pattern.

Examples

>>> fileformat_replace_pattern("file_/$AAA/$B.txt", 123, "A")
'file_/123/$B.txt'

>>> fileformat_replace_pattern("some_text_$AAA_rest", 79, "A")
'some_text_079_rest'

>>> fileformat_replace_pattern("file_/$A/$B.txt", 123, "C")
ValueError: The format file_/$A/$B.txt does not contain any sequence of \$ followed by C.

>>> fileformat_replace_pattern("file_/$A/$B.txt", 12345, "A")
ValueError: Number '12345' has more digits than string '\$A'.
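
The replacement rule can also be sketched as a small standalone function built on a regular expression. This is a hedged illustration rather than the cryocat source: the name replace_pattern is hypothetical, only the first matching pattern is replaced, and returning the format unchanged when raise_error is False is an assumption.

```python
import re


def replace_pattern(filename_format, input_number, test_letter, raise_error=True):
    """Hypothetical sketch: swap '$' + run of test_letter for a padded number."""
    match = re.search(r"\$" + re.escape(test_letter) + "+", filename_format)
    if match is None:
        if raise_error:
            raise ValueError(
                f"The format {filename_format} does not contain $ followed by {test_letter}."
            )
        return filename_format  # assumption: leave the format untouched

    width = len(match.group(0)) - 1  # letters after the '$'
    digits = str(input_number)
    if len(digits) > width:
        raise ValueError(f"Number '{digits}' has more digits than string '{match.group(0)}'.")

    # zfill pads with leading zeros up to the width of the pattern
    return filename_format[: match.start()] + digits.zfill(width) + filename_format[match.end():]


padded = replace_pattern("some_text_$AAA_rest", 79, "A")
```

Using re.escape on test_letter keeps the sketch safe for letters that have a special meaning in regular expressions.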

cryocat.utils.ioutils.fsc_read(input_path, pixel_size=None, box_size=None)

Read an FSC curve from a CSV, XML, or TXT file into a DataFrame.

Parameters:

input_path: str

Path to the FSC file. Supported extensions:

.csv: Must contain a column x and at least one FSC column (e.g. uncorrected_fsc, corrected_fsc).
.xml: ChimeraX-compatible FSC XML with <coordinate><x> / <y> children. Data are loaded as columns x and uncorrected_fsc.
.txt: Single-column file with one FSC value per line. Requires pixel_size and box_size to compute the x column (spatial frequency in 1/Å).

pixel_size: float, optional

Pixel size in Angstroms. Required for .txt input.

box_size: int, optional

Box edge length in voxels. Required for .txt input.

Returns:

pandas.DataFrame: DataFrame with column x and one or more FSC columns.

Raises:

ValueError: If the extension is unsupported, or if pixel_size / box_size are missing for a .txt file.
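
For .txt input, the x column has to be derived from pixel_size and box_size. The sketch below shows the standard relation between a Fourier shell index k and its spatial frequency, k / (box_size * pixel_size), in 1/Å. The helper name shell_frequencies is hypothetical, and whether the library starts counting shells at 0 or 1 is not stated in this documentation, so the zero-based start is an assumption.

```python
def shell_frequencies(n_shells, pixel_size, box_size):
    """Hypothetical helper: spatial frequency (1/A) for shells 0..n_shells-1."""
    if pixel_size is None or box_size is None:
        raise ValueError("pixel_size and box_size are required")
    box_length = box_size * pixel_size  # box edge length in Angstroms
    return [k / box_length for k in range(n_shells)]


# 100-voxel box at 2 A/pixel: shell 50 sits at Nyquist, 1 / (2 * 2 A) = 0.25 1/A
freqs = shell_frequencies(51, pixel_size=2.0, box_size=100)
```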

cryocat.utils.ioutils.fsc_write(output_path, x_vals, y_vals, pixel_size=None)

Write an FSC curve to a CSV or ChimeraX-compatible XML file.

Parameters:

output_path: str

Destination file path. Extension selects format:

.csv: Two-column comma-separated file with columns x and fsc.
.xml: ChimeraX-compatible XML with <fsc> root containing <coordinate><x>/<y> children.

x_vals: array-like

X-axis values (Fourier shell index or spatial frequency in 1/Å).

y_vals: array-like

FSC correlation values.

pixel_size: float, optional

When provided, the XML xaxis attribute is set to "Resolution (1/A)"; otherwise "Fourier shell".

Raises:

ValueError: If the extension is neither .csv nor .xml.
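
The XML layout described above (an fsc root with coordinate children holding x and y) can be sketched with the standard library. This is an illustration only: the function name fsc_to_xml_string is hypothetical, and the real output may carry additional attributes that are not listed in this documentation.

```python
import xml.etree.ElementTree as ET


def fsc_to_xml_string(x_vals, y_vals, pixel_size=None):
    """Hypothetical sketch: build the <fsc> document as a string."""
    xaxis = "Resolution (1/A)" if pixel_size is not None else "Fourier shell"
    root = ET.Element("fsc", {"xaxis": xaxis})
    for x, y in zip(x_vals, y_vals):
        coordinate = ET.SubElement(root, "coordinate")
        ET.SubElement(coordinate, "x").text = str(x)
        ET.SubElement(coordinate, "y").text = str(y)
    return ET.tostring(root, encoding="unicode")


xml_text = fsc_to_xml_string([0, 1, 2], [1.0, 0.9, 0.5])
parsed = ET.fromstring(xml_text)  # round-trip to check the structure
```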

cryocat.utils.ioutils.gctf_read(file_path)

Reads a gctf star file and returns a pandas DataFrame with the following columns: defocus1, defocus2, astigmatism, phase_shift, and defocus_mean. All defocus values are in micrometers.

Parameters:

file_path: str: Path to the gctf star file that contains values for all tilts in the tilt series. The columns to be read in are "rlnDefocusU", "rlnDefocusV", "rlnDefocusAngle", and "rlnPhaseShift" (if present; otherwise the phase shift is set to 0.0). The defocus values are converted to micrometers.

Returns:

pandas.DataFrame: A DataFrame with defocus1, defocus2, astigmatism, phase_shift, and defocus_mean columns. All defocus values are in micrometers.

cryocat.utils.ioutils.get_all_files_matching_pattern(filename_pattern, numeric_wildcards_only=False, return_wildcards=True)

Get all files in a directory that match a specified filename pattern.

Parameters:

filename_pattern: str: The pattern to match filenames against, which can include wildcards.
numeric_wildcards_only: bool, default=False: If True, only files with numeric wildcard parts will be included.
return_wildcards: bool, default=True: If True, the function returns a tuple of (file_names, wildcards). If False, only file_names are returned.

Returns:

list or tuple: A list of file paths that match the given pattern. If return_wildcards is True, a tuple (file_names, wildcards) is returned instead, where wildcards are the parts of the filenames that matched the wildcard in the pattern.

Raises:

FileNotFoundError: If the specified directory does not exist.

Notes

The function uses regular expressions to match the filenames against the provided pattern. The ‘*’ character in the pattern is treated as a wildcard that can match any sequence of characters.
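
One way to turn a '*' pattern into a regular expression is sketched below. It is not the library's code: the names wildcard_to_regex and match_names are hypothetical, the sketch works on a list of names instead of a directory, and it assumes a single wildcard per pattern.

```python
import re


def wildcard_to_regex(filename_pattern, numeric_wildcards_only=False):
    """Hypothetical sketch: escape literal parts, capture each '*'."""
    group = r"(\d+)" if numeric_wildcards_only else r"(.*)"
    literal_parts = [re.escape(part) for part in filename_pattern.split("*")]
    return re.compile("^" + group.join(literal_parts) + "$")


def match_names(names, filename_pattern, numeric_wildcards_only=False):
    """Return (matching names, text matched by the wildcard)."""
    regex = wildcard_to_regex(filename_pattern, numeric_wildcards_only)
    file_names, wildcards = [], []
    for name in names:
        match = regex.match(name)
        if match:
            file_names.append(name)
            wildcards.append(match.group(1) if match.groups() else "")
    return file_names, wildcards


names = ["tomo_001.mrc", "tomo_abc.mrc", "other.mrc"]
any_match = match_names(names, "tomo_*.mrc")
numeric_match = match_names(names, "tomo_*.mrc", numeric_wildcards_only=True)
```

Escaping the literal parts matters because characters such as '.' would otherwise match any character.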

cryocat.utils.ioutils.get_data_from_warp_xml(xml_file_path, node_name, node_level=1)

This function parses an XML file and extracts data based on the provided XPath expression. The function supports two levels of extraction: level 1 and level 2.

Parameters:

xml_file_path: str: The path to the XML file.
node_name: str: The XPath expression to find elements in the XML file.
node_level: int, default=1: The level of extraction. The default of 1 works for nodes that have values without further tags (i.e., one value per line without an XML tag). The other allowed level is 2, which should be used for all tags that have values stored in Node tags (such as GridCTF).

Returns:

ndarray or None: The extracted data as a NumPy array if elements are found, otherwise None.

Raises:

Exception: If there is an error reading the XML file.
ValueError: If node_level is not 1 or 2.

Examples

>>> data = get_data_from_warp_xml('path/to/xml/file.xml', 'GridCTF', node_level=2)

cryocat.utils.ioutils.get_file_encoding(file_path)

Detects the encoding of a file by trying a list of common encodings.

Parameters:

file_path: str: The path to the file for which the encoding needs to be determined.

Returns:

str: The name of the encoding if the file is successfully read.

Raises:

UnicodeEncodeError: If the file cannot be read with any of the tried encodings.

Examples

>>> get_file_encoding("example.txt")
'utf-8'
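
The trial-and-error approach can be sketched as follows. The helper name detect_encoding and the candidate list are assumptions, since this documentation does not list which encodings are tried. For simplicity the sketch raises a plain ValueError when every candidate fails, which differs from the exception documented above.

```python
import os
import tempfile


def detect_encoding(file_path, encodings=("utf-8", "latin-1")):
    """Hypothetical sketch: return the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(file_path, "r", encoding=encoding) as handle:
                handle.read()  # decoding errors surface while reading
            return encoding
        except UnicodeError:
            continue  # try the next candidate
    raise ValueError(f"None of {encodings} could decode {file_path}")


# Usage on two throwaway files holding the same text in different encodings
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.txt")
    latin_path = os.path.join(tmp, "latin1.txt")
    with open(utf8_path, "wb") as handle:
        handle.write("d\u00e9focus".encode("utf-8"))
    with open(latin_path, "wb") as handle:
        handle.write("d\u00e9focus".encode("latin-1"))
    utf8_result = detect_encoding(utf8_path)
    latin_result = detect_encoding(latin_path)
```

Because latin-1 can decode any byte sequence, it only makes sense as the last candidate in such a list.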

cryocat.utils.ioutils.get_filename_from_path(input_path, with_extension=True)

Get the filename from the given input path.

Parameters:

input_path: str

The input path from which the filename is to be extracted.

with_extension: bool, default=True

Flag to indicate whether to include the file extension in the filename. Default is True.

Returns:

str: The extracted filename from the input path.

cryocat.utils.ioutils.get_files_prefix_suffix(dir_path, prefix='', suffix='')

Retrieve files from a specified directory that start with a given prefix and end with a given suffix.

Parameters:

dir_path: str: The path to the directory from which to retrieve files.
prefix: str, default="": The prefix that the files should start with. If omitted, no filtering based on prefix is done.
suffix: str, default="": The suffix that the files should end with. If omitted, no filtering based on suffix is done.

Returns:

list: A list of filenames that match the given prefix and suffix criteria.

Raises:

ValueError: If the directory does not exist or the path given by dir_path is not readable.

Examples

>>> get_files_prefix_suffix('/path/to/dir', prefix='test', suffix='.txt')
['test_file1.txt', 'test_file2.txt']
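
A minimal standalone version of this filtering, using only os.listdir, might look like the sketch below. The name list_files is hypothetical, and sorting the result is a choice made here for reproducible output, not something this documentation promises.

```python
import os
import tempfile


def list_files(dir_path, prefix="", suffix=""):
    """Hypothetical sketch: files in dir_path matching prefix and suffix."""
    return sorted(
        name
        for name in os.listdir(dir_path)
        if name.startswith(prefix)
        and name.endswith(suffix)
        and os.path.isfile(os.path.join(dir_path, name))
    )


# Usage on a throwaway directory with three empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("test_a.txt", "test_b.csv", "other.txt"):
        open(os.path.join(tmp, name), "w").close()
    txt_tests = list_files(tmp, prefix="test", suffix=".txt")
    all_files = list_files(tmp)  # empty prefix/suffix match everything
```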

cryocat.utils.ioutils.get_number_of_lines_with_character(filename, character)

Count the number of lines in a file that start with a specified character.

Parameters:

filename: str: The path to the file to be read.
character: str: The character to check at the start of each line.

Returns:

int: The number of lines starting with the specified character.

cryocat.utils.ioutils.imod_com_read(filename)

Reads a file in IMOD’s .com format and returns a dictionary containing the data.

Parameters:

filename: str: The name of the IMOD .com file to be read. All lines starting with # or $ are ignored; the rest is read in as a dictionary. The keys are the first words of each line, and the values are the remaining words converted to the correct type.

Returns:

dict: A dictionary containing the data read from the file.

Notes

Lines starting with ‘#’ or ‘$’ are ignored.
Numeric values are converted to integers if they are digits, and to floats if they are floating-point numbers.
Non-numeric values are stored as strings.
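
The parsing rules in the notes above can be sketched on a list of lines. This is an illustration under assumptions: the name parse_com_lines is hypothetical, and storing the values of each key as a list is a choice made here, since this documentation does not say how multiple values are stored.

```python
def parse_com_lines(lines):
    """Hypothetical sketch: parse IMOD .com style lines into a dictionary."""

    def convert(token):
        if token.lstrip("-").isdigit():
            return int(token)  # plain (optionally negative) integers
        try:
            return float(token)
        except ValueError:
            return token  # anything non-numeric stays a string

    params = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "$")):
            continue  # comments and command lines are ignored
        key, *rest = stripped.split()
        params[key] = [convert(token) for token in rest]
    return params


com_lines = [
    "# Command file to run tilt",
    "$tilt -StandardInput",
    "FULLIMAGE 4096 4096",
    "THICKNESS 1200",
    "SHIFT 0.0 -12.5",
    "OutputFile tomo.rec",
]
params = parse_com_lines(com_lines)
```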

cryocat.utils.ioutils.indices_load(input_data, numbered_from_1=True)

Load indices from a specified input source.

Parameters:

input_data: str, list, or numpy.ndarray: The input data can be a path to a CSV file, a path to a text file containing indices (one per line), or a list/array of indices. If a CSV file is provided, it is expected to have a column named "ToBeRemoved".
numbered_from_1: bool, default=True: If True, the returned indices will be adjusted to be zero-based (i.e., subtracting 1 from each index).

Returns:

numpy.ndarray: An array of indices, adjusted based on the input data and the numbered_from_1 flag.

Raises:

ValueError: If the input data is neither a path to a valid file nor a list/array.

cryocat.utils.ioutils.indices_reset(input_data)

Reset the indices of a CSV file by modifying specific columns.

Parameters:

input_data: str: The path to the CSV file that needs to be processed.

Returns:

None

Notes

This function reads a CSV file into a DataFrame, checks for the presence of a “Removed” column, and updates it based on the “ToBeRemoved” column. It then resets the “ToBeRemoved” column to False and saves the modified DataFrame back to the original CSV file.

cryocat.utils.ioutils.is_float(value)

Check if a value can be converted to a float.

Parameters:

value: any: The value to be checked.

Returns:

bool: True if the value can be converted to a float, False otherwise.

Examples

>>> is_float(3.14)
True

>>> is_float("hello")
False

cryocat.utils.ioutils.one_value_per_line_read(file_path, data_type=<class 'numpy.float32'>)

Reads a file with one value per line and returns the values as a numpy ndarray. The values are expected to be in the format specified by data_type.

Parameters:

file_path: str: Path to a file in which each line is expected to hold one value of the type specified by data_type.
data_type: dtype, default=np.float32: The type of the data to be read in.

Returns:

numpy.ndarray: An ndarray with values of the type data_type.

Raises:

ValueError: If the file does not exist or if the file specified by file_path is not readable.

cryocat.utils.ioutils.relion_ctffind4_read(file_path)

Reads a Relion ctffind4-style STAR file and extracts defocus data.

Parameters:

file_path: str: Path to the STAR file.

Returns:

pandas.DataFrame: DataFrame with columns defocus1, defocus2, astigmatism, phase_shift, and defocus_mean.

cryocat.utils.ioutils.remove_lines(filename, lines_to_remove, start_str_to_skip=None, number_start=0, output_file=None)

Reads a file, removes the specified lines while skipping those that start with the given strings, and returns/writes out the rest.

Parameters:

filename: str: The name of the file to remove the lines from.
lines_to_remove: int or array-like: Array/list (or single int) with the numbers of the lines to be removed. If start_str_to_skip is empty, the indices correspond to the line numbers.
start_str_to_skip: str or array-like: Array/list of strings (or a single string). Lines starting with any of those strings will be ignored. The indices from lines_to_remove are applied only to the remaining lines. Defaults to None.
number_start: int, default=0: Whether the provided line numbers start counting at 0 or 1.
output_file: str: Path to a file to write the content to. Defaults to None (no file will be written).

Returns:

list: A list of lines that were kept.
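
The interplay of lines_to_remove, start_str_to_skip, and number_start can be sketched on an in-memory list of lines. The name filter_lines is hypothetical, and the sketch reflects one reading of the description: lines that start with a skip string are not counted when indexing and are always kept in the output.

```python
def filter_lines(lines, lines_to_remove, start_str_to_skip=None, number_start=0):
    """Hypothetical sketch: drop indexed lines, never counting skipped ones."""
    if isinstance(lines_to_remove, int):
        lines_to_remove = [lines_to_remove]
    if isinstance(start_str_to_skip, str):
        start_str_to_skip = [start_str_to_skip]
    skip_prefixes = tuple(start_str_to_skip or ())
    # Shift to zero-based positions among the countable lines
    to_remove = {index - number_start for index in lines_to_remove}

    kept, counter = [], 0
    for line in lines:
        if skip_prefixes and line.startswith(skip_prefixes):
            kept.append(line)  # assumption: skipped lines are always kept
            continue
        if counter not in to_remove:
            kept.append(line)
        counter += 1
    return kept


lines = ["# header\n", "a\n", "b\n", "c\n"]
# Remove the 1st and 3rd data lines (numbered from 1), ignoring the header
kept = filter_lines(lines, [1, 3], start_str_to_skip="#", number_start=1)
```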

cryocat.utils.ioutils.rot_angles_load(input_angles, angles_order='zxz')

Load rotation angles from a file or numpy array and arrange them in a specified order.

Parameters:

input_angles: str or numpy.ndarray: If a string, it should be the path to a CSV file containing the angles (three per line). If a numpy array, it should directly contain the angles.
angles_order: str, default="zxz": The order of the angles in the output array. Default is "zxz" (phi, theta, psi). If "zzx", the order will be adjusted to phi, psi, theta.

Returns:

angles: numpy.ndarray

A numpy array of shape (N, 3), where N is the number of angle sets. Each row contains the angles phi, theta, and psi in the specified order.

Raises:

ValueError: If input_angles is neither a string path to a CSV file nor a numpy array.

Examples

>>> rot_angles_load("path/to/angles.csv")
array([[phi1, theta1, psi1],
       [phi2, theta2, psi2],
       ...])

>>> rot_angles_load(numpy.array([[0, 45, 90], [90, 45, 0]]), "zzx")
array([[0, 90, 45],
       [90, 0, 45]])
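
The column reordering shown in the second example can be reproduced with plain lists. The helper name reorder_angles is hypothetical, it returns lists rather than a numpy array, and raising a ValueError for any other order string is an assumption of this sketch.

```python
def reorder_angles(angles, angles_order="zxz"):
    """Hypothetical sketch: reorder (phi, theta, psi) rows by angles_order."""
    if angles_order == "zxz":
        return [list(row) for row in angles]  # phi, theta, psi unchanged
    if angles_order == "zzx":
        # swap the last two columns: phi, psi, theta
        return [[phi, psi, theta] for phi, theta, psi in angles]
    raise ValueError(f"Unsupported angles_order: {angles_order}")


reordered = reorder_angles([[0, 45, 90], [90, 45, 0]], "zzx")
```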

cryocat.utils.ioutils.sort_files_by_idx(file_list, idx_list, order='ascending')

Sorts a list of files based on corresponding indices.

Parameters:

file_list: list of str: A list of file names to be sorted.
idx_list: list of str: A list of indices as strings corresponding to the file names.
order: str, default='ascending': The order in which to sort the files. Can be 'ascending' or 'descending'.

Returns:

numpy.ndarray: An array of file names sorted according to the specified order of indices.

Raises:

ValueError: If idx_list or file_list is not a list, if idx_list does not contain only integers, or if file_list does not contain only strings.

Examples

>>> sort_files_by_idx(['file1.txt', 'file2.txt', 'file3.txt'], ['2', '1', '3'])
array(['file2.txt', 'file1.txt', 'file3.txt'])

>>> sort_files_by_idx(['file1.txt', 'file2.txt', 'file3.txt'], ['2', '1', '3'], order='descending')
array(['file3.txt', 'file1.txt', 'file2.txt'])
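
The same ordering can be reproduced with the built-in sorted function, which may help when checking expected results by hand. The name sort_by_index is hypothetical and the sketch returns a list rather than a numpy array.

```python
def sort_by_index(file_list, idx_list, order="ascending"):
    """Hypothetical sketch: sort file names by their numeric index strings."""
    pairs = sorted(
        zip((int(index) for index in idx_list), file_list),
        key=lambda pair: pair[0],  # compare numerically, not as strings
        reverse=(order == "descending"),
    )
    return [name for _, name in pairs]


files = ["file1.txt", "file2.txt", "file3.txt"]
ascending = sort_by_index(files, ["2", "1", "3"])
descending = sort_by_index(files, ["2", "1", "3"], order="descending")
```

Converting the index strings to integers avoids the lexicographic trap where "10" would sort before "2".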

cryocat.utils.ioutils.tlt_load(input_tlt, sort_angles=True)

This function loads in tilt angles in degrees and returns them as ndarray. The input can be either a path to the file or an ndarray of tilts. The function will check if the input is already an array, and if not it will read in the data from the specified file type.

Parameters:

input_tlt: str or array-like: The input tilt data. If it is a numpy array, it will be returned as is. If it is a string, it can be a path to an mdoc file, an XML file (Warp), or any file where the angles are stored one per line (e.g., .tlt, .rawtlt, .csv, or .txt).
sort_angles: bool, default=True: Whether the tilts should be sorted from min to max tilt angle.

Returns:

ndarray: The tilt angles (in degrees) in the form of a numpy array.

Raises:

ValueError: If the input_tlt is neither a numpy array nor a valid file path. If the input_tlt is an empty numpy array or an empty list.

cryocat.utils.ioutils.total_dose_load(input_dose, sort_mdoc=True)

Load total dose for single tilt series that should be used for dose-filtering/weighting.

Parameters:

input_dose: str or array-like: The input dose. If an ndarray, it is returned as is. If a str, it can be a path to a CSV, XML (Warp), or mdoc file, or to a file with one value per line for each tilt image in the tilt series (any extension, typically .txt). The values should correspond to the total dose applied to each tilt image (i.e., low values for tilts acquired first, large values for tilt images acquired last). If an mdoc file is used, the total dose is computed either as PriorRecordDose + ExposureDose for each image, or as ExposureDose * (ZValue + 1) (starting from 1). The latter works only if ZValue corresponds to the order of acquisition, i.e., for tilt series that are not sorted from min to max tilt angle or are sorted with their ZValue unchanged.
sort_mdoc: bool, default=True: Whether the mdoc should be sorted by the tilt angles. This parameter is relevant only if the provided input is an mdoc file. If True, the mdoc will be sorted from min to max tilt angle; however, the ZValue will be kept as it was so the dose can still be computed correctly.

Returns:

numpy.ndarray: The corrected dose.

Raises:

ValueError: If the input dose is neither an ndarray nor a valid path to a file with the total dose.
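
The two mdoc formulas mentioned for input_dose can be illustrated with plain dictionaries standing in for mdoc sections. This is a sketch under assumptions: the name total_dose_per_tilt and the dictionary layout are hypothetical, and preferring PriorRecordDose whenever it is present is a guess, since this documentation does not state how the function chooses between the two formulas.

```python
def total_dose_per_tilt(entries):
    """Hypothetical sketch: total dose per tilt image from mdoc-like entries."""
    doses = []
    for entry in entries:
        if "PriorRecordDose" in entry:
            # dose accumulated before this image plus the image's own dose
            doses.append(entry["PriorRecordDose"] + entry["ExposureDose"])
        else:
            # valid only if ZValue reflects the acquisition order
            doses.append(entry["ExposureDose"] * (entry["ZValue"] + 1))
    return doses


by_zvalue = total_dose_per_tilt(
    [{"ZValue": z, "ExposureDose": 3.0} for z in range(3)]
)
by_prior = total_dose_per_tilt(
    [
        {"PriorRecordDose": 0.0, "ExposureDose": 3.0},
        {"PriorRecordDose": 3.0, "ExposureDose": 3.0},
    ]
)
```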

cryocat.utils.ioutils.warp_ctf_read(input_file)

Reads CTF parameters from a WARP XML file.

Parameters:

input_file: str: Path to the input WARP XML file.

Returns:

pandas.DataFrame: DataFrame containing the columns "defocus1", "defocus2", "astigmatism", "phase_shift", and "defocus_mean". All defocus values are in micrometers. The phase shift is in radians.

cryocat.utils.ioutils.z_shift_load(input_shift)

Loads tomogram z-shift from a file, number or numpy.ndarray.

Parameters:

input_shift: str or number or pandas.DataFrame or array-like: Either a path to a file with the z-shift, a single number, a pandas.DataFrame, or a numpy.ndarray. If the z-shift is loaded for more than one tomogram and differs between tomograms, the shape of the input should be Nx2, where N is the number of tomograms: the first column holds the tomogram id and the second the corresponding z-shift. If the input is an array stored in a file (typically with a .txt extension, although the extension does not matter), the file should have two values per line, tomo_id and z-shift, separated by one or more spaces. If the input is read from an IMOD .com file, the second value of the "SHIFT" parameter is used.

Returns:

pandas.DataFrame: Z-shift for a tomogram (with a single “z_shift” column) or for multiple tomograms (with columns “tomo_id”, “z_shift”).

Raises:

ValueError: If the input has the wrong size, the input type is unsupported, or the file path does not exist.