ioutils#
- cryocat.utils.ioutils.ctffind4_read(file_path)#
Reads a ctffind4 file (typically .txt) and returns a pandas DataFrame with the following columns: defocus1, defocus2, astigmatism, phase_shift, and defocus_mean. All defocus values are in micrometers.
- Parameters:
- file_path: str
Path to the ctffind4 file (typically in .txt format) that contains values for all tilts in the tilt series. The defocus values are converted to micrometers.
- Returns:
- pandas.DataFrame
A DataFrame with defocus1, defocus2, astigmatism, phase_shift, and defocus_mean columns. All defocus values are in micrometers.
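The unit conversion described above can be illustrated with a small standalone sketch. It does not call cryocat. It assumes the usual ctffind4 per-micrograph text layout (comment lines starting with `#`, then rows of micrograph number, defocus 1 and defocus 2 in Angstroms, azimuth of astigmatism, additional phase shift, and further fit columns), that the `astigmatism` column holds the azimuth, and that `defocus_mean` is the arithmetic mean. The helper name `parse_ctffind4_text` and the sample values are hypothetical, and the result is a list of dicts rather than a DataFrame.

```python
SAMPLE = """# illustrative header line (real files carry several comment lines)
1.000000 30000.000000 28000.000000 45.000000 0.000000 0.150000 6.500000
2.000000 31000.000000 29500.000000 44.000000 0.000000 0.140000 6.800000
"""


def parse_ctffind4_text(text):
    """Hypothetical helper: parse ctffind4-style rows, defocus in micrometers."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        values = [float(v) for v in line.split()]
        defocus1 = values[1] * 1e-4  # Angstrom -> micrometer
        defocus2 = values[2] * 1e-4
        rows.append(
            {
                "defocus1": defocus1,
                "defocus2": defocus2,
                "astigmatism": values[3],  # assumed: azimuth of astigmatism
                "phase_shift": values[4],
                "defocus_mean": (defocus1 + defocus2) / 2.0,  # assumed: plain mean
            }
        )
    return rows


for row in parse_ctffind4_text(SAMPLE):
    print(row)
```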
- cryocat.utils.ioutils.defocus_load(input_data, file_type='gctf')#
Load defocus data from various file types or a pandas DataFrame.
- Parameters:
- input_data: pd.DataFrame or str or numpy.ndarray
The input data to load. If a pandas DataFrame is provided, it is assumed to already contain the defocus data. If a string is provided, it is assumed to be the path to a file containing the defocus data. If a numpy ndarray is provided, it is assumed to be a 2D array of shape Nx5 where N is number of tilts.
- file_type: str, default="gctf"
The type of file to load if input_data is a string. Supported file types are "gctf", "ctffind4", and "warp". Default is "gctf".
- Returns:
- defocus_df: pd.DataFrame
A pandas DataFrame containing the loaded defocus data with columns "defocus1", "defocus2", "astigmatism", "phase_shift", "defocus_mean". All defocus values are in micrometers. The phase shift is in radians.
- Raises:
- ValueError
If the provided file_type is not supported.
- cryocat.utils.ioutils.defocus_remove_file_entries(input_file, entries_to_remove, file_type='gctf', numbered_from_1=True, output_file=None)#
Remove specified entries from a file and optionally update a specification file.
- Parameters:
- input_file: str
The path to the input file from which entries will be removed.
- entries_to_remove: str, list, or numpy.ndarray
The entries to remove can be specified as a file path to a CSV file, a text file containing indices (one per line), or a list/array of indices. If a CSV file is provided, it is expected to have a column named "ToBeRemoved".
- file_type: str, default='gctf'
The type of the input file. Can be 'gctf' or 'ctffind4'. Defaults to 'gctf'.
- numbered_from_1: bool, default=True
Indicates whether the entries in entries_to_remove are numbered from 1. Defaults to True.
- output_file: str, optional
The path to the output file where the modified content will be saved. If None, the input_file will be overwritten. Defaults to None.
- Returns:
- None
The function modifies the input file and/or creates an output file as specified.
Notes
The function handles two file types: 'gctf' and 'ctffind4', applying different methods for removing lines based on the file type. The indices_load and indices_reset functions are used to manage the indices of entries to be removed and to reset them if necessary.
- cryocat.utils.ioutils.df_load(input_data, header=None)#
Load data into a pandas DataFrame from various input types.
- Parameters:
- input_data: pandas.DataFrame, str, or numpy.ndarray
The data to load. Can be:
pandas.DataFrame: returned as-is.
str: path to a CSV file that will be read with pandas.read_csv().
numpy.ndarray: converted to a DataFrame using the optional header.
- header: list of str, optional
Column names to assign when input_data is a numpy.ndarray. The length must match the number of columns (or the length of a 1-D array). If None, the DataFrame is returned without column names (integer index columns). Ignored for DataFrame and CSV inputs. Defaults to None.
- Returns:
- pandas.DataFrame
DataFrame representation of the input data.
- Raises:
- ValueError
If input_data is not a DataFrame, a string path, or a NumPy array.
- ValueError
If header is provided and its length does not match the number of columns in the array.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> df_load(pd.DataFrame({"a": [1, 2]}))
   a
0  1
1  2
>>> df_load("data.csv")
...
>>> df_load(np.array([[1, 2], [3, 4]]), header=["x", "y"])
   x  y
0  1  2
1  3  4
- cryocat.utils.ioutils.dict_load(input_data)#
Load a dictionary from a JSON string or copy an existing dictionary.
- Parameters:
- input_data: str or dict
The input data to load. This can be a JSON string or an existing dictionary.
- Returns:
- dict
A dictionary loaded from the JSON string or a deep copy of the provided dictionary.
- Raises:
- ValueError
If input_data is neither a string nor a dictionary.
Notes
If input_data is a JSON string and cannot be decoded, an empty dictionary is returned and an error message is printed.
Examples
>>> json_str = '{"key": "value"}'
>>> dict_load(json_str)
{'key': 'value'}
>>> original_dict = {'key': 'value'}
>>> new_dict = dict_load(original_dict)
>>> new_dict is original_dict
False
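The documented behaviour (deep copy for a dict, JSON decoding for a string, empty dict on a decode error, ValueError otherwise) can be sketched with the standard library alone. This is an illustration under those stated rules, not the cryocat source; the name `dict_load_sketch` and the wording of the messages are hypothetical.

```python
import copy
import json


def dict_load_sketch(input_data):
    """Hypothetical re-creation of the documented dict_load behaviour."""
    if isinstance(input_data, dict):
        # A deep copy keeps nested containers independent of the original.
        return copy.deepcopy(input_data)
    if isinstance(input_data, str):
        try:
            return json.loads(input_data)
        except json.JSONDecodeError as err:
            # Documented fallback: report the problem and return an empty dict.
            print(f"Could not decode JSON string: {err}")
            return {}
    raise ValueError("input_data must be a JSON string or a dictionary")


original = {"outer": {"inner": [1, 2]}}
clone = dict_load_sketch(original)
clone["outer"]["inner"].append(3)  # does not touch `original`
print(original, clone)
```

The deep copy matters for nested dictionaries: a shallow copy would share the inner list between the two objects.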
- cryocat.utils.ioutils.dict_write(dict_data, file_name)#
Write the given dictionary to a file in JSON format.
- Parameters:
- dict_data: dict
Dictionary containing the data to write to the file.
- file_name: str
The name of the file where the dictionary will be written.
- Returns:
- None
- cryocat.utils.ioutils.dimensions_load(input_dims, tomo_idx=None)#
Load and process tomogram dimensions from various input formats.
- Parameters:
- input_dims: pd.DataFrame, str, list, or np.ndarray
Either a path to a file with the dimensions, array-like input, or a pandas.DataFrame. The shape of the input should be 1x3 (x y z) for a single tomogram or Nx4 for multiple tomograms (tomo_id x y z). For file input, the dimensions can be read from a .com file (typically tilt.com) using the parameters FULLIMAGE (x, y) and THICKNESS (z), from a .star file (RELION 5 or later), or from a general file with either 1x3 values on a single line or Nx4 values on N lines (separated by one or more spaces).
- tomo_idx: str or array-like, optional
Path to a file containing tomogram indices or a 1D array with the indices. It is used only if input_dims does not contain 4 columns (i.e., does not have tomo_id). If provided, the function replicates the 1x3 dimensions to the length of the tomo_idx array and adds a "tomo_id" column. Defaults to None.
- Returns:
- pd.DataFrame
A DataFrame containing the dimensions with columns adjusted based on the input shape. Columns will be named as [“x”, “y”, “z”] or [“tomo_id”, “x”, “y”, “z”].
- Raises:
- ValueError
If the dimensions do not conform to the expected shapes of 1x3 or Nx4, or if the file does not exist.
Notes
If input_dims is a string ending with ".com", it is assumed to be a path to a .com file and will be processed accordingly.
If input_dims is a string not ending with ".com", it is treated as a path to a CSV file.
The function can handle reshaping of input dimensions if they are provided as a list or a one-dimensional numpy array.
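The replication of a single 1x3 entry across several tomograms, as described for tomo_idx, can be pictured with the sketch below. It is plain Python that returns a list of dicts instead of a DataFrame; the helper name `replicate_dimensions` and the numbers are hypothetical.

```python
def replicate_dimensions(dims_xyz, tomo_ids):
    """Hypothetical helper: attach one (x, y, z) triple to every tomogram id."""
    if len(dims_xyz) != 3:
        raise ValueError("Expected exactly three values: x, y, z")
    x, y, z = dims_xyz
    # One output row per tomogram, mirroring the tomo_id, x, y, z column order.
    return [{"tomo_id": tomo_id, "x": x, "y": y, "z": z} for tomo_id in tomo_ids]


rows = replicate_dimensions([4096, 4096, 1800], [1, 5, 7])
for row in rows:
    print(row)
```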
- cryocat.utils.ioutils.extract_defocus_data(df, u_col, v_col, angle_col, phase_col=None)#
Extracts and standardizes defocus-related data from a DataFrame.
- Parameters:
- df: pandas.DataFrame
The raw DataFrame parsed from a STAR file.
- u_col: str
Column name for Defocus U (rlnDefocusU).
- v_col: str
Column name for Defocus V (rlnDefocusV).
- angle_col: str
Column name for astigmatism angle (rlnDefocusAngle).
- phase_col: str or None, optional
Column name for phase shift (rlnPhaseShift). If None or missing, phase_shift is set to 0.
- Returns:
- pandas.DataFrame
A DataFrame with columns: defocus1, defocus2, astigmatism, phase_shift, defocus_mean. Defocus values are in micrometers.
- cryocat.utils.ioutils.fileformat_replace_pattern(filename_format, input_number, test_letter, raise_error=True)#
Replace a pattern in a filename format string with a given number. If the pattern is longer than the number of digits in the input number, the number is padded with zeros.
- Parameters:
- filename_format: str
The filename format string containing the pattern to be replaced. The pattern has to start with $ followed by an arbitrarily long sequence of test_letter. For instance, some_text_$AAA_rest for test_letter "A" and input number 79 results in some_text_079_rest.
- input_number: int
The number to be inserted into the pattern.
- test_letter: str
The letter used in the pattern to identify the sequence to be replaced.
- raise_error: bool, default=True
Whether to raise a ValueError if the pattern is not found in the filename format string. Default is True.
- Returns:
- str
The filename format string with the pattern replaced by the input number, padded with zeros if the input number has fewer digits than the pattern.
- Raises:
- ValueError
If the pattern is not found in the filename format string and raise_error is True. If the input number has more digits than the pattern.
Examples
>>> fileformat_replace_pattern("file_/$AAA/$B.txt", 123, "A")
'file_123/$B.txt'
>>> fileformat_replace_pattern("some_text_$AAA_rest", 79, "A")
'some_text_079_rest'
>>> fileformat_replace_pattern("file_/$A/$B.txt", 123, "C")
ValueError: The format file_/$A/$B.txt does not contain any sequence of \$ followed by C.
>>> fileformat_replace_pattern("file_/$A/$B.txt", 12345, "A")
ValueError: Number '12345' has more digits than string '\$A'.
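The padding rule can be sketched with a regular expression: find `$` followed by a run of the test letter, take the length of that run as the field width, and zero-fill the number to that width. The sketch is a stand-in written from the description above, not the library code; `replace_pattern_sketch` is a hypothetical name, and replacing only the first occurrence is an assumption.

```python
import re


def replace_pattern_sketch(filename_format, input_number, test_letter, raise_error=True):
    """Hypothetical helper: replace the first '$' + letter run with a zero-padded number."""
    pattern = re.compile(r"\$" + re.escape(test_letter) + "+")
    match = pattern.search(filename_format)
    if match is None:
        if raise_error:
            raise ValueError(
                f"The format {filename_format!r} has no '$' followed by {test_letter!r}."
            )
        return filename_format  # assumed: unchanged input when errors are disabled
    width = len(match.group(0)) - 1  # letters after the '$' define the field width
    digits = str(input_number)
    if len(digits) > width:
        raise ValueError(f"Number {digits!r} has more digits than {match.group(0)!r}.")
    # Splice the padded number in place of the matched pattern.
    return filename_format[: match.start()] + digits.zfill(width) + filename_format[match.end() :]


print(replace_pattern_sketch("some_text_$AAA_rest", 79, "A"))
print(replace_pattern_sketch("tomo_$xxxx.mrc", 12, "x"))
```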
- cryocat.utils.ioutils.fsc_read(input_path, pixel_size=None, box_size=None)#
Read an FSC curve from a CSV, XML, or TXT file into a DataFrame.
- Parameters:
- input_path: str
Path to the FSC file. Supported extensions:
.csv: Must contain a column x and at least one FSC column (e.g. uncorrected_fsc, corrected_fsc).
.xml: ChimeraX-compatible FSC XML with <coordinate><x>/<y> children. Data are loaded as columns x and uncorrected_fsc.
.txt: Single-column file with one FSC value per line. Requires pixel_size and box_size to compute the x column (spatial frequency in 1/Å).
- pixel_size: float, optional
Pixel size in Angstroms. Required for .txt input.
- box_size: int, optional
Box edge length in voxels. Required for .txt input.
- Returns:
- pandas.DataFrame
DataFrame with column x and one or more FSC columns.
- Raises:
- ValueError
If the extension is unsupported, or if pixel_size / box_size are missing for a .txt file.
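For .txt input the x column has to be derived from pixel_size and box_size. A common convention, assumed here and not confirmed by this page, is that the i-th value (counting from 0) belongs to Fourier shell i with spatial frequency i / (box_size * pixel_size) in 1/Å. The helper `shell_frequencies` is hypothetical.

```python
def shell_frequencies(n_values, pixel_size, box_size):
    """Hypothetical helper: spatial frequency (1/A) for shells 0 .. n_values - 1."""
    if pixel_size is None or box_size is None:
        raise ValueError("pixel_size and box_size are required for .txt input")
    box_length = box_size * pixel_size  # physical box edge in Angstroms
    return [shell / box_length for shell in range(n_values)]


fsc_values = [1.0, 0.98, 0.9, 0.5, 0.143]
x = shell_frequencies(len(fsc_values), pixel_size=2.0, box_size=100)
for freq, fsc in zip(x, fsc_values):
    print(f"{freq:.4f} 1/A -> FSC {fsc}")
```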
- cryocat.utils.ioutils.fsc_write(output_path, x_vals, y_vals, pixel_size=None)#
Write an FSC curve to a CSV or ChimeraX-compatible XML file.
- Parameters:
- output_path: str
Destination file path. Extension selects format:
.csv: Two-column comma-separated file with columns x and fsc.
.xml: ChimeraX-compatible XML with <fsc> root containing <coordinate><x>/<y> children.
- x_vals: array-like
X-axis values (Fourier shell index or spatial frequency in 1/Å).
- y_vals: array-like
FSC correlation values.
- pixel_size: float, optional
When provided, the XML xaxis attribute is set to "Resolution (1/A)"; otherwise "Fourier shell".
- Raises:
- ValueError
If the extension is neither .csv nor .xml.
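The XML layout described above (an `fsc` root with `coordinate` children holding `x` and `y`) can be produced with `xml.etree.ElementTree`. The sketch below returns the XML as a string instead of writing a file; placing the `xaxis` attribute on the root element is an assumption, and `fsc_to_xml_string` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def fsc_to_xml_string(x_vals, y_vals, pixel_size=None):
    """Hypothetical helper: serialize an FSC curve to the described XML layout."""
    xaxis = "Resolution (1/A)" if pixel_size is not None else "Fourier shell"
    root = ET.Element("fsc", {"xaxis": xaxis})  # assumed: attribute sits on the root
    for x, y in zip(x_vals, y_vals):
        coordinate = ET.SubElement(root, "coordinate")
        ET.SubElement(coordinate, "x").text = str(x)
        ET.SubElement(coordinate, "y").text = str(y)
    return ET.tostring(root, encoding="unicode")


print(fsc_to_xml_string([0.0, 0.01, 0.02], [1.0, 0.8, 0.3], pixel_size=1.5))
```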
- cryocat.utils.ioutils.gctf_read(file_path)#
Reads a gctf STAR file and returns a pandas DataFrame with the following columns: defocus1, defocus2, astigmatism, phase_shift, and defocus_mean. All defocus values are in micrometers.
- Parameters:
- file_path: str
Path to the gctf STAR file that contains values for all tilts in the tilt series. The columns to be read in are "rlnDefocusU", "rlnDefocusV", "rlnDefocusAngle", and "rlnPhaseShift" (if present, otherwise the phase shift is set to 0.0). The defocus values are converted to micrometers.
- Returns:
- pandas.DataFrame
A DataFrame with defocus1, defocus2, astigmatism, phase_shift, and defocus_mean columns. All defocus values are in micrometers.
- cryocat.utils.ioutils.get_all_files_matching_pattern(filename_pattern, numeric_wildcards_only=False, return_wildcards=True)#
Get all files in a directory that match a specified filename pattern.
- Parameters:
- filename_pattern: str
The pattern to match filenames against, which can include wildcards.
- numeric_wildcards_only: bool, default=False
If True, only files with numeric wildcard parts will be included. Defaults to False.
- return_wildcards: bool, default=True
If True, the function returns a tuple of (file_names, wildcards). If False, only file_names are returned. Defaults to True.
- Returns:
- list
A list of file paths that match the given pattern. If return_wildcards is True, a tuple of (file_names, wildcards) is returned, where wildcards are the parts of the filenames that matched the wildcard in the pattern.
- Raises:
- FileNotFoundError
If the specified directory does not exist.
Notes
The function uses regular expressions to match the filenames against the provided pattern. The '*' character in the pattern is treated as a wildcard that can match any sequence of characters.
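Turning a `*` wildcard into a regular expression with a capture group is what makes it possible to return the matched wildcard parts. The sketch below works on a list of names rather than a directory so that it stays self-contained, and it assumes a single `*` in the pattern; `match_names` is a hypothetical helper, not the library function.

```python
import re


def match_names(names, filename_pattern, numeric_wildcards_only=False):
    """Hypothetical helper: match names against a pattern containing one '*'."""
    # Escape the literal parts and capture whatever the '*' stands for.
    literal_parts = [re.escape(part) for part in filename_pattern.split("*")]
    regex = re.compile("^" + "(.*)".join(literal_parts) + "$")
    matched, wildcards = [], []
    for name in names:
        found = regex.match(name)
        if found is None:
            continue
        captured = found.group(1) if regex.groups else ""
        if numeric_wildcards_only and not captured.isdigit():
            continue  # keep only wildcard parts made of digits
        matched.append(name)
        wildcards.append(captured)
    return matched, wildcards


names = ["tomo_001.mrc", "tomo_abc.mrc", "other.txt"]
print(match_names(names, "tomo_*.mrc"))
print(match_names(names, "tomo_*.mrc", numeric_wildcards_only=True))
```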
- cryocat.utils.ioutils.get_data_from_warp_xml(xml_file_path, node_name, node_level=1)#
This function parses an XML file and extracts data based on the provided XPath expression. The function supports two levels of extraction: level 1 and level 2.
- Parameters:
- xml_file_path: str
The path to the XML file.
- node_name: str
The XPath expression to find elements in the XML file.
- node_level: int, default=1
The level of extraction. Default is 1, which works for nodes that hold their values directly (i.e., one value per line without an XML tag). The other allowed level is 2, which should be used for all tags that store their values in Node tags (such as GridCTF).
- Returns:
- ndarray or None
The extracted data as a NumPy array if elements are found, otherwise None.
- Raises:
- Exception
If there is an error reading the XML file.
- ValueError
If node_level is not 1 or 2.
Examples
>>> data = get_data_from_warp_xml('path/to/xml/file.xml', 'GridCTF', node_level=2)
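The difference between the two levels can be shown on a small hand-written XML snippet. The snippet is only modelled on a Warp tilt-series file (values as element text for level 1, values in the `Value` attribute of `Node` children for level 2) and is an assumption, as is the helper `extract_values`, which returns a list instead of a NumPy array.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<TiltSeries>
  <Angles>-3.0
0.0
3.0</Angles>
  <GridCTF Width="1" Height="1" Depth="3">
    <Node X="0" Y="0" Z="0" Value="3.10" />
    <Node X="0" Y="0" Z="1" Value="3.05" />
    <Node X="0" Y="0" Z="2" Value="3.20" />
  </GridCTF>
</TiltSeries>"""


def extract_values(xml_text, node_name, node_level=1):
    """Hypothetical helper: pull numeric values from a level-1 or level-2 node."""
    if node_level not in (1, 2):
        raise ValueError("node_level must be 1 or 2")
    element = ET.fromstring(xml_text).find(node_name)
    if element is None:
        return None  # nothing found for this node name
    if node_level == 1:
        # Values are stored as whitespace-separated text inside the element.
        return [float(value) for value in element.text.split()]
    # Level 2: values live in the Value attribute of each Node child.
    return [float(node.get("Value")) for node in element.findall("Node")]


print(extract_values(SAMPLE_XML, "Angles"))
print(extract_values(SAMPLE_XML, "GridCTF", node_level=2))
```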
- cryocat.utils.ioutils.get_file_encoding(file_path)#
Detects the encoding of a file by trying a list of common encodings.
- Parameters:
- file_path: str
The path to the file for which the encoding needs to be determined.
- Returns:
- str
The name of the encoding if the file is successfully read.
- Raises:
- UnicodeEncodeError
If the file cannot be read with any of the tried encodings.
Examples
>>> get_file_encoding("example.txt")
'utf-8'
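Detection by trial decoding can be sketched as a loop over candidate encodings that returns the first one able to read the whole file. The candidate list, the helper names `detect_encoding` and `write_temp_bytes`, and the plain `UnicodeError` raised at the end are all assumptions made for this illustration; the list actually tried by the library is not given on this page.

```python
import os
import tempfile


def detect_encoding(file_path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(file_path, "r", encoding=encoding) as handle:
                handle.read()
        except UnicodeError:
            continue  # decoding failed, try the next candidate
        return encoding
    raise UnicodeError(f"None of {encodings} could decode {file_path}")


def write_temp_bytes(payload):
    """Write raw bytes to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
    handle.write(payload)
    handle.close()
    return handle.name


utf8_path = write_temp_bytes("d\u00e9focus".encode("utf-8"))
latin_path = write_temp_bytes("d\u00e9focus".encode("latin-1"))
print(detect_encoding(utf8_path), detect_encoding(latin_path))
os.remove(utf8_path)
os.remove(latin_path)
```

Order matters in such a list: latin-1 accepts any byte sequence, so it only makes sense as the last candidate.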
- cryocat.utils.ioutils.get_filename_from_path(input_path, with_extension=True)#
Get the filename from the given input path.
- Parameters:
- input_path: str
The input path from which the filename is to be extracted.
- with_extension: bool, default=True
Flag to indicate whether to include the file extension in the filename. Default is True.
- Returns:
- str
The extracted filename from the input path.
- cryocat.utils.ioutils.get_files_prefix_suffix(dir_path, prefix='', suffix='')#
Retrieve files from a specified directory that start with a given prefix and end with a given suffix.
- Parameters:
- dir_path: str
The path to the directory from which to retrieve files.
- prefix: str, default=""
The prefix that the files should start with. If omitted, no filtering based on prefix will be done. Defaults to an empty string.
- suffix: str, default=""
The suffix that the files should end with. If omitted, no filtering based on suffix will be done. Defaults to an empty string.
- Returns:
- list
A list of filenames that match the given prefix and suffix criteria.
- Raises:
- ValueError
If the file does not exist or if the specified path is not readable.
Examples
>>> get_files_prefix_suffix('/path/to/dir', prefix='test', suffix='.txt')
['test_file1.txt', 'test_file2.txt']
- cryocat.utils.ioutils.get_number_of_lines_with_character(filename, character)#
Count the number of lines in a file that start with a specified character.
- Parameters:
- filename: str
The path to the file to be read.
- character: str
The character to check at the start of each line.
- Returns:
- int
The number of lines starting with the specified character.
- cryocat.utils.ioutils.imod_com_read(filename)#
Reads a file in IMOD’s .com format and returns a dictionary containing the data.
- Parameters:
- filename: str
The name of the IMOD .com file to be read. All lines starting with # or $ are ignored; the rest is read in as a dictionary. The keys are the first words of each line, and the values are the remaining words converted to the correct type.
- Returns:
- dict
A dictionary containing the data read from the file.
Notes
Lines starting with '#' or '$' are ignored.
Numeric values are converted to integers if they are digits, and to floats if they are floating-point numbers.
Non-numeric values are stored as strings.
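The parsing rules in the notes can be sketched on an in-memory example. The sample text only imitates a tilt.com file, the helper names are hypothetical, and storing each value set as a list is an assumption about the shape of the returned dictionary.

```python
SAMPLE_COM = """# Command file to run Tilt
$tilt -StandardInput
InputProjections ts_001.ali
FULLIMAGE 4096 4096
THICKNESS 1800
SHIFT 0.0 -20.0
$if (-e ./savework) ./savework
"""


def convert(word):
    """Convert a word to int (digits only), float (if parseable), or leave as str."""
    if word.isdigit():
        return int(word)
    try:
        return float(word)
    except ValueError:
        return word


def parse_com_text(text):
    """Hypothetical helper: map the first word of each line to its converted values."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "$")):
            continue  # comments and shell-like command lines are ignored
        key, *raw_values = line.split()
        result[key] = [convert(value) for value in raw_values]  # assumed: list of values
    return result


parsed = parse_com_text(SAMPLE_COM)
print(parsed["FULLIMAGE"], parsed["THICKNESS"], parsed["SHIFT"])
```

With this layout, FULLIMAGE supplies x and y and THICKNESS supplies z, which is how the .com input of dimensions_load is described.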
- cryocat.utils.ioutils.indices_load(input_data, numbered_from_1=True)#
Load indices from a specified input source.
- Parameters:
- input_data: str, list, or numpy.ndarray
The input data can be a file path to a CSV file, a text file containing indices (one per line), or a list/array of indices. If a CSV file is provided, it is expected to have a column named "ToBeRemoved".
- numbered_from_1: bool, default=True
If True, the returned indices will be adjusted to be zero-based (i.e., subtracting 1 from each index). Defaults to True.
- Returns:
- numpy.ndarray
An array of indices, adjusted based on the input data and the numbered_from_1 flag.
- Raises:
- ValueError
If the input data is neither a path to a valid file nor a list/array.
- cryocat.utils.ioutils.indices_reset(input_data)#
Reset the indices of a CSV file by modifying specific columns.
- Parameters:
- input_data: str
The path to the CSV file that needs to be processed.
- Returns:
- None
Notes
This function reads a CSV file into a DataFrame, checks for the presence of a “Removed” column, and updates it based on the “ToBeRemoved” column. It then resets the “ToBeRemoved” column to False and saves the modified DataFrame back to the original CSV file.
- cryocat.utils.ioutils.is_float(value)#
Check if a value can be converted to a float.
- Parameters:
- value: any
The value to be checked.
- Returns:
- bool
True if the value can be converted to a float, False otherwise.
Examples
>>> is_float(3.14)
True
>>> is_float("hello")
False
- cryocat.utils.ioutils.one_value_per_line_read(file_path, data_type=<class 'numpy.float32'>)#
Reads a file with one value per line and returns the values as a numpy ndarray. The values are expected to be in the format specified in data_type.
- Parameters:
- file_path: str
Path to the file in which each line is expected to hold one value of the type specified by data_type.
- data_type: dtype, default=np.float32
The type of the data to be read in.
- Returns:
- numpy.ndarray
An ndarray with values of the type data_type.
- Raises:
- ValueError
If the file does not exist or if the file specified by file_path is not readable.
- cryocat.utils.ioutils.relion_ctffind4_read(file_path)#
Reads a Relion ctffind4-style STAR file and extracts defocus data.
- Parameters:
- file_path: str
Path to the STAR file.
- Returns:
- pandas.DataFrame
DataFrame with columns: defocus1, defocus2, astigmatism, phase_shift, defocus_mean.
- cryocat.utils.ioutils.remove_lines(filename, lines_to_remove, start_str_to_skip=None, number_start=0, output_file=None)#
Reads a file, removes specified lines while skipping those that start with given strings and returns/writes out the rest.
- Parameters:
- filename: str
The name of the file to remove the lines from.
- lines_to_remove: int or array-like
Array/list (or single int) with the numbers of the lines to be removed. If start_str_to_skip is empty, the indices correspond to the line numbers.
- start_str_to_skip: str or array-like
Array/list of strings (or single string). The lines starting with any of those strings will be ignored. The indices from lines_to_remove will be applied to filter only the remaining lines. Defaults to None.
- number_start: int, default=0
Whether the provided line numbers start counting at 0 or 1. Defaults to 0.
- output_file: str, optional
Path to a file to write out the content into. Defaults to None (no file will be written out).
- Returns:
- list
A list of lines that were kept.
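How lines_to_remove, start_str_to_skip, and number_start interact is easiest to see on a list of lines. The sketch below assumes that skipped lines (for example header lines) are neither counted nor removed and therefore stay in the output; that reading of "ignored" is an assumption, and `remove_lines_sketch` is a hypothetical helper operating on a list instead of a file.

```python
def remove_lines_sketch(lines, lines_to_remove, start_str_to_skip=None, number_start=0):
    """Hypothetical helper: drop selected lines, counting only non-skipped lines."""
    if isinstance(lines_to_remove, int):
        lines_to_remove = [lines_to_remove]
    # Normalize the requested numbers to 0-based positions.
    to_remove = {int(number) - number_start for number in lines_to_remove}
    if start_str_to_skip is None:
        skip = ()
    elif isinstance(start_str_to_skip, str):
        skip = (start_str_to_skip,)
    else:
        skip = tuple(start_str_to_skip)

    kept, counter = [], 0
    for line in lines:
        if skip and line.startswith(skip):
            kept.append(line)  # assumed: skipped lines are kept and not counted
            continue
        if counter not in to_remove:
            kept.append(line)
        counter += 1
    return kept


lines = ["# header", "tilt_1", "tilt_2", "tilt_3"]
print(remove_lines_sketch(lines, [1], start_str_to_skip="#", number_start=1))
print(remove_lines_sketch(lines, 1, start_str_to_skip="#", number_start=0))
```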
- cryocat.utils.ioutils.rot_angles_load(input_angles, angles_order='zxz')#
Load rotation angles from a file or numpy array and arrange them in a specified order.
- Parameters:
- input_angles: str or numpy.ndarray
If a string, it should be the path to a CSV file containing the angles (three per line). If a numpy array, it should directly contain the angles.
- angles_order: str, default="zxz"
The order of the angles in the output array. Default is “zxz” (phi, theta, psi). If “zzx”, the order will be adjusted to phi, psi, theta.
- Returns:
- angles: numpy.ndarray
A numpy array of shape (N, 3), where N is the number of angle sets. Each row contains the angles phi, theta, and psi in the specified order.
- Raises:
- ValueError
If input_angles is neither a string path to a CSV file nor a numpy array.
Examples
>>> rot_angles_load("path/to/angles.csv")
array([[phi1, theta1, psi1], [phi2, theta2, psi2], ...])
>>> rot_angles_load(numpy.array([[0, 45, 90], [90, 45, 0]]), "zzx")
array([[0, 90, 45], [90, 0, 45]])
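The "zzx" reordering amounts to swapping the last two columns, so that each row goes from (phi, theta, psi) to (phi, psi, theta). A plain-Python sketch with nested lists in place of a NumPy array follows; `reorder_angles` is a hypothetical helper, and raising on an unknown order is an assumption.

```python
def reorder_angles(angles, angles_order="zxz"):
    """Hypothetical helper: arrange (phi, theta, psi) rows in the requested order."""
    if angles_order == "zxz":
        return [list(row) for row in angles]  # phi, theta, psi unchanged
    if angles_order == "zzx":
        return [[phi, psi, theta] for phi, theta, psi in angles]
    raise ValueError(f"Unsupported angles_order: {angles_order!r}")  # assumed behaviour


print(reorder_angles([[0, 45, 90], [90, 45, 0]], "zzx"))
```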
- cryocat.utils.ioutils.sort_files_by_idx(file_list, idx_list, order='ascending')#
Sorts a list of files based on corresponding indices.
- Parameters:
- file_list: list of str
A list of file names to be sorted.
- idx_list: list of str
A list of indices as strings corresponding to the file names.
- order: str, default='ascending'
The order in which to sort the files. Can be ‘ascending’ or ‘descending’. Defaults to ‘ascending’.
- Returns:
- numpy.ndarray
An array of file names sorted according to the specified order of indices.
- Raises:
- ValueError
If idx_list and file_list aren’t of list type. If idx_list doesn’t contain only integers, or if file_list doesn’t contain only strings.
Examples
>>> sort_files_by_idx(['file1.txt', 'file2.txt', 'file3.txt'], ['2', '1', '3'])
array(['file2.txt', 'file1.txt', 'file3.txt'])
>>> sort_files_by_idx(['file1.txt', 'file2.txt', 'file3.txt'], ['2', '1', '3'], order='descending')
array(['file3.txt', 'file1.txt', 'file2.txt'])
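Because the indices arrive as strings, they need to be compared as integers; otherwise "10" would sort before "2". The sketch below shows that point with the standard library and returns a list rather than a NumPy array. `sort_by_index` is a hypothetical stand-in for the documented function.

```python
def sort_by_index(file_list, idx_list, order="ascending"):
    """Hypothetical helper: sort file names by their numeric string indices."""
    if order not in ("ascending", "descending"):
        raise ValueError("order must be 'ascending' or 'descending'")
    pairs = sorted(
        zip(idx_list, file_list),
        key=lambda pair: int(pair[0]),  # numeric comparison, not lexicographic
        reverse=(order == "descending"),
    )
    return [name for _, name in pairs]


print(sort_by_index(["ts_10.mrc", "ts_2.mrc", "ts_1.mrc"], ["10", "2", "1"]))
```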
- cryocat.utils.ioutils.tlt_load(input_tlt, sort_angles=True)#
Loads tilt angles in degrees and returns them as an ndarray. The input can be either a path to a file or an ndarray of tilts. The function checks whether the input is already an array and, if not, reads the data from the specified file type.
- Parameters:
- input_tlt: str or array-like
The input tilt data. If it is a numpy array, it will be returned as is. If it is a string, it can be a path to an mdoc file, an XML file (Warp), or any file where the angles are stored one per line (e.g. .tlt, .rawtlt, .csv, or .txt file).
- sort_angles: bool, default=True
Whether the tilts should be sorted from min to max tilt angle. Defaults to True.
- Returns:
- ndarray
The tilt angles (in degrees) in the form of a numpy array.
- Raises:
- ValueError
If the input_tlt is neither a numpy array nor a valid file path. If the input_tlt is an empty numpy array or an empty list.
- cryocat.utils.ioutils.total_dose_load(input_dose, sort_mdoc=True)#
Load total dose for single tilt series that should be used for dose-filtering/weighting.
- Parameters:
- input_dose: str or array-like
The input dose. If ndarray, it is returned as is. If str, it can be a path to a csv, xml (warp), mdoc or a file with one value per line for each tilt image in the tilt series (any extension, typically .txt). The values should correspond to the total dose applied to each tilt image (i.e., low values for tilts acquired first, large values for the tilt images acquired last). If an mdoc file is used, the total dose is corrected either as PriorRecordDose + ExposureDose for each image, or as ExposureDose * (ZValue + 1) (starting from 1). The latter works only if the ZValue corresponds to the order of acquisition, i.e., for tilt series that are not sorted from min to max tilt angle or are sorted with their ZValue unchanged.
- sort_mdoc: bool, default=True
Whether the mdoc should be sorted by the tilt angles. This parameter is relevant only if the provided input is an mdoc file. If True, the mdoc will be sorted from min to max tilt angle; the ZValue is kept as it was so the dose can still be computed correctly. Defaults to True.
- Returns:
- numpy.ndarray
The corrected dose.
- Raises:
- ValueError
If the input dose is neither an ndarray nor a valid path to a file with the total dose.
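The two mdoc-based formulas can be compared on a toy example. Each tilt image is represented here by a dict with the mdoc keys named above; that representation, the helper name `total_dose_from_entries`, and the rule of preferring PriorRecordDose whenever it is present are assumptions made for this sketch.

```python
def total_dose_from_entries(entries):
    """Hypothetical helper: accumulated dose per tilt image from mdoc-like entries."""
    doses = []
    for entry in entries:
        if "PriorRecordDose" in entry:
            # Dose received before this image plus the dose of the image itself.
            doses.append(entry["PriorRecordDose"] + entry["ExposureDose"])
        else:
            # Valid only if ZValue reflects the acquisition order (0, 1, 2, ...).
            doses.append(entry["ExposureDose"] * (entry["ZValue"] + 1))
    return doses


by_zvalue = [{"ZValue": z, "ExposureDose": 3.0} for z in range(3)]
with_prior = [
    {"ZValue": 0, "ExposureDose": 3.0, "PriorRecordDose": 0.0},
    {"ZValue": 1, "ExposureDose": 3.0, "PriorRecordDose": 3.0},
    {"ZValue": 2, "ExposureDose": 3.0, "PriorRecordDose": 6.0},
]
print(total_dose_from_entries(by_zvalue))
print(total_dose_from_entries(with_prior))
```

Both inputs describe the same constant-dose acquisition, so the two formulas agree here; they diverge once ZValue no longer matches the acquisition order.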
- cryocat.utils.ioutils.warp_ctf_read(input_file)#
Reads CTF parameters from a WARP XML file.
- Parameters:
- input_file: str
Path to the input WARP XML file.
- Returns:
- pandas.DataFrame
DataFrame containing the columns "defocus1", "defocus2", "astigmatism", "phase_shift", "defocus_mean". All defocus values are in micrometers. The phase shift is in radians.
- cryocat.utils.ioutils.z_shift_load(input_shift)#
Loads tomogram z-shift from a file, number or numpy.ndarray.
- Parameters:
- input_shift: str or number or pandas.DataFrame or array-like
Either a path to a file with the z-shift, a single number, a pandas.DataFrame, or a numpy.ndarray. If the z-shift should be loaded for more than one tomogram and differs between tomograms, the shape of the input should be Nx2, where N is the number of tomograms. The first column should contain the tomogram id and the second one the corresponding z-shift. If the input is an array stored in a file (typically with a .txt extension, although the extension does not matter), the file should have two values per line: tomo_id and z-shift, separated by one or more spaces. If the input is read from IMOD's .com files, the second value of the "SHIFT" parameter is used.
- Returns:
- pandas.DataFrame
Z-shift for a tomogram (with a single “z_shift” column) or for multiple tomograms (with columns “tomo_id”, “z_shift”).
- Raises:
- ValueError
Wrong size of the input, unsupported input type, or non-existent file path.