File I/O Utilities

FileIO

class ml4ir.base.io.file_io.FileIO(logger: Optional[logging.Logger] = None)

Bases: object

Abstract class defining the file I/O handler methods

Constructor method to create a FileIO handler object

Parameters:logger (Logger object, optional) – logging handler object to instantiate FileIO object with the ability to log progress updates
set_logger(logger: Optional[logging.Logger] = None)

Setter method to assign a logging handler to the FileIO object

Parameters:logger (Logger object, optional) – logging handler object to be used with the FileIO object to log progress updates
log(string, mode=20)

Log the specified string with the preset logging object at the specified level

Parameters:
  • string (str) – string text to be logged
  • mode (int, optional) – One of the supported logging message types. Currently supported values are logging.INFO, DEBUG, ERROR
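The default `mode=20` corresponds to `logging.INFO` from Python's standard `logging` module, whose levels are plain integers. A minimal sketch of the equivalence (the logger name below is illustrative):

```python
import logging

# Python's standard logging levels are integers; the default mode=20
# used by FileIO.log corresponds to logging.INFO.
print(logging.DEBUG)  # 10
print(logging.INFO)   # 20
print(logging.ERROR)  # 40

# A handler wired the way FileIO is expected to use it:
logger = logging.getLogger("ml4ir_example")
logger.setLevel(logging.DEBUG)
logger.log(logging.INFO, "Loading dataframe from disk...")
```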
make_directory(dir_path: str, clear_dir: bool = False) → str

Create directory structure specified recursively

Parameters:
  • dir_path (str) – path for the directory to be created
  • clear_dir (bool, optional) – whether to clear contents of an existing directory
Returns:

path to the directory created

Return type:

str

read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]

Load a pandas dataframe from a file

Parameters:
  • infile (str) – path to the csv input file
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from specified path

Return type:

pandas.DataFrame

read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame

Load a pandas dataframe from a list of files by concatenating the individual dataframes from each file

Parameters:
  • infiles (list of str) – list of paths to the csv input files
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from the specified paths

Return type:

pandas.DataFrame
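A rough equivalent of what `read_df_list` amounts to, sketched with pandas directly: read each CSV into a dataframe and concatenate the results. The file names and columns below are illustrative, not part of ml4ir:

```python
import os
import tempfile

import pandas as pd

# Write two small CSV files to a temporary directory (illustrative data).
tmpdir = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"query_id": [i], "score": [0.5 * i]}).to_csv(
        os.path.join(tmpdir, f"part_{i}.csv"), index=False
    )

# Load each file and concatenate the individual dataframes into one.
infiles = sorted(
    os.path.join(tmpdir, f) for f in os.listdir(tmpdir) if f.endswith(".csv")
)
df = pd.concat([pd.read_csv(f, sep=",") for f in infiles], ignore_index=True)
print(df.shape)  # (2, 2)
```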
write_df(df, outfile: str = None, sep: str = ', ', index: bool = True)

Write a pandas dataframe to a file

Parameters:
  • df (pandas.DataFrame) – dataframe to be written
  • outfile (str, optional) – path to the csv output file
  • sep (str, optional) – separator to use for writing the file
  • index (bool, optional) – boolean specifying if index should be saved
read_text_file(infile) → str

Read text file and return as string

Parameters:infile (str) – path to the text file
Returns:file contents as a string
Return type:str
read_json(infile) → dict

Read JSON file and return a python dictionary

Parameters:infile (str) – path to the json file
Returns:python dictionary loaded from JSON file
Return type:dict
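A minimal sketch of what `read_json` does, using only the standard library: parse a JSON file into a python dictionary. The config keys below are illustrative:

```python
import json
import tempfile

# Dump an illustrative dictionary to a temporary JSON file.
config = {"model": "ranking", "batch_size": 128}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    path = f.name

# Read it back as a python dictionary.
with open(path) as f:
    loaded = json.load(f)
print(loaded["batch_size"])  # 128
```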
read_yaml(infile) → dict

Read YAML file and return a python dictionary

Parameters:infile (str) – path to the YAML file
Returns:python dictionary loaded from YAML file
Return type:dict
write_json(json_dict: dict, outfile: str)

Write dictionary to a JSON file

Parameters:
  • json_dict (dict) – dictionary to be dumped to json file
  • outfile (str) – path to the output file
path_exists(path: str) → bool

Check if a file path exists

Parameters:path (str) – path to check for existence
Returns:True if path exists; False otherwise
Return type:bool
get_files_in_directory(indir: str, extension='.csv', prefix='')

Get list of files in a directory

Parameters:
  • indir (str) – input directory to search for files
  • extension (str, optional) – extension of the files to search for
  • prefix (str, optional) – string file name prefix to narrow search
Returns:

list of file path strings

Return type:

list of str
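A hedged sketch of the filtering that `get_files_in_directory` describes, written with `os.listdir`: keep only entries matching the extension and file-name prefix. The helper name and file names here are illustrative:

```python
import os
import tempfile

# Illustrative helper mirroring the documented behavior: filter directory
# entries by extension and file-name prefix, returning full paths.
def files_in_directory(indir, extension=".csv", prefix=""):
    return sorted(
        os.path.join(indir, f)
        for f in os.listdir(indir)
        if f.endswith(extension) and f.startswith(prefix)
    )

tmpdir = tempfile.mkdtemp()
for name in ["train_0.csv", "train_1.csv", "test_0.csv", "notes.txt"]:
    open(os.path.join(tmpdir, name), "w").close()

matches = files_in_directory(tmpdir, extension=".csv", prefix="train_")
print([os.path.basename(m) for m in matches])  # ['train_0.csv', 'train_1.csv']
```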

clear_dir(dir_path: str)

Clear contents of existing directory

Parameters:dir_path (str) – path to directory to be cleared
rm_dir(dir_path: str)

Delete existing directory

Parameters:dir_path (str) – path to directory to be removed
rm_file(file_path: str)

Delete existing file

Parameters:file_path (str) – path to file to be removed
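The directory lifecycle above (create recursively, clear contents, remove) can be sketched with the standard library directly; the paths below are illustrative, not ml4ir defaults:

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
dir_path = os.path.join(base, "models", "run_0")

# make_directory: create the directory structure recursively.
os.makedirs(dir_path, exist_ok=True)
open(os.path.join(dir_path, "weights.txt"), "w").close()

# clear_dir: remove contents but keep the directory.
shutil.rmtree(dir_path)
os.makedirs(dir_path)
print(os.listdir(dir_path))  # []

# rm_dir: delete the directory itself.
shutil.rmtree(dir_path)
print(os.path.exists(dir_path))  # False
```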

LocalIO

class ml4ir.base.io.local_io.LocalIO(logger: Optional[logging.Logger] = None)

Bases: ml4ir.base.io.file_io.FileIO

Class defining the file I/O handler methods for the local file system

Constructor method to create a FileIO handler object

Parameters:logger (Logger object, optional) – logging handler object to instantiate FileIO object with the ability to log progress updates
make_directory(dir_path: str, clear_dir: bool = False)

Create directory structure specified recursively

Parameters:
  • dir_path (str) – path for the directory to be created
  • clear_dir (bool, optional) – whether to clear contents of an existing directory
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]

Load a pandas dataframe from a file

Parameters:
  • infile (str) – path to the csv input file; can be hdfs path
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from file

Return type:

pandas.DataFrame

read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame

Load a pandas dataframe from a list of files

Parameters:
  • infiles (list of str) – paths to the csv input files; can be hdfs paths
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from file

Return type:

pd.DataFrame

write_df(df, outfile: str = None, sep: str = ', ', index: bool = True) → str

Write a pandas dataframe to a file

Parameters:
  • df (pandas.DataFrame) – dataframe to be written
  • outfile (str, optional) – path to the csv output file; can NOT be hdfs path currently
  • sep (str, optional) – separator to use for writing the file
  • index (bool, optional) – boolean specifying if index should be saved
Returns:

dataframe in csv form if outfile is None

Return type:

str
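The "return the CSV as a string when `outfile` is None" behavior is something pandas supports directly via `to_csv(None)`; a minimal sketch with illustrative data:

```python
import pandas as pd

# With a None path, to_csv returns the dataframe in CSV form as a string
# instead of writing to disk.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
csv_string = df.to_csv(None, sep=",", index=True)
print(csv_string.splitlines()[0])  # ",a,b"  (leading comma for the index column)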

read_json(infile) → dict

Read JSON file and return a python dictionary

Parameters:infile (str) – path to the json file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
read_yaml(infile) → dict

Read YAML file and return a python dictionary

Parameters:infile (str) – path to the YAML file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
write_json(json_dict: dict, outfile: str)

Write dictionary to a JSON file

Parameters:
  • json_dict (dict) – dictionary to be dumped to json file
  • outfile (str) – path to the output file
path_exists(path: str) → bool

Check if a path exists

Parameters:path (str) – path to check for existence
Returns:True if path exists; False otherwise
Return type:bool
get_files_in_directory(indir: str, extension='.csv', prefix='')

Get list of files in a directory

Parameters:
  • indir (str) – input directory to search for files
  • extension (str, optional) – extension of the files to search for
  • prefix (str, optional) – string file name prefix to narrow search
Returns:

list of file path strings

Return type:

list of str

clear_dir_contents(dir_path: str)

Clear contents of existing directory

Parameters:dir_path (str) – path to directory to be cleared
rm_dir(dir_path: str)

Delete existing directory

Parameters:dir_path (str) – path to directory to be removed
rm_file(file_path: str)

Delete existing file

Parameters:file_path (str) – path to file to be removed
save_numpy_array(np_array, file_path: str, allow_pickle=True, zip=True, **kwargs)

Save a numpy array to disk

Parameters:
  • np_array (numpy array or list of numpy arrays) – Array like numpy object to be saved
  • file_path (str) – file path to save the object to
  • allow_pickle (bool, optional) – Allow pickling of objects while saving
  • zip (bool, optional,) – use np.savez to save the numpy arrays, allows passing in python list

Notes

Used to save individual model layer weights for transfer learning.

If using zip=True, the np_array has to be a python list, since tensorflow layer weights are lists of arrays. np.save() cannot be used to save a list of numpy arrays directly, as it tries to convert the list into a single numpy array, which fails for arrays of mismatched shapes. np.savez saves each list item to a separate file within the archive and abstracts this step from the end user.
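The note above can be sketched with numpy directly: a round trip of a list of differently shaped arrays through `np.savez` and `np.load`, as `save_numpy_array`/`load_numpy_array` do with `zip=True`. The shapes below are illustrative stand-ins for layer weights:

```python
import os
import tempfile

import numpy as np

# Layer weights are lists of arrays with different shapes (e.g. a kernel
# and a bias); np.savez handles them where np.save would fail.
weights = [np.zeros((4, 8)), np.zeros(8)]

path = os.path.join(tempfile.mkdtemp(), "weights.npz")
np.savez(path, *weights)  # each array is stored as a separate entry

# Loading back: entries are named arr_0, arr_1, ... in saved order.
loaded = np.load(path, allow_pickle=True)
restored = [loaded[k] for k in sorted(loaded.files)]
print([a.shape for a in restored])  # [(4, 8), (8,)]
```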

load_numpy_array(file_path, allow_pickle=True, unzip=True, **kwargs)

Load a numpy array from disk

Parameters:
  • file_path (str) – file path to load the numpy object from
  • allow_pickle (bool, optional) – Allow pickling of objects while loading
  • unzip (bool, optional) – To unzip the numpy array saved as a zip file. Used when saved with zip=True
Returns:

python list of numpy arrays

Return type:

list of numpy arrays

Notes

Used to load individual model layer weights for transfer learning

SparkIO

class ml4ir.base.io.spark_io.SparkIO(logger: Optional[logging.Logger] = None)

Bases: ml4ir.base.io.file_io.FileIO

Class defining the file I/O handler methods for the HDFS file system using spark

Constructor method to create a FileIO handler object and set up spark session and hadoop file system handlers

Parameters:logger (Logger object, optional) – logging handler object to instantiate FileIO object with the ability to log progress updates
get_path_from_str(file_path: str)

Get Path object from string

Parameters:file_path (str) – string file path
Returns:Hadoop Path object
Return type:hadoop path
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]

Load a pandas dataframe from a file

Parameters:
  • infile (str) – path to the csv input file; can be hdfs path
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from file

Return type:

pandas.DataFrame

read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame

Load a pandas dataframe from a list of files

Parameters:
  • infiles (list of str) – paths to the csv input files; can be hdfs paths
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from list of files

Return type:

pandas.DataFrame

Notes

sep and index_col are not used in SparkIO

read_text_file(infile) → str

Read text file and return as string

Parameters:infile (str) – path to the text file
Returns:file contents as a string
Return type:str
read_json(infile) → dict

Read JSON file and return a python dictionary

Parameters:infile (str) – path to the json file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
read_yaml(infile) → dict

Read YAML file and return a python dictionary

Parameters:infile (str) – path to the YAML file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
path_exists(path: str) → bool

Check if a path exists

Parameters:path (str) – path to check for existence
Returns:True if path exists; False otherwise
Return type:bool
rm_dir(dir_path: str)

Delete existing directory

Parameters:dir_path (str) – path to directory to be removed
rm_file(file_path: str)

Delete existing file

Parameters:file_path (str) – path to file to be removed
copy_from_hdfs(src: str, dest: str)

Copy a directory/file from HDFS to local filesystem

Parameters:
  • src (str) – string path to source (on HDFS)
  • dest (str) – string path to destination (on local file system)
copy_to_hdfs(src: str, dest: str, overwrite=True)

Copy a directory/file to HDFS from local filesystem

Parameters:
  • src (str) – string path to source (on local file system)
  • dest (str) – string path to destination (on HDFS)
  • overwrite (bool, optional) – boolean to specify whether existing destination files should be overwritten