File I/O Utilities

FileIO

class ml4ir.base.io.file_io.FileIO(logger: Optional[logging.Logger] = None)

Bases: object

Abstract class defining the file I/O handler methods

Constructor method to create a FileIO handler object

Parameters:logger (Logger object, optional) – logging handler object to instantiate FileIO object with the ability to log progress updates
set_logger(logger: Optional[logging.Logger] = None)

Setter method to assign a logging handler to the FileIO object

Parameters:logger (Logger object, optional) – logging handler object to be used with the FileIO object to log progress updates
log(string, mode=20)

Log the specified string with the preset logging object at the specified level

Parameters:
  • string (str) – string text to be logged
  • mode (int, optional) – One of the supported logging message types. Currently supported values are logging.INFO, DEBUG, ERROR
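The default `mode=20` corresponds to `logging.INFO` from Python's standard `logging` module, whose levels are plain integers. A minimal sketch of the equivalence (the logger name below is illustrative):

```python
import logging

# Python's standard logging levels are integers; the default mode=20
# used by FileIO.log corresponds to logging.INFO.
print(logging.DEBUG)  # 10
print(logging.INFO)   # 20
print(logging.ERROR)  # 40

# A handler wired the way FileIO is expected to use it:
logger = logging.getLogger("ml4ir_example")
logger.setLevel(logging.DEBUG)
logger.log(logging.INFO, "Loading dataframe from disk...")
```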
make_directory(dir_path: str, clear_dir: bool = False) → str

Create directory structure specified recursively

Parameters:
  • dir_path (str) – path for the directory to be created
  • clear_dir (bool, optional) – whether to clear contents of an existing directory
Returns:

path to the directory created

Return type:

str

read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]

Load a pandas dataframe from a file

Parameters:
  • infile (str) – path to the csv input file
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from specified path

Return type:

pandas.DataFrame

read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame

Load a pandas dataframe from a list of files by concatenating the individual dataframes from each file

Parameters:
  • infiles (list of str) – list of paths to the csv input files
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from the specified paths

Return type:

pandas.DataFrame
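A rough equivalent of what `read_df_list` amounts to, sketched with pandas directly: read each CSV into a dataframe and concatenate the results. The file names and columns below are illustrative, not part of ml4ir:

```python
import os
import tempfile

import pandas as pd

# Write two small CSV files to a temporary directory (illustrative data).
tmpdir = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"query_id": [i], "score": [0.5 * i]}).to_csv(
        os.path.join(tmpdir, f"part_{i}.csv"), index=False
    )

# Load each file and concatenate the individual dataframes into one.
infiles = sorted(
    os.path.join(tmpdir, f) for f in os.listdir(tmpdir) if f.endswith(".csv")
)
df = pd.concat([pd.read_csv(f, sep=",") for f in infiles], ignore_index=True)
print(df.shape)  # (2, 2)
```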
write_df(df, outfile: str = None, sep: str = ', ', index: bool = True)

Write a pandas dataframe to a file

Parameters:
  • df (pandas.DataFrame) – dataframe to be written
  • outfile (str, optional) – path to the csv output file
  • sep (str, optional) – separator to use for writing the file
  • index (bool, optional) – boolean specifying if index should be saved
read_text_file(infile) → str

Read text file and return as string

Parameters:infile (str) – path to the text file
Returns:file contents as a string
Return type:str
read_json(infile) → dict

Read JSON file and return a python dictionary

Parameters:infile (str) – path to the json file
Returns:python dictionary loaded from JSON file
Return type:dict
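A minimal sketch of what `read_json` does, using only the standard library: parse a JSON file into a python dictionary. The config keys below are illustrative:

```python
import json
import tempfile

# Dump an illustrative dictionary to a temporary JSON file.
config = {"model": "ranking", "batch_size": 128}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    path = f.name

# Read it back as a python dictionary.
with open(path) as f:
    loaded = json.load(f)
print(loaded["batch_size"])  # 128
```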
read_yaml(infile) → dict

Read YAML file and return a python dictionary

Parameters:infile (str) – path to the YAML file
Returns:python dictionary loaded from YAML file
Return type:dict
write_json(json_dict: dict, outfile: str)

Write dictionary to a JSON file

Parameters:
  • json_dict (dict) – dictionary to be dumped to json file
  • outfile (str) – path to the output file
path_exists(path: str) → bool

Check if a file path exists

Parameters:path (str) – path to check for existence
Returns:True if path exists; False otherwise
Return type:bool
get_files_in_directory(indir: str, extension='.csv', prefix='')

Get list of files in a directory

Parameters:
  • indir (str) – input directory to search for files
  • extension (str, optional) – extension of the files to search for
  • prefix (str, optional) – string file name prefix to narrow search
Returns:

list of file path strings

Return type:

list of str
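A hedged sketch of the filtering that `get_files_in_directory` describes, written with `os.listdir`: keep only entries matching the extension and file-name prefix. The helper name and file names here are illustrative:

```python
import os
import tempfile

# Illustrative helper mirroring the documented behavior: filter directory
# entries by extension and file-name prefix, returning full paths.
def files_in_directory(indir, extension=".csv", prefix=""):
    return sorted(
        os.path.join(indir, f)
        for f in os.listdir(indir)
        if f.endswith(extension) and f.startswith(prefix)
    )

tmpdir = tempfile.mkdtemp()
for name in ["train_0.csv", "train_1.csv", "test_0.csv", "notes.txt"]:
    open(os.path.join(tmpdir, name), "w").close()

matches = files_in_directory(tmpdir, extension=".csv", prefix="train_")
print([os.path.basename(m) for m in matches])  # ['train_0.csv', 'train_1.csv']
```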

clear_dir(dir_path: str)

Clear contents of existing directory

Parameters:dir_path (str) – path to directory to be cleared
rm_dir(dir_path: str)

Delete existing directory

Parameters:dir_path (str) – path to directory to be removed
rm_file(file_path: str)

Delete existing file

Parameters:file_path (str) – path to file to be removed
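The directory lifecycle above (create recursively, clear contents, remove) can be sketched with the standard library directly; the paths below are illustrative, not ml4ir defaults:

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
dir_path = os.path.join(base, "models", "run_0")

# make_directory: create the directory structure recursively.
os.makedirs(dir_path, exist_ok=True)
open(os.path.join(dir_path, "weights.txt"), "w").close()

# clear_dir: remove contents but keep the directory.
shutil.rmtree(dir_path)
os.makedirs(dir_path)
print(os.listdir(dir_path))  # []

# rm_dir: delete the directory itself.
shutil.rmtree(dir_path)
print(os.path.exists(dir_path))  # False
```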

LocalIO

class ml4ir.base.io.local_io.LocalIO(logger: Optional[logging.Logger] = None)

Bases: ml4ir.base.io.file_io.FileIO

Class defining the file I/O handler methods for the local file system

Constructor method to create a FileIO handler object

Parameters:logger (Logger object, optional) – logging handler object to instantiate FileIO object with the ability to log progress updates
make_directory(dir_path: str, clear_dir: bool = False)

Create directory structure specified recursively

Parameters:
  • dir_path (str) – path for the directory to be created
  • clear_dir (bool, optional) – whether to clear contents of an existing directory
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]

Load a pandas dataframe from a file

Parameters:
  • infile (str) – path to the csv input file; can be hdfs path
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from file

Return type:

pandas.DataFrame

read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame

Load a pandas dataframe from a list of files

Parameters:
  • infiles (list of str) – paths to the csv input files; can be hdfs paths
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from file

Return type:

pd.DataFrame

write_df(df, outfile: str = None, sep: str = ', ', index: bool = True) → str

Write a pandas dataframe to a file

Parameters:
  • df (pandas.DataFrame) – dataframe to be written
  • outfile (str, optional) – path to the csv output file; can NOT be hdfs path currently
  • sep (str, optional) – separator to use for writing the file
  • index (bool, optional) – boolean specifying if index should be saved
Returns:

dataframe in csv form if outfile is None

Return type:

str
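The "return the CSV as a string when `outfile` is None" behavior is something pandas supports directly via `to_csv(None)`; a minimal sketch with illustrative data:

```python
import pandas as pd

# With a None path, to_csv returns the dataframe in CSV form as a string
# instead of writing to disk.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
csv_string = df.to_csv(None, sep=",", index=True)
print(csv_string.splitlines()[0])  # ",a,b"  (leading comma for the index column)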

read_json(infile) → dict

Read JSON file and return a python dictionary

Parameters:infile (str) – path to the json file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
read_yaml(infile) → dict

Read YAML file and return a python dictionary

Parameters:infile (str) – path to the YAML file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
write_json(json_dict: dict, outfile: str)

Write dictionary to a JSON file

Parameters:
  • json_dict (dict) – dictionary to be dumped to json file
  • outfile (str) – path to the output file
path_exists(path: str) → bool

Check if a path exists

Parameters:path (str) – path to check for existence
Returns:True if path exists; False otherwise
Return type:bool
get_files_in_directory(indir: str, extension='.csv', prefix='')

Get list of files in a directory

Parameters:
  • indir (str) – input directory to search for files
  • extension (str, optional) – extension of the files to search for
  • prefix (str, optional) – string file name prefix to narrow search
Returns:

list of file path strings

Return type:

list of str

clear_dir_contents(dir_path: str)

Clear contents of existing directory

Parameters:dir_path (str) – path to directory to be cleared
rm_dir(dir_path: str)

Delete existing directory

Parameters:dir_path (str) – path to directory to be removed
rm_file(file_path: str)

Delete existing file

Parameters:file_path (str) – path to file to be removed
save_numpy_array(np_array, file_path: str, allow_pickle=True, zip=True, **kwargs)

Save a numpy array to disk

Parameters:
  • np_array (numpy array or list of numpy arrays) – Array like numpy object to be saved
  • file_path (str) – file path to save the object to
  • allow_pickle (bool, optional) – Allow pickling of objects while saving
  • zip (bool, optional,) – use np.savez to save the numpy arrays, allows passing in python list

Notes

Used to save individual model layer weights for transfer learning.

If using zip=True, the np_array has to be a python list, since tensorflow layer weights are lists of arrays. np.save() cannot be used to save a list of numpy arrays directly, as it tries to convert the list into a single numpy array, which fails for arrays of mismatched shapes. np.savez saves each list item to a separate file within the archive and abstracts this step from the end user.
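The note above can be sketched with numpy directly: a round trip of a list of differently shaped arrays through `np.savez` and `np.load`, as `save_numpy_array`/`load_numpy_array` do with `zip=True`. The shapes below are illustrative stand-ins for layer weights:

```python
import os
import tempfile

import numpy as np

# Layer weights are lists of arrays with different shapes (e.g. a kernel
# and a bias); np.savez handles them where np.save would fail.
weights = [np.zeros((4, 8)), np.zeros(8)]

path = os.path.join(tempfile.mkdtemp(), "weights.npz")
np.savez(path, *weights)  # each array is stored as a separate entry

# Loading back: entries are named arr_0, arr_1, ... in saved order.
loaded = np.load(path, allow_pickle=True)
restored = [loaded[k] for k in sorted(loaded.files)]
print([a.shape for a in restored])  # [(4, 8), (8,)]
```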

load_numpy_array(file_path, allow_pickle=True, unzip=True, **kwargs)

Load a numpy array from disk

Parameters:
  • file_path (str) – file path to load the numpy object from
  • allow_pickle (bool, optional) – Allow pickling of objects while loading
  • unzip (bool, optional) – To unzip the numpy array saved as a zip file. Used when saved with zip=True
Returns:

python list of numpy arrays

Return type:

list of numpy arrays

Notes

Used to load individual model layer weights for transfer learning

SparkIO

class ml4ir.base.io.spark_io.SparkIO(logger: Optional[logging.Logger] = None)

Bases: ml4ir.base.io.file_io.FileIO

Class defining the file I/O handler methods for the HDFS file system using spark

Constructor method to create a FileIO handler object and set up spark session and hadoop file system handlers

Parameters:logger (Logger object, optional) – logging handler object to instantiate FileIO object with the ability to log progress updates
get_path_from_str(file_path: str)

Get Path object from string

Parameters:file_path (str) – string file path
Returns:Hadoop Path object
Return type:hadoop path
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]

Load a pandas dataframe from a file

Parameters:
  • infile (str) – path to the csv input file; can be hdfs path
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from file

Return type:

pandas.DataFrame

read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame

Load a pandas dataframe from a list of files

Parameters:
  • infiles (list of str) – paths to the csv input files; can be hdfs paths
  • sep (str, optional) – separator to use for loading file
  • index_col (int, optional) – column to be used as index
Returns:

pandas dataframe loaded from list of files

Return type:

pandas.DataFrame

Notes

sep and index_col are not used in SparkIO

read_text_file(infile) → str

Read text file and return as string

Parameters:infile (str) – path to the text file
Returns:file contents as a string
Return type:str
read_json(infile) → dict

Read JSON file and return a python dictionary

Parameters:infile (str) – path to the json file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
read_yaml(infile) → dict

Read YAML file and return a python dictionary

Parameters:infile (str) – path to the YAML file; can be hdfs path
Returns:python dictionary loaded from file
Return type:dict
path_exists(path: str) → bool

Check if a path exists

Parameters:path (str) – path to check for existence
Returns:True if path exists; False otherwise
Return type:bool
rm_dir(dir_path: str)

Delete existing directory

Parameters:dir_path (str) – path to directory to be removed
rm_file(file_path: str)

Delete existing file

Parameters:file_path (str) – path to file to be removed
copy_from_hdfs(src: str, dest: str)

Copy a directory/file from HDFS to local filesystem

Parameters:
  • src (str) – string path to source (on HDFS)
  • dest (str) – string path to destination (on local file system)
copy_to_hdfs(src: str, dest: str, overwrite=True)

Copy a directory/file to HDFS from local filesystem

Parameters:
  • src (str) – string path to source (on local file system)
  • dest (str) – string path to destination (on HDFS)
  • overwrite (bool, optional) – boolean to specify whether existing destination files should be overwritten