File I/O Utilities¶
FileIO¶
class ml4ir.base.io.file_io.FileIO(logger: Optional[logging.Logger] = None)¶
Bases: object
Abstract class defining the file I/O handler methods
Constructor method to create a FileIO handler object
Parameters: logger (Logger object, optional) – logging handler object to instantiate the FileIO object with the ability to log progress updates
set_logger(logger: Optional[logging.Logger] = None)¶
Setter method to assign a logging handler to the FileIO object
Parameters: logger (Logger object, optional) – logging handler object to be used with the FileIO object to log progress updates
log(string, mode=20)¶
Write the specified string with the preset logging object using the specified mode
Parameters: - string (str) – string text to be logged
- mode (int, optional) – One of the supported logging message types. Currently supported values are logging.INFO, DEBUG, ERROR
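The mode values are the standard integer levels from Python's logging module; the default of 20 corresponds to logging.INFO. A minimal sketch of how such a call maps onto the standard library (the logger name here is arbitrary):

```python
import logging

# The numeric levels referenced by the `mode` parameter;
# 20 is the default used by log()
assert logging.DEBUG == 10
assert logging.INFO == 20
assert logging.ERROR == 40

logger = logging.getLogger("ml4ir_demo")
logger.setLevel(logging.DEBUG)

# Equivalent of a log(string, mode=20) call:
logger.log(20, "reading input files...")  # same as logger.info(...)
```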
make_directory(dir_path: str, clear_dir: bool = False) → str¶
Create the specified directory structure recursively
Parameters: - dir_path (str) – path for directory to be created
- clear_dir (bool, optional) – clear contents of existing directory
Returns: path to the directory created
Return type: str
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]¶
Load a pandas dataframe from a file
Parameters: - infile (str) – path to the csv input file
- sep (str, optional) – separator to use for loading file
- index_col (int, optional) – column to be used as index
Returns: pandas dataframe loaded from specified path
Return type: pandas.DataFrame
read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame¶
Load a pandas dataframe from a list of files by concatenating the individual dataframes from each file
Parameters: - infiles (list of str) – list of paths to the csv input files
- sep (str, optional) – separator to use for loading file
- index_col (int, optional) – column to be used as index
Returns: pandas dataframe loaded from the specified paths
Return type: pandas.DataFrame
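Conceptually, loading and concatenating a list of CSV files can be sketched with plain pandas (an illustration of the behavior, not ml4ir's actual implementation; the in-memory "files" stand in for real paths):

```python
from io import StringIO
import pandas as pd

# Two stand-in "files" holding CSV content
files = [StringIO("a,b\n1,2\n"), StringIO("a,b\n3,4\n")]

# Load each file into a dataframe and concatenate them,
# mirroring what read_df_list does for a list of paths
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(df.shape)  # (2, 2)
```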
write_df(df, outfile: str = None, sep: str = ', ', index: bool = True)¶
Write a pandas dataframe to a file
Parameters: - df (pandas.DataFrame) – dataframe to be written
- outfile (str, optional) – path to the csv output file
- sep (str, optional) – separator to use for writing the file
- index (bool, optional) – boolean specifying if index should be saved
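A write/read round trip can be sketched with pandas directly (the file name and separator here are arbitrary choices for illustration):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"query": ["q1", "q2"], "score": [0.9, 0.7]})
outfile = os.path.join(tempfile.mkdtemp(), "data.csv")

# Write with the index saved (index=True), then read it back,
# using the first column as the index (index_col=0)
df.to_csv(outfile, sep=",", index=True)
df2 = pd.read_csv(outfile, sep=",", index_col=0)
assert df.equals(df2)
```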
read_text_file(infile) → str¶
Read a text file and return its contents as a string
Parameters: infile (str) – path to the text file
Returns: file contents as a string
Return type: str
read_json(infile) → dict¶
Read a JSON file and return a python dictionary
Parameters: infile (str) – path to the json file
Returns: python dictionary loaded from the JSON file
Return type: dict
read_yaml(infile) → dict¶
Read a YAML file and return a python dictionary
Parameters: infile (str) – path to the YAML file
Returns: python dictionary loaded from the YAML file
Return type: dict
write_json(json_dict: dict, outfile: str)¶
Write a dictionary to a JSON file
Parameters: - json_dict (dict) – dictionary to be dumped to json file
- outfile (str) – path to the output file
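The JSON read/write pair behaves like the standard library's json.dump/json.load; a minimal round-trip sketch (the file name and dictionary contents are illustrative):

```python
import json
import os
import tempfile

config = {"model": "ranking", "num_features": 10}
outfile = os.path.join(tempfile.mkdtemp(), "config.json")

# write_json: dump the dictionary to a JSON file
with open(outfile, "w") as f:
    json.dump(config, f, indent=4)

# read_json: load it back into a python dictionary
with open(outfile) as f:
    loaded = json.load(f)

assert loaded == config
```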
path_exists(path: str) → bool¶
Check if a file path exists
Parameters: path (str) – path to check
Returns: True if path exists; False otherwise
Return type: bool
get_files_in_directory(indir: str, extension='.csv', prefix='')¶
Get the list of files in a directory
Parameters: - indir (str) – input directory to search for files
- extension (str, optional) – extension of the files to search for
- prefix (str, optional) – string file name prefix to narrow search
Returns: list of file path strings
Return type: list of str
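The extension and prefix filtering can be sketched with the standard library (this stand-in function only illustrates the matching behavior, not ml4ir's implementation):

```python
import os
import tempfile

def get_files_in_directory(indir, extension=".csv", prefix=""):
    """Return sorted paths of files in `indir` matching extension and prefix."""
    return sorted(
        os.path.join(indir, f)
        for f in os.listdir(indir)
        if f.endswith(extension) and f.startswith(prefix)
    )

indir = tempfile.mkdtemp()
for name in ["train.csv", "test.csv", "notes.txt"]:
    open(os.path.join(indir, name), "w").close()

matches = get_files_in_directory(indir, extension=".csv", prefix="t")
# train.csv and test.csv match; notes.txt is filtered out by extension
```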
clear_dir(dir_path: str)¶
Clear contents of an existing directory
Parameters: dir_path (str) – path to directory to be cleared
rm_dir(dir_path: str)¶
Delete an existing directory
Parameters: dir_path (str) – path to directory to be removed
rm_file(file_path: str)¶
Delete an existing file
Parameters: file_path (str) – path to file to be removed
LocalIO¶
class ml4ir.base.io.local_io.LocalIO(logger: Optional[logging.Logger] = None)¶
Bases: ml4ir.base.io.file_io.FileIO
Class defining the file I/O handler methods for the local file system
Constructor method to create a FileIO handler object
Parameters: logger (Logger object, optional) – logging handler object to instantiate the FileIO object with the ability to log progress updates
make_directory(dir_path: str, clear_dir: bool = False)¶
Create the specified directory structure recursively
Parameters: - dir_path (str) – path for directory to be created
- clear_dir (bool, optional) – clear contents of existing directory
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]¶
Load a pandas dataframe from a file
Parameters: - infile (str) – path to the csv input file; can be hdfs path
- sep (str, optional) – separator to use for loading file
- index_col (int, optional) – column to be used as index
Returns: pandas dataframe loaded from file
Return type: pandas.DataFrame
read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame¶
Load a pandas dataframe from a list of files
Parameters: - infiles (list of str) – paths to the csv input files; can be hdfs paths
- sep (str, optional) – separator to use for loading file
- index_col (int, optional) – column to be used as index
Returns: pandas dataframe loaded from file
Return type: pandas.DataFrame
write_df(df, outfile: str = None, sep: str = ', ', index: bool = True) → str¶
Write a pandas dataframe to a file
Parameters: - df (pandas.DataFrame) – dataframe to be written
- outfile (str, optional) – path to the csv output file; can NOT be a hdfs path currently
- sep (str, optional) – separator to use for writing the file
- index (bool, optional) – boolean specifying if index should be saved
Returns: dataframe in csv form if outfile is None
Return type: str
read_json(infile) → dict¶
Read a JSON file and return a python dictionary
Parameters: infile (str) – path to the json file; can be a hdfs path
Returns: python dictionary loaded from the file
Return type: dict
read_yaml(infile) → dict¶
Read a YAML file and return a python dictionary
Parameters: infile (str) – path to the YAML file; can be a hdfs path
Returns: python dictionary loaded from the file
Return type: dict
write_json(json_dict: dict, outfile: str)¶
Write a dictionary to a JSON file
Parameters: - json_dict (dict) – dictionary to be dumped to json file
- outfile (str) – path to the output file
path_exists(path: str) → bool¶
Check if a path exists
Parameters: path (str) – path to check
Returns: True if path exists; False otherwise
Return type: bool
get_files_in_directory(indir: str, extension='.csv', prefix='')¶
Get the list of files in a directory
Parameters: - indir (str) – input directory to search for files
- extension (str, optional) – extension of the files to search for
- prefix (str, optional) – string file name prefix to narrow search
Returns: list of file path strings
Return type: list of str
clear_dir_contents(dir_path: str)¶
Clear contents of an existing directory
Parameters: dir_path (str) – path to directory to be cleared
rm_dir(dir_path: str)¶
Delete an existing directory
Parameters: dir_path (str) – path to directory to be removed
rm_file(file_path: str)¶
Delete an existing file
Parameters: file_path (str) – path to file to be removed
save_numpy_array(np_array, file_path: str, allow_pickle=True, zip=True, **kwargs)¶
Save a numpy array to disk
Parameters: - np_array (numpy array or list of numpy arrays) – Array like numpy object to be saved
- file_path (str) – file path to save the object to
- allow_pickle (bool, optional) – Allow pickling of objects while saving
- zip (bool, optional) – use np.savez to save the numpy arrays; allows passing in a python list
Notes
Used to save individual model layer weights for transfer learning.
If using zip=True, np_array has to be a python list, since tensorflow layer weights are lists of arrays. np.save() can not be used to save a list of numpy arrays directly, as it tries to convert the list into a single numpy array, leading to shape errors. np.savez saves each list item to a separate file within the archive and abstracts this step away from the end user.
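The zip=True behavior can be sketched with numpy directly: a ragged list of arrays (as layer weights typically are) goes through np.savez, producing one archive entry per array (the shapes and file name below are illustrative):

```python
import os
import tempfile
import numpy as np

# A ragged list of arrays, like a Dense layer's [kernel, bias] weights;
# np.save() would fail to stack these into one array
weights = [np.ones((3, 4)), np.zeros((4,))]

file_path = os.path.join(tempfile.mkdtemp(), "weights.npz")
# Each list item becomes a separate entry (arr_0, arr_1, ...) in the archive
np.savez(file_path, *weights)

with np.load(file_path) as data:
    loaded = [data[key] for key in sorted(data.files)]

assert [a.shape for a in loaded] == [(3, 4), (4,)]
```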
load_numpy_array(file_path, allow_pickle=True, unzip=True, **kwargs)¶
Load a numpy array from disk
Parameters: - file_path (str) – file path to load the numpy object from
- allow_pickle (bool, optional) – Allow pickling of objects while loading
- unzip (bool, optional) – whether to unzip the numpy array saved as a zip file; used when saved with zip=True
Returns: python list of numpy arrays
Return type: list of numpy arrays
Notes
Used to load individual model layer weights for transfer learning
SparkIO¶
class ml4ir.base.io.spark_io.SparkIO(logger: Optional[logging.Logger] = None)¶
Bases: ml4ir.base.io.file_io.FileIO
Class defining the file I/O handler methods for the HDFS file system using spark
Constructor method to create a FileIO handler object and set up the spark session and hadoop file system handlers
Parameters: logger (Logger object, optional) – logging handler object to instantiate the FileIO object with the ability to log progress updates
get_path_from_str(file_path: str)¶
Get a Hadoop Path object from a string
Parameters: file_path (str) – string file path
Returns: Hadoop Path object
Return type: hadoop path
read_df(infile: str, sep: str = ', ', index_col: int = None, **kwargs) → Optional[pandas.core.frame.DataFrame]¶
Load a pandas dataframe from a file
Parameters: - infile (str) – path to the csv input file; can be hdfs path
- sep (str, optional) – separator to use for loading file
- index_col (int, optional) – column to be used as index
Returns: pandas dataframe loaded from file
Return type: pandas.DataFrame
read_df_list(infiles, sep=', ', index_col=None, **kwargs) → pandas.core.frame.DataFrame¶
Load a pandas dataframe from a list of files
Parameters: - infiles (list of str) – paths to the csv input files; can be hdfs paths
- sep (str, optional) – separator to use for loading file
- index_col (int, optional) – column to be used as index
Returns: pandas dataframe loaded from list of files
Return type: pandas.DataFrame
Notes
sep and index_col are not used in SparkIO
read_text_file(infile) → str¶
Read a text file and return its contents as a string
Parameters: infile (str) – path to the text file
Returns: file contents as a string
Return type: str
read_json(infile) → dict¶
Read a JSON file and return a python dictionary
Parameters: infile (str) – path to the json file; can be a hdfs path
Returns: python dictionary loaded from the file
Return type: dict
read_yaml(infile) → dict¶
Read a YAML file and return a python dictionary
Parameters: infile (str) – path to the YAML file; can be a hdfs path
Returns: python dictionary loaded from the file
Return type: dict
path_exists(path: str) → bool¶
Check if a path exists
Parameters: path (str) – path to check
Returns: True if path exists; False otherwise
Return type: bool
rm_dir(dir_path: str)¶
Delete an existing directory
Parameters: dir_path (str) – path to directory to be removed
rm_file(file_path: str)¶
Delete an existing file
Parameters: file_path (str) – path to file to be removed
copy_from_hdfs(src: str, dest: str)¶
Copy a directory/file from HDFS to the local filesystem
Parameters: - src (str) – string path to source (on HDFS)
- dest (str) – string path to destination (on local file system)
copy_to_hdfs(src: str, dest: str, overwrite=True)¶
Copy a directory/file to HDFS from the local filesystem
Parameters: - src (str) – string path to source (on local file system)
- dest (str) – string path to destination (on HDFS)
- overwrite (bool, optional) – boolean to specify whether existing destination files should be overwritten