File Registry Entry Reference

What are file registry entries?

A file registry entry is initialized with a Hash of data about a file. Depending on the details, this allows a given entry to be:

  • the destination of one job;
  • a source for another job; and
  • a lookup for yet another job

Creating file registry entries and referring to them when setting up your project’s jobs has two major benefits:

  • a given entry may be reused in different ways, in many jobs, without having to specify details about how to find and read/write the associated file each time it is used
  • Kiba::Extend can generate all non-supplied dependencies for any job automatically

Types of file registry entries

  • supplied entries: These entries represent files that are not created by jobs in your project application. Indicate that an entry is supplied by including supplied: true in the registry entry hash. Supplied entries can be used as sources and, depending on the type of file, as lookups in jobs. They cannot be used as destinations for jobs, since, by definition, files created by jobs are not supplied.
  • job entries: These entries represent files that are output as the destination by a job in your project application. A job entry hash must have a creator key, indicating the job that creates the file. The job pointed at by an entry’s creator must have that entry as its destination in the file config of the job.
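
As a rough illustration of the two entry types, the sketch below builds one Hash of each kind and classifies it using only the :supplied and :creator keys described above. The Project::Objects module, the paths, and the entry_type helper are hypothetical stand-ins for illustration; they are not part of Kiba::Extend.

```ruby
# Hypothetical stand-in for a job module. A real job method would return a
# Kiba::Extend job rather than a Symbol.
module Project
  module Objects
    module_function

    def job
      :placeholder_job
    end
  end
end

# Supplied entry: the file comes from outside the project, so no creator.
supplied_entry = {
  path: "/project/clientData/objects.csv",
  supplied: true
}

# Job entry: the file is written by a job, so a creator is required.
job_entry = {
  path: "/project/working/objects_prep.csv",
  creator: Project::Objects
}

# Hypothetical helper (not a Kiba::Extend method) that classifies an entry Hash.
def entry_type(entry)
  return :supplied if entry[:supplied]
  return :job if entry.key?(:creator)

  :invalid
end

puts entry_type(supplied_entry) # prints "supplied"
puts entry_type(job_entry)      # prints "job"
```

An entry with neither key falls through to :invalid in this sketch, mirroring the rule that a non-supplied entry needs a creator.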

File registry entries in your ETL application

File registry entries are defined as Ruby Hashes in your ETL application.

In the most basic Kiba::Extend project, these hashes will be manually entered in the YourProject::RegistryData.register_files method.

You can also write code to dynamically generate registry entry Hashes that follow a pattern. The kiba-extend-project repository provides some examples. See also common_patterns_tips_tricks.

Kiba::Extend converts these Hashes to Kiba::Extend::Registry::FileRegistryEntry objects when you call Kiba::Extend::Registry::FileRegistry#finalize or Kiba::Extend::Registry::FileRegistry#transform on your project’s registry.

File registry Hash format

The allowable Hash keys, expected Hash value formats, and expectations about them are described below.

NOTE: (Since 3.0.0) For all keys besides :dest_special_opts, you may pass a Proc that returns the expected value format when called. For :dest_special_opts, you may pass Procs as individual values within the option Hash. This can be useful if you need to pass in a value that depends on other project config that may not be loaded or set up when the registry is initially populated. A publicly available example is in kiba-tms, which sets destination initial headers based on the preferred name field for a given TMS client project and on whether the client wants to include the “flipped” form as variant terms.
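
The deferred-evaluation idea in the note can be sketched in plain Ruby. Everything below is an assumption made for illustration: the Project.preferred_name_field setting, the paths, and the resolve helper are hypothetical, and Kiba::Extend's own handling of Proc values may differ in detail.

```ruby
# Hypothetical project setting that is only known after the registry Hash
# has been written.
module Project
  class << self
    attr_accessor :preferred_name_field
  end
end

# :path is a Proc for the whole value; :dest_special_opts holds a Proc as an
# individual value inside the option Hash.
reghash = {
  path: -> { File.join("/project", "working", "names.csv") },
  dest_special_opts: {
    initial_headers: -> { [Project.preferred_name_field, :variant_name] }
  }
}

# Hypothetical helper that mimics deferred evaluation: Procs are called,
# nested Hashes are resolved recursively, other values pass through.
def resolve(value)
  case value
  when Proc then value.call
  when Hash then value.transform_values { |v| resolve(v) }
  else value
  end
end

# The setting is assigned after the registry Hash was built...
Project.preferred_name_field = :display_name

# ...and is still picked up, because the Procs run only at resolution time.
resolved = resolve(reghash)
puts resolved[:path]
puts resolved[:dest_special_opts][:initial_headers].inspect
```

The point of the sketch is ordering: the Hash can be defined before the project config exists, as long as nothing calls the Procs until the config is in place.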

:path

[String] full or expandable relative path to the expected location of the file

  • default: nil
  • required if either :src_class or :dest_class requires a path
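
A small sketch of what an “expandable relative path” could look like, assuming it means a path that Ruby's File.expand_path can resolve to an absolute one on a Unix-like system. The directory names are hypothetical, and how Kiba::Extend itself expands paths is not shown in this reference.

```ruby
# Assumed project data directory (hypothetical).
datadir = "/project/data"

# A relative path is resolved against the given base directory.
relative = File.expand_path("working/objects_prep.csv", datadir)
puts relative

# A path that is already absolute ignores the base directory.
absolute = File.expand_path("/project/clientData/objects.csv", datadir)
puts absolute
```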

:src_class

[Class] the Ruby class used to read in data. This class must be defined in the Sources namespace or equivalent. Example: you should never use Kiba::Extend::Destinations::CSV as a :src_class value.

  • required, but default supplied if not given
  • default: value of Kiba::Extend.source (Kiba::Extend::Sources::CSV unless overridden by your ETL app)

:src_opt

[Hash] file options used when reading in source

  • required, but default supplied if not given
  • if :src_class is Kiba::Extend::Sources::CSV:
  • if :src_class is Kiba::Extend::Sources::Marc:
    • default: nil
    • A hash of keyword parameters defined for MARC::Reader can be entered, for example: {external_encoding: "MARC-8", internal_encoding: "UTF-16LE"}

:dest_class

[Class] the Ruby class used to write out the data. This class must be defined in the Destinations namespace or equivalent. Example: you should never use Kiba::Extend::Sources::CSV as a :dest_class value.

:dest_opt

[Hash] file options used when writing data

:dest_special_opts

[Hash] additional options for writing out the data

Examples:

reghash = {
  path: '/path/to/file.csv',
  dest_class: Kiba::Extend::Destinations::CSV,
  dest_special_opts: { initial_headers: %i[objectnumber briefdescription] }
}

reghash = {
  path: '/path/to/long_marc_records.mrc',
  dest_class: Kiba::Extend::Destinations::Marc,
  dest_special_opts: { allow_oversized: true }
}

:creator

[Method, Module, Hash] Ruby method that generates this file

  • Used to run ETL jobs to create necessary files, if said files do not exist
  • Not required at all if file is supplied
  • If the method that runs the job is a module method with the default job method name (:job unless overridden), the creator value can simply be the Module containing that method
  • Otherwise, the creator value must be a Method (Pattern: Class::Or::Module::ConstantName.method(:name_of_method))
  • Sometimes you may need to call a job with arguments. This may be particularly useful if the same job logic can be reused many times with slightly different parameters. In this case creator may be a Hash with callee and args keys

NOTE: The default value for Kiba::Extend.default_job_method_name is :job. You can override this in your project’s base file as follows (since 2.7.2):

Kiba::Extend.config.default_job_method_name = :whatever
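
To make the three creator forms concrete, here is a minimal plain-Ruby sketch of how a Module, a Method, or a Hash with callee and args could each be invoked. The call_creator dispatcher, the DEFAULT_JOB_METHOD constant, and the stub modules are hypothetical; this is not Kiba::Extend's actual implementation.

```ruby
# Stand-in for Kiba::Extend.default_job_method_name (assumed to be :job here).
DEFAULT_JOB_METHOD = :job

# Hypothetical job modules. Real ones would return Kiba::Extend jobs, not Strings.
module Project
  module Table
    module_function

    def job
      "Table.job"
    end

    def prep
      "Table.prep"
    end
  end

  module Extract
    module_function

    def job(type:)
      "Extract.job(#{type})"
    end
  end
end

# Hypothetical dispatcher illustrating how each creator form could be called.
def call_creator(creator)
  case creator
  when Hash
    callee = creator[:callee]
    # A Module callee is resolved to its default job method first.
    callee = callee.method(DEFAULT_JOB_METHOD) if callee.is_a?(Module)
    callee.call(**creator.fetch(:args, {}))
  when Module
    creator.public_send(DEFAULT_JOB_METHOD)
  when Method
    creator.call
  else
    raise ArgumentError, "unsupported creator: #{creator.inspect}"
  end
end

puts call_creator(Project::Table)               # prints "Table.job"
puts call_creator(Project::Table.method(:prep)) # prints "Table.prep"
puts call_creator({callee: Project::Extract, args: {type: "Genre"}}) # prints "Extract.job(Genre)"
```

The Hash form is the only one that carries arguments, which is why it suits job logic reused with slightly different parameters.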

Module creator example (since 2.7.2)

This is valid because the default :job method is present in the module:

# in job definitions
module Project
  module Table
    module_function

    def job
      Kiba::Extend::Jobs::Job.new(
        ...
      )
    end
  end
end

# in file registry
reghash = {
  path: '/project/working/objects_prep.csv',
  creator: Project::Table
}

Method creator example

Use this form when the default :job method is not present, or is not the method you need to call for this job.

# in job definitions
module Project
  module Table
    module_function

    def prep
      Kiba::Extend::Jobs::Job.new(
        ...
      )
    end
  end
end

# in file registry
reghash = {
  path: '/project/working/objects_prep.csv',
  creator: Project::Table.method(:prep)
}

Hash creator example (since 2.7.2)

Hash keys:

  • callee: Method or Module (as described above)
  • args: Hash of keyword arguments to pass to the callee

# in your project's registry_data.rb
module Project
  module RegistryData
    module_function

    def register
      register_lookups
      register_files
      Project.registry.transform
      Project.registry.freeze
    end

    def normalized_lookup_type(type)
      type.downcase
        .gsub(' ', '_')
        .gsub('/', '_')
    end

    def register_lookups
      types = [
        'Accession Review Decision', 'Accession Type', 'Account Codes', 'ArchSite', 'Box', 'Budget Code',
        'Building', 'CityState', 'Cleaning', 'Condition Picks', 'Contact Type', 'Count Unit', 'Creator Type',
        'Cultural Affiliation', 'Department Code', 'Digitize Parameters', 'Digitizing Hardware',
        'Digitizing Software', 'Disposal Type', 'Exhibit Type', 'Format/Type', 'Genre', 'Image Resolution',
        'In Exhibit', 'Insured By', 'Loan Purpose', 'Material', 'Mount', 'NAGPRA Type', 'Owner Type',
        'Region', 'Room', 'Server Path', 'Technique', 'Treatment', 'Value'
      ]

      # This section dynamically registers a job for each of the above `types` values
      Project.registry.namespace('lkup') do
        types.each do |type|
          register Project::RegistryData.normalized_lookup_type(type).to_sym, {
            path: File.join(Project.datadir, 'working', "#{Project::RegistryData.normalized_lookup_type(type)}.csv"),
            creator: {callee: Project::Main::Lookups::Extract, args: {type: type}},
            tags: %i[lkup],
            lookup_on: :lookupvalueid
          }
        end
      end
    end

    def register_files
      ...
    end
  end
end

# in job definitions
module Project
  module Main
    module Lookups
      module Extract
        module_function

        def job(type:)
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: :lkup__prep,
              destination: "lkup__#{Project::RegistryData.normalized_lookup_type(type)}".to_sym
            },
            transformer: xforms(type)
          )
        end

        def xforms(type)
          Kiba.job_segment do
            transform FilterRows::FieldEqualTo, action: :keep, field: :lookup_type, value: type
          end
        end
      end
    end
  end
end
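
To see what the dynamic registration above produces, the following sketch reruns the same normalization logic on three of the listed types and prints the registry key and file name each would get. The method is copied out of the module so the snippet runs on its own; the key format (namespace, double underscore, entry name) is inferred from the :lkup__prep source and the destination string in the example.

```ruby
# Same normalization logic as in the example above, copied so this snippet
# is runnable by itself.
def normalized_lookup_type(type)
  type.downcase
    .gsub(" ", "_")
    .gsub("/", "_")
end

["Accession Review Decision", "Format/Type", "NAGPRA Type"].each do |type|
  normalized = normalized_lookup_type(type)
  # Namespace and entry name joined with a double underscore, matching the
  # destination built in the Extract job.
  key = "lkup__#{normalized}".to_sym
  puts "#{type} -> #{key.inspect} (#{normalized}.csv)"
end
# prints:
#   Accession Review Decision -> :lkup__accession_review_decision (accession_review_decision.csv)
#   Format/Type -> :lkup__format_type (format_type.csv)
#   NAGPRA Type -> :lkup__nagpra_type (nagpra_type.csv)
```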

:supplied

[true, false] whether the file/data is supplied from outside the ETL

  • default: false
  • Manually set to true for:
    • original data files from client
    • mappings/reconciliations to be merged into the ETL/migration
    • any other files created external to the ETL, which only need to be read from and never generated by the ETL process
    • entries where :src_class is Kiba::Extend::Sources::Marc

Both of the following are valid:

reghash = {
  path: '/project/working/objects_prep.csv',
  creator: Project::ClientData::ObjectTable.method(:prep)
}

reghash = {
  path: '/project/clientData/objects.csv',
  supplied: true
}

:lookup_on

[Symbol] column to use as keys in lookup table created from file data

  • required if file is used as a lookup source
  • You can register the same file multiple times under different file keys with different :lookup_on values if you need to use the data for different lookup purposes

Currently only the following types of registry entries can be used as lookups:

  • :supplied is true and :src_class returns row/record Hashes
  • :dest_class writes/returns row/record Hashes

Other types of registry entries should not define a :lookup_on value.
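
As a sketch of why the same file might be registered under two keys, the code below groups a few made-up rows by whichever column is named in :lookup_on. The entry keys, the paths, the row data, and the build_lookup helper are all hypothetical, and the internal structure of a Kiba::Extend lookup table is assumed here rather than taken from the library.

```ruby
# Rows as they might be read from a CSV file (made-up data).
rows = [
  {objectid: "1", locationid: "A", title: "Bowl"},
  {objectid: "2", locationid: "A", title: "Vase"},
  {objectid: "3", locationid: "B", title: "Mask"}
]

# Hypothetical helper: group rows by the value in the :lookup_on column.
def build_lookup(rows, lookup_on)
  rows.group_by { |row| row[lookup_on] }
end

# The same file registered twice with different :lookup_on values
# (other keys such as :creator or :supplied are omitted for brevity).
reg = {
  objects_by_id: {path: "/project/working/objects.csv", lookup_on: :objectid},
  objects_by_location: {path: "/project/working/objects.csv", lookup_on: :locationid}
}

by_id = build_lookup(rows, reg[:objects_by_id][:lookup_on])
by_location = build_lookup(rows, reg[:objects_by_location][:lookup_on])

puts by_id["3"].map { |row| row[:title] }.inspect       # prints ["Mask"]
puts by_location["A"].map { |row| row[:title] }.inspect # prints ["Bowl", "Vase"]
```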

:desc

[String] description of what the file is and what it is used for. Shown when post-processing reports results to STDOUT

  • optional

:tags

[Array (of Symbols)] list of arbitrary tags useful for categorizing data/jobs in your ETL

  • optional
  • If set, you can filter to run only jobs tagged with a given tag (or tags)
  • Tags I commonly use:
    • :report_problems - reports that indicate something unexpected or that I need to do more work
    • :report_fyi - informational reports
    • :postmigcleanup - for reports I will need to generate for client after production migration is complete
    • :cspace or :ingest - final files ready to import
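
Tag-based filtering can be pictured with a short plain-Ruby sketch. The entry keys and the tagged_with helper are hypothetical, and the choice to require every given tag (rather than any of them) is an assumption; this reference does not say how Kiba::Extend combines multiple tags.

```ruby
# Made-up registry entries, reduced to their :tags values.
entries = {
  objects_missing_ids: {tags: %i[report_problems]},
  name_variants: {tags: %i[report_fyi]},
  objects_ingest: {tags: %i[cspace ingest]},
  objects_prep: {}
}

# Hypothetical helper: keys of the entries that carry every one of the given
# tags. Entries without a :tags key are treated as untagged.
def tagged_with(entries, *tags)
  entries.select { |_key, entry| (tags - entry.fetch(:tags, [])).empty? }.keys
end

puts tagged_with(entries, :report_problems).inspect # prints [:objects_missing_ids]
puts tagged_with(entries, :cspace, :ingest).inspect # prints [:objects_ingest]
```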