Module: Kiba::Extend::Mixins::IterativeCleanup

Defined in:
lib/kiba/extend/mixins/iterative_cleanup.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs/final.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs/worksheet.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs/corrections.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs/cleaned_uniq.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs/base_job_cleaned.rb,
lib/kiba/extend/mixins/iterative_cleanup/jobs/returned_compiled.rb,
lib/kiba/extend/mixins/iterative_cleanup/known_worksheet_values.rb

Overview

Mixin module for setting up iterative cleanup based on a source table.

“Iterative cleanup” means the client may provide the worksheet more than once, or that you may need to produce a fresh worksheet for the client after a new database export is provided.

Your project must follow some setup/configuration conventions in order to use this mixin:

  • Each cleanup process must be configured in its own config module.
  • A config module is a Ruby module that responds to :config.

Refer to Kiba::Tms::AltNumsForObjTypeCleanup as an example config module extending this mixin module in a simple way. See Kiba::Tms::PlacesCleanupInitial for a more complex usage with default overrides and custom pre/post transforms.

Implementation details

Define before extending this module

These can be defined as Dry::Configurable settings or as public methods. The section below lists the method/setting name the extending module should respond to, each preceded by its YARD signature.

# @return [Symbol] registry entry job key for the job whose output
#   will be used as the base for generating the cleanup worksheet.
#   Iterations of cleanup will be layered over this output in the
#   auto-generated jobs. **NOTE: This job's output should include a field
#   which combines/identifies the original values that may be
#   affected by the cleanup process. The default expectation is that
#   this field is named :fingerprint, but this can be overridden by
#   defining a custom `orig_values_identifier` method in the
#   extending module after extension. This field is used as a
#   matchpoint for merging cleaned up data back into the migration,
#   and identifying whether a given value in subsequent worksheet
#   iterations has been previously included in a worksheet**
# base_job
#
# @return [Array<Symbol>] fields included in the fingerprint value
# fingerprint_fields

Then, extend this module

extend Kiba::Extend::Mixins::IterativeCleanup
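
A minimal sketch of the define-then-extend ordering described above. DemoMixin is a hypothetical stand-in for this module so that the snippet runs without kiba-extend; its check only imitates the idea of verifying required methods at extension time and is not the library's implementation. The module name, job key, and field names are invented.

```ruby
# Hypothetical stand-in for Kiba::Extend::Mixins::IterativeCleanup.
# It only imitates "check required methods when extended".
module DemoMixin
  REQUIRED = %i[base_job fingerprint_fields].freeze

  def self.extended(mod)
    missing = REQUIRED.reject { |meth| mod.respond_to?(meth) }
    return if missing.empty?

    raise ArgumentError, "#{mod} must define: #{missing.join(", ")}"
  end
end

# Required methods are defined BEFORE the extend call.
module PlacesCleanup
  def self.base_job
    :places__prepared # invented job key
  end

  def self.fingerprint_fields
    %i[place note] # invented field names
  end

  extend DemoMixin
end

# Extending before defining the methods raises in this sketch.
begin
  Module.new { extend DemoMixin }
rescue ArgumentError => e
  puts e.message
end
```

Optional overrides go after the extend call, so that they replace the defaults supplied by the mixin.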

Methods that can be optionally overridden in extending module

Default values for the following methods are defined in this mixin module. If you want to override the values, define these methods in your config module after extending this module.

What extending this module does

Defines settings in the extending config module

These are empty settings with constructors that will use the values in a client-specific project config file to build the data expected for cleanup processing.

  • :provided_worksheets - Array of filenames of cleanup worksheets provided to the client. Files should be listed oldest-to-newest. Assumes files are in the to_client subdirectory of the migration base directory. Define actual values in client config file.
  • :returned_files - Array of filenames of completed worksheets returned by client. Files should be listed oldest-to-newest. Assumes files are in the supplied subdirectory of the migration base directory. Define actual values in client config file.

Defines methods in the extending config module

See method documentation inline below.

Prepares registry entries for iterative cleanup jobs

When the project application loads, the method that registers the project’s registry entries calls Utils::IterativeCleanupJobRegistrar. This util class calls the #register_cleanup_jobs method of each config module extending this module, adding the cleanup jobs to the registry dynamically.

The jobs themselves (i.e. the sources, lookups, transforms) are defined in Jobs. See that module’s documentation for how to set up custom pre/post transforms to customize specific cleanup routines.

Since:

  • 4.0.0

Defined Under Namespace

Modules: Jobs

Classes: KnownWorksheetValues

Class Method Details

.datadir(mod) ⇒ Object



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 382

def self.datadir(mod)
  dir = nil
  parents = mod.module_parents

  until dir || parents.empty?
    parent = parents.shift
    dir = parent.datadir if parent.respond_to?(:datadir)
  end

  raise Kiba::Extend::ProjectSettingUndefinedError, :datadir unless dir

  dir
end
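
The lookup above walks outward through the enclosing namespaces until one responds to datadir. The sketch below reproduces that walk in plain Ruby. The parents_of helper is a stand-in for ActiveSupport's Module#module_parents, DemoProject and its path are invented, and ArgumentError stands in for Kiba::Extend::ProjectSettingUndefinedError.

```ruby
# Plain-Ruby stand-in for ActiveSupport's Module#module_parents:
# returns the enclosing namespaces, innermost first, ending with Object.
def parents_of(mod)
  names = mod.name.split("::")[0...-1]
  parents = []
  until names.empty?
    parents << Object.const_get(names.join("::"))
    names.pop
  end
  parents << Object
end

module DemoProject # hypothetical project namespace
  def self.datadir
    "/tmp/demo_project/data" # invented path
  end

  module Places
    module Cleanup; end
  end
end

# Return the datadir of the nearest enclosing namespace that defines one.
def find_datadir(mod)
  parent = parents_of(mod).find { |par| par.respond_to?(:datadir) }
  raise ArgumentError, "no datadir defined above #{mod}" unless parent

  parent.datadir
end

puts find_datadir(DemoProject::Places::Cleanup)
```

Because the nearest enclosing namespace wins, a config module nested several levels deep still picks up the project-level datadir.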

.extended(mod) ⇒ Object

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 113

def self.extended(mod)
  check_required_settings(mod)
  unless mod.is_a?(Dry::Configurable)
    mod.extend(Dry::Configurable)
  end
  define_provided_worksheets_setting(mod)
  define_returned_files_setting(mod)
end

Instance Method Details

#all_collate_fields ⇒ Array<Symbol>

Note:

Override at your peril

Ensures that orig_values_identifier is always included in collated fields

Returns:

  • (Array<Symbol>)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 318

def all_collate_fields
  [collate_fields, orig_values_identifier].flatten.uniq
end
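
A small illustration of the flatten/uniq step, using invented field names. The helper takes both values as arguments so that it runs outside a config module.

```ruby
# Same expression as in the method above, with the two inputs passed in.
def demo_all_collate_fields(collate_fields, orig_values_identifier)
  [collate_fields, orig_values_identifier].flatten.uniq
end

# With no custom collate fields, only the identifier remains.
p demo_all_collate_fields([], :fingerprint)
# => [:fingerprint]

# Listing the identifier yourself does not duplicate it.
p demo_all_collate_fields(%i[fullheading fingerprint], :fingerprint)
# => [:fullheading, :fingerprint]
```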

#base_job_cleaned_job_key ⇒ Symbol

Note:

Do not override

Returns the registry entry job key for the base job with cleanup merged in.

Returns:

  • (Symbol)

    the registry entry job key for the base job with cleanup merged in

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 326

def base_job_cleaned_job_key
  "#{cleanup_base_name}__base_job_cleaned".to_sym
end

#clean_fingerprint_flag_ignore_fields ⇒ nil, ...

Note:

Optional: override in extending module after extending

Field(s) included in fingerprint_fields setting that should be ignored when identifying changed/corrected values in returned worksheets. If a Symbol or Array of Symbols is given, these are passed as the value of ignore_fields when Transforms::Fingerprint::FlagChanged is called. DEFAULT VALUE: nil

This is included because of two situations that I’ve run into:

  • I accidentally included a field I shouldn’t have in the fingerprint and sent a worksheet to the client. For example, maybe I put :client_cleanup_process_notes in the worksheet_add_fields setting, told the client these notes are for their use only and will not be considered “corrections” or merged into the migration or future cleanup iterations, but I forgot to subtract this field from my fingerprint_fields setting.
  • I purposefully included a field (e.g. :rowid) present in my base_job in fingerprint_fields to ensure unique matchpoints, but didn’t want to include that field in the client worksheet. If I don’t ignore this field in flagging changes, :rowid in all returned worksheets is nil, which does not match the fingerprinted :rowid value, and thus every row is a changed row.

Returns:

  • (nil, Symbol, Array<Symbol>)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 269

def clean_fingerprint_flag_ignore_fields
  nil
end
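
The :rowid situation described above can be illustrated with a simplified comparison. This is a sketch only: changed? is a hypothetical helper that compares field values directly and is not the Transforms::Fingerprint::FlagChanged implementation. DemoConfig and the row values are invented.

```ruby
# Hypothetical config module overriding the default (nil).
module DemoConfig
  def self.clean_fingerprint_flag_ignore_fields
    :rowid
  end
end

# Simplified stand-in for change flagging: a row counts as changed when
# any compared field differs between the original and returned values.
def changed?(original, returned, fields:, ignore: nil)
  compare = fields - Array(ignore)
  compare.any? { |field| original[field] != returned[field] }
end

FINGERPRINT_FIELDS = %i[rowid place].freeze
original = {rowid: "42", place: "Accra"}
# :rowid was not in the client worksheet, so it comes back as nil.
returned = {rowid: nil, place: "Accra"}

p changed?(original, returned, fields: FINGERPRINT_FIELDS)
# => true (every row would look changed)
p changed?(original, returned, fields: FINGERPRINT_FIELDS,
  ignore: DemoConfig.clean_fingerprint_flag_ignore_fields)
# => false
```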

#cleaned_uniq_job_key ⇒ Symbol

Note:

Do not override

Returns the registry entry job key for the job that deduplicates the clean base job data.

Returns:

  • (Symbol)

    the registry entry job key for the job that deduplicates the clean base job data

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 334

def cleaned_uniq_job_key
  "#{cleanup_base_name}__cleaned_uniq".to_sym
end

#cleanup_base_name ⇒ String

Note:

Optional: override in extending module after extending

Used as the namespace for auto-generated registry entries and the base for output file names. DEFAULT VALUE: the name of the extending module, converted to snake case.

Returns:

  • (String)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 131

def cleanup_base_name
  name.split("::")[-1]
    .gsub(/([A-Z])/, '_\1')
    .delete_prefix("_")
    .downcase
end
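
To see what the default produces, the same string transformation can be applied to a module name passed in as a String. The helper name snake_base_name is hypothetical, and Demo::TMSCleanup is an invented module name used to show how runs of capitals behave.

```ruby
# Same chain as the default above, applied to a given module name.
def snake_base_name(module_name)
  module_name.split("::")[-1]
    .gsub(/([A-Z])/, '_\1')
    .delete_prefix("_")
    .downcase
end

base = snake_base_name("Kiba::Tms::AltNumsForObjTypeCleanup")
p base
# => "alt_nums_for_obj_type_cleanup"

# The base name prefixes the auto-generated job keys.
p "#{base}__worksheet".to_sym
# => :alt_nums_for_obj_type_cleanup__worksheet

# Each capital letter gets its own underscore, so acronyms are split.
p snake_base_name("Demo::TMSCleanup")
# => "t_m_s_cleanup"
```

If the split-acronym form is unwanted, overriding cleanup_base_name in the config module is the documented way to choose a different name.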

#cleanup_done? ⇒ Boolean

Also known as: cleanup_done

Note:

Do not override

Returns:

  • (Boolean)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 300

def cleanup_done?
  true unless returned_files.empty?
end

#collate_fields ⇒ Array<Symbol>

Note:

Optional: override in extending module after extending

Fields from base_job_cleaned that will be deleted in cleaned_uniq, and then merged back into the deduplicated data of that job from base_job_cleaned. I.e., fields whose values will be collated into multivalued fields on the deduplicated values. DEFAULT VALUE: []

Note that :fingerprint (or your overridden orig_values_identifier) is added to these values by the #all_collate_fields method. That field should always be collated, or you will not be able to match final cleaned values back to original migration data.

An example of when you might want to add additional collate fields: For authority term cleanup, especially if we are breaking up subject headings into individual subdivisions, I like to provide the full subject heading from which the term was derived, for context. For example, :subdivision = “History”, :fullheading = “Ghana – History”. Suppose you also have a row with :subdivision = “Histories”, :fullheading = “Ghana – Histories”, and the client corrects “Histories” to “History” in that row. If you include :fullheading in collate_fields, a subsequently generated worksheet row with :subdivision = “History” will have :fullheading = “Ghana – History\\Ghana – Histories”.

It can also be useful for clients with large cleanup projects to provide the number of occurrences for each value in the project. Retain this information through multiple cleanup iterations by collating the occurrences field and adding an inline transform to split and sum the values in a custom cleaned_uniq_post_xforms method. See Tms::PlacesCleanupInitial for an example.

Returns:

  • (Array<Symbol>)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 224

def collate_fields
  []
end
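
The collation described above can be sketched in plain Ruby as follows. This is an illustration of the idea, not the code of the cleaned_uniq job: the collate helper, the rows, and the fingerprint values are invented, and values are joined with the documented default collation_delim.

```ruby
DEMO_DELIM = "////" # the documented default for collation_delim

# Invented rows, after the client corrected "Histories" to "History".
DEMO_ROWS = [
  {subdivision: "History", fullheading: "Ghana -- History", fingerprint: "fp1"},
  {subdivision: "History", fullheading: "Ghana -- Histories", fingerprint: "fp2"}
].freeze

# Deduplicate on one field and join the values of the collated fields.
def collate(rows, dedupe_on:, fields:, delim: DEMO_DELIM)
  rows.group_by { |row| row[dedupe_on] }.map do |value, group|
    collated = fields.to_h do |field|
      [field, group.map { |row| row[field] }.join(delim)]
    end
    {dedupe_on => value}.merge(collated)
  end
end

result = collate(DEMO_ROWS, dedupe_on: :subdivision,
  fields: %i[fullheading fingerprint])
p result.length
# => 1
p result.first[:fullheading]
# => "Ghana -- History////Ghana -- Histories"
p result.first[:fingerprint]
# => "fp1////fp2"
```

Collating the fingerprint alongside :fullheading is what keeps both original rows traceable from the single deduplicated row.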

#collated_orig_values_id_field ⇒ Object

Appends “s” to module’s orig_values_identifier. Used to manage joining, collating, and splitting/exploding on this value, while clarifying that any collated field in output is collated (not expected to be a single value).

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 369

def collated_orig_values_id_field
  "#{orig_values_identifier}s".to_sym
end

#collation_delim ⇒ String

Note:

Optional: override in extending module after extending

Delimiting string used to join collated-on-deduplication values. Should be distinct from normal application delimiters since the field values being joined/split may contain the normal application delimiters. DEFAULT VALUE: "////"

Returns:

  • (String)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 236

def collation_delim
  "////"
end
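
A short sketch of why a distinct delimiter matters, and of splitting a collated numeric field back apart. It assumes "|" as the normal application delimiter; the helper names and values are invented for illustration.

```ruby
COLLATION_DELIM = "////" # documented default
APP_DELIM = "|"          # assumed normal application delimiter

# Join values with a delimiter, then split them back apart.
def round_trip(values, delim)
  values.join(delim).split(delim)
end

values = ["Ghana|History", "Accra"]
p round_trip(values, COLLATION_DELIM)
# => ["Ghana|History", "Accra"]
p round_trip(values, APP_DELIM)
# => ["Ghana", "History", "Accra"] (the first value is broken apart)

# Split-and-sum for a collated occurrences field.
def sum_collated(value, delim = COLLATION_DELIM)
  value.split(delim).map(&:to_i).sum
end

p sum_collated("3////5////12")
# => 20
```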

#corrections_job_key ⇒ Symbol

Note:

Do not override

Returns the registry entry job key for the compiled corrections job.

Returns:

  • (Symbol)

    the registry entry job key for the compiled corrections job

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 357

def corrections_job_key
  "#{cleanup_base_name}__corrections".to_sym
end

#final_job_key ⇒ Object

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 361

def final_job_key
  "#{cleanup_base_name}__final".to_sym
end

#final_lookup_on_field ⇒ Symbol

Note:

Optional: override in extending module after extending

Will be used to set the lookup_on field in job registry hash for cleanup_base_name__final, for merging cleaned-up data back into the rest of your migration. DEFAULT VALUE: value of orig_values_identifier

Returns:

  • (Symbol)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 281

def final_lookup_on_field
  orig_values_identifier
end

#job_tags ⇒ Array<Symbol>

Note:

Optional: override in extending module after extending

Tags assigned to all jobs generated by IterativeCleanup for this module. Tags allow retrieval and running of jobs via thor jobs:tagged, thor jobs:tagged_or, and thor jobs:tagged_and commands. DEFAULT VALUE: [] (empty array)

Returns:

  • (Array<Symbol>)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 161

def job_tags
  []
end

#orig_values_identifier ⇒ Symbol

Note:

Optional: override in extending module after extending

Field in base job that combines/identifies the original field values entering the cleanup process. This field is used as a matchpoint for merging cleaned up data back into the migration, and identifying whether a given value in subsequent worksheet iterations has been previously included in a worksheet. DEFAULT VALUE: :fingerprint

Returns:

  • (Symbol)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 148

def orig_values_identifier
  :fingerprint
end

#register_cleanup_jobs ⇒ Object

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 438

def register_cleanup_jobs
  ns = build_namespace
  Kiba::Extend.registry.import(ns)
end

#returned_compiled_job_key ⇒ Symbol

Note:

Do not override

Returns the registry entry job key for the job that compiles returned worksheets.

Returns:

  • (Symbol)

    the registry entry job key for the job that compiles returned worksheets

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 349

def returned_compiled_job_key
  "#{cleanup_base_name}__returned_compiled".to_sym
end

#returned_file_jobs ⇒ Array<Symbol>

Note:

Do not override

Returns supplied registry entry job keys corresponding to returned cleanup files.

Returns:

  • (Array<Symbol>)

    supplied registry entry job keys corresponding to returned cleanup files

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 291

def returned_file_jobs
  returned_files.map.with_index do |filename, idx|
    "#{cleanup_base_name}__file_returned_#{idx}".to_sym
  end
end
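
As an illustration, the same key-building logic can be run on an invented base name and file list. The helper takes both as arguments so that it works outside a config module. Keys depend only on list position, which is one reason the oldest-to-newest ordering of returned_files matters.

```ruby
# One job key per returned file, numbered by position in the list.
def returned_file_job_keys(base_name, returned_files)
  returned_files.each_index.map do |idx|
    "#{base_name}__file_returned_#{idx}".to_sym
  end
end

files = ["places_2024-01.csv", "places_2024-03.csv"] # invented filenames
p returned_file_job_keys("places_cleanup", files)
# => [:places_cleanup__file_returned_0, :places_cleanup__file_returned_1]
```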

#worksheet_add_fields ⇒ Array<Symbol>

Note:

Optional: override in extending module after extending

Nil/empty fields to be added to worksheet. Note: values from these fields are retained from returned cleanup worksheets if these fields are included in fingerprint_fields. DEFAULT VALUE: []

Returns:

  • (Array<Symbol>)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 172

def worksheet_add_fields
  []
end

#worksheet_field_order ⇒ Array<Symbol>

Note:

Optional: override in extending module after extending

Order of fields (in worksheet output). Will be used to set destination special options/initial headers on the worksheet job. DEFAULT VALUE: []

Returns:

  • (Array<Symbol>)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 183

def worksheet_field_order
  []
end

#worksheet_job_key ⇒ Symbol

Note:

Do not override

Returns the registry entry job key for the worksheet prep job.

Returns:

  • (Symbol)

    the registry entry job key for the worksheet prep job

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 341

def worksheet_job_key
  "#{cleanup_base_name}__worksheet".to_sym
end

#worksheet_sent_not_done? ⇒ Boolean

Note:

Do not override

Returns:

  • (Boolean)

Since:

  • 4.0.0



# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 308

def worksheet_sent_not_done?
  true if !cleanup_done? && !provided_worksheets.empty?
end
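
The two state predicates can be exercised together in a standalone sketch. The methods below copy the expressions shown in this section but take the file lists as arguments; the filenames are invented. As written, both predicates return nil rather than false in the negative case, so callers should rely on truthiness.

```ruby
def cleanup_done?(returned_files)
  true unless returned_files.empty?
end

def worksheet_sent_not_done?(provided_worksheets, returned_files)
  true if !cleanup_done?(returned_files) && !provided_worksheets.empty?
end

# Nothing sent, nothing returned.
p cleanup_done?([])                                       # => nil
p worksheet_sent_not_done?([], [])                        # => nil

# Worksheet sent, nothing returned yet.
p worksheet_sent_not_done?(["ws1.csv"], [])               # => true

# Worksheet sent and returned.
p cleanup_done?(["ws1_done.csv"])                         # => true
p worksheet_sent_not_done?(["ws1.csv"], ["ws1_done.csv"]) # => nil
```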