Module: Kiba::Extend::Mixins::IterativeCleanup
Defined in:
- lib/kiba/extend/mixins/iterative_cleanup.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/final.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/worksheet.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/corrections.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/cleaned_uniq.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/base_job_cleaned.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/returned_compiled.rb
- lib/kiba/extend/mixins/iterative_cleanup/known_worksheet_values.rb
Overview
Mixin module for setting up iterative cleanup based on a source table.
“Iterative cleanup” means the client may provide the worksheet more than once, or that you may need to produce a fresh worksheet for the client after a new database export is provided.
Your project must follow some setup/configuration conventions in order to use this mixin:
- Each cleanup process must be configured in its own config module.
- A config module is a Ruby module that responds to :config.
Refer to Kiba::Tms::AltNumsForObjTypeCleanup as an example config module extending this mixin module in a simple way. See Kiba::Tms::PlacesCleanupInitial for a more complex usage with default overrides and custom pre/post transforms.
Implementation details
Define before extending this module
These can be defined as Dry::Configurable settings or as public methods. The section below lists the method/setting name the extending module should respond to, each preceded by its YARD signature.
# @return [Symbol] registry entry job key for the job whose output
# will be used as the base for generating the cleanup worksheet.
# Iterations of cleanup will be layered over this output in the
# auto-generated jobs. **NOTE: This job's output should include a field
# which combines/identifies the original values that may be
# affected by the cleanup process. The default expectation is that
# this field is named :fingerprint, but this can be overridden by
# defining a custom `orig_values_identifier` method in the
# extending module after extension. This field is used as a
# matchpoint for merging cleaned up data back into the migration,
# and identifying whether a given value in subsequent worksheet
# iterations has been previously included in a worksheet**
def base_job = :base_job__key
#
# @return [Array<Symbol>] fields included in the fingerprint value
def fingerprint_fields = %i[field1 field2]
#
# @return [Symbol] ONLY REQUIRED IF YOU ARE IMPLEMENTING AN FCAR
# CHUTE INCLUDING THIS CLEANUP PROCESS. Registry entry job key
# for the job that fully merges the output of this cleanup
# process back into your project data
def merge_job = :merge_job__key
Then, extend this module
extend Kiba::Extend::Mixins::IterativeCleanup
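The ordering convention above (required methods first, then `extend`, then optional overrides) can be sketched as follows. This is a minimal illustration, not the mixin's implementation: `StubIterativeCleanup` is a hypothetical stand-in so the example runs without the kiba-extend gem, its required-method check is only an assumption about what `check_required_settings` does, and `PlacesCleanup`, `:places__prep`, and the field names are made up. Using `module_function` in the config module is also an assumption.

```ruby
# Hypothetical stand-in for Kiba::Extend::Mixins::IterativeCleanup.
# The real mixin calls check_required_settings in its extended hook; the
# check below is an assumed approximation of that behavior.
module StubIterativeCleanup
  REQUIRED = %i[base_job fingerprint_fields].freeze

  def self.extended(mod)
    missing = REQUIRED.reject { |meth| mod.respond_to?(meth) }
    return if missing.empty?

    raise ArgumentError, "#{mod} must define #{missing.join(', ')} before extending"
  end

  # Defaults that an extending config module may override
  def orig_values_identifier = :fingerprint

  def collate_fields = []
end

# Hypothetical config module
module PlacesCleanup
  module_function

  # Required: defined BEFORE extending, because the extended hook checks them
  def base_job = :places__prep

  def fingerprint_fields = %i[place note]

  extend StubIterativeCleanup

  # Optional override: defined AFTER extending
  def collate_fields = %i[fullheading]
end

p PlacesCleanup.collate_fields
p PlacesCleanup.orig_values_identifier
```

Here `PlacesCleanup.collate_fields` returns the overridden value, while `PlacesCleanup.orig_values_identifier` falls through to the stub's default. Extending a module that lacks `base_job` or `fingerprint_fields` raises in the stub's `extended` hook.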
Methods that can be optionally overridden in extending module
Default values for the following methods are defined in this mixin module. If you want to override the values, define these methods in your config module after extending this module.
- #cleanup_base_name
- #orig_values_identifier
- #job_tags
- #worksheet_add_fields
- #worksheet_field_order
- #collate_fields
- #collation_delim
- #clean_fingerprint_flag_ignore_fields
- #final_lookup_on_field
- #final_lookup_sources
What extending this module does
Defines settings in the extending config module
These are empty settings with constructors that will use the values in a client-specific project config file to build the data expected for cleanup processing:
- :provided_worksheets - Array of filenames of cleanup worksheets provided to the client. Files should be listed oldest-to-newest. Assumes files are in the to_client subdirectory of the migration base directory. Define actual values in client config file.
- :returned_files - Array of filenames of completed worksheets returned by client. Files should be listed oldest-to-newest. Assumes files are in the supplied subdirectory of the migration base directory. Define actual values in client config file.
Defines methods in the extending config module
See method documentation inline below.
Prepares registry entries for iterative cleanup jobs
When the project application loads, the method that registers the project’s registry entries calls Utils::IterativeCleanupJobRegistrar. This util class calls the #register_cleanup_jobs method of each config module extending this module, adding the cleanup jobs to the registry dynamically.
The jobs themselves (i.e. the sources, lookups, transforms) are defined in Jobs. See that module’s documentation for how to set up custom pre/post transforms to customize specific cleanup routines.
Defined Under Namespace
Modules: Jobs
Classes: KnownWorksheetValues
Class Method Summary
- .datadir(mod) ⇒ Object
- .extended(mod) ⇒ Object
Instance Method Summary
- #all_collate_fields ⇒ Array<Symbol>
  Ensures that orig_values_identifier is always included in collated fields.
- #base_job_cleaned_job_key ⇒ Symbol
  The registry entry job key for the base job with cleanup merged in.
- #clean_fingerprint_flag_ignore_fields ⇒ nil, ...
  Field(s) included in fingerprint_fields setting that should be ignored when identifying changed/corrected values in returned worksheets.
- #cleaned_uniq_job_key ⇒ Symbol
  The registry entry job key for the job that deduplicates the clean base job data.
- #cleanup_base_name ⇒ String
  Used as the namespace for auto-generated registry entries and the base for output file names.
- #cleanup_done? ⇒ Boolean (also: #cleanup_done)
- #collate_fields ⇒ Array<:fingerprint>
  Fields from base_job_cleaned that will be deleted in cleaned_uniq, and then merged back into the deduplicated data of that job from base_job_cleaned.
- #collated_orig_values_id_field ⇒ Object
  Appends “s” to module’s orig_values_identifier.
- #collation_delim ⇒ String
  Delimiting string used to join collated-on-deduplication values.
- #corrections_job_key ⇒ Symbol
  The registry entry job key for the corrections job.
- #final_job_key ⇒ Object
- #final_lookup_on_field ⇒ Symbol
  Will be used to set the lookup_on field in job registry hash for cleanup_base_name__final, for merging cleaned-up data back into the rest of your migration.
- #final_lookup_sources ⇒ Array<Symbol>
  Job keys of registered jobs to be used as lookup tables in the cleanup_base_name__final job.
- #job_tags ⇒ Array<Symbol>
  Tags assigned to all jobs generated by IterativeCleanup for this module.
- #orig_values_identifier ⇒ Symbol
  Field in base job that combines/identifies the original field values entering the cleanup process.
- #register_cleanup_jobs ⇒ Object
- #returned_compiled_job_key ⇒ Symbol
  The registry entry job key for the returned_compiled job.
- #returned_file_jobs ⇒ Array<Symbol>
  Supplied registry entry job keys corresponding to returned cleanup files.
- #worksheet_add_fields ⇒ Array<Symbol>
  Nil/empty fields to be added to worksheet.
- #worksheet_field_order ⇒ Array<Symbol>
  Order of fields (in worksheet output).
- #worksheet_job_key ⇒ Symbol
  The registry entry job key for the worksheet prep job.
- #worksheet_sent_not_done? ⇒ Boolean
Class Method Details
.datadir(mod) ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 393
def self.datadir(mod)
  dir = nil
  parents = mod.module_parents
  until dir || parents.empty?
    parent = parents.shift
    dir = parent.datadir if parent.respond_to?(:datadir)
  end
  raise Kiba::Extend::ProjectSettingUndefinedError, :datadir unless dir
  dir
end
.extended(mod) ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 120
def self.extended(mod)
  check_required_settings(mod)
  unless mod.is_a?(Dry::Configurable)
    mod.extend(Dry::Configurable)
  end
  define_provided_worksheets_setting(mod)
  define_returned_files_setting(mod)
end
Instance Method Details
#all_collate_fields ⇒ Array<Symbol>
Override at your peril
Ensures that orig_values_identifier is always included in collated fields
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 329
def all_collate_fields
  [collate_fields, orig_values_identifier].flatten.uniq
end
#base_job_cleaned_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the base job with cleanup merged in.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 337
def base_job_cleaned_job_key
  :"#{cleanup_base_name}__base_job_cleaned"
end
#clean_fingerprint_flag_ignore_fields ⇒ nil, ...
Optional: override in extending module after extending
Field(s) included in fingerprint_fields setting that
should be ignored when identifying changed/corrected
values in returned worksheets. If a Symbol or Array of
Symbols is given, these are passed as the value of
ignore_fields when
Transforms::Fingerprint::FlagChanged is
called. DEFAULT VALUE: nil
This is included because of two situations that I’ve run into:
- I accidentally included a field I shouldn’t have in the fingerprint and sent a worksheet to the client. For example, maybe I put :client_cleanup_process_notes in the worksheet_add_fields setting, told the client these notes are for their use only and will not be considered “corrections” or merged into the migration or future cleanup iterations, but I forgot to subtract this field from my fingerprint_fields setting.
- I purposefully included a field (e.g. :rowid) present in my base_job in fingerprint_fields to ensure unique matchpoints, but didn’t want to include that field in the client worksheet. If I don’t ignore this field in flagging changes, :rowid in all returned worksheets is nil, which does not match the fingerprinted :rowid value, and thus every row is a changed row.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 276
def clean_fingerprint_flag_ignore_fields
  nil
end
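The :rowid situation described above can be illustrated with a small sketch. It compares plain hashes field by field, which is only an assumption made for illustration: the internals of Transforms::Fingerprint::FlagChanged are not shown on this page, and `changed?`, the row data, and the field names here are hypothetical.

```ruby
# Returns true when any compared field differs between the original row and
# the row returned in a worksheet. Fields listed in ignore_fields (a Symbol,
# an Array of Symbols, or nil) are left out of the comparison.
def changed?(orig, returned, fingerprint_fields:, ignore_fields: nil)
  compared = fingerprint_fields - Array(ignore_fields)
  compared.any? { |field| orig[field] != returned[field] }
end

fingerprint_fields = %i[rowid place note]
orig = {rowid: "17", place: "Accra", note: nil}
# :rowid was fingerprinted but not included in the client worksheet
returned = {place: "Accra", note: nil}

without_ignore = changed?(orig, returned, fingerprint_fields: fingerprint_fields)
with_ignore = changed?(orig, returned,
  fingerprint_fields: fingerprint_fields, ignore_fields: :rowid)

puts "flagged as changed without ignore_fields: #{without_ignore}"
puts "flagged as changed when ignoring :rowid:  #{with_ignore}"
```

Without the ignore list, the untouched row is flagged as changed because the missing :rowid compares as nil against "17". With :rowid ignored, the row is correctly treated as unchanged.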
#cleaned_uniq_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the job that deduplicates the clean base job data.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 345
def cleaned_uniq_job_key
  :"#{cleanup_base_name}__cleaned_uniq"
end
#cleanup_base_name ⇒ String
Optional: override in extending module after extending
Used as the namespace for auto-generated registry entries and the base for output file names. DEFAULT VALUE: the name of the extending module, converted to snake case.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 138
def cleanup_base_name
  name.split("::")[-1]
    .gsub(/([A-Z])/, '_\1')
    .delete_prefix("_")
    .downcase
end
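To see what the default produces, the same conversion can be run standalone. In this sketch, `snake_case_base_name` is a hypothetical helper that takes the module name as a string argument, and `Kiba::Tms::TMSCleanup` is a made-up name included to show how runs of capitals are handled.

```ruby
# Standalone copy of the conversion in cleanup_base_name, taking the module
# name as an argument so it can be tried without defining a config module.
def snake_case_base_name(module_name)
  module_name.split("::")[-1]
    .gsub(/([A-Z])/, '_\1')
    .delete_prefix("_")
    .downcase
end

puts snake_case_base_name("Kiba::Tms::AltNumsForObjTypeCleanup")
# prints alt_nums_for_obj_type_cleanup
puts snake_case_base_name("Kiba::Tms::PlacesCleanupInitial")
# prints places_cleanup_initial
puts snake_case_base_name("Kiba::Tms::TMSCleanup")
# prints t_m_s_cleanup
```

Because every capital letter gets its own underscore, a module name containing an acronym yields a base name split letter by letter. Overriding cleanup_base_name may be preferable in that case.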
#cleanup_done? ⇒ Boolean Also known as: cleanup_done
Do not override
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 311
def cleanup_done?
  true unless returned_files.empty?
end
#collate_fields ⇒ Array<:fingerprint>
Optional: override in extending module after extending
Fields from base_job_cleaned that will be deleted in
cleaned_uniq, and then merged back into the deduplicated
data of that job from base_job_cleaned. I.e., fields whose
values will be collated into multivalued fields on the
deduplicated values. DEFAULT VALUE: []
Note that :fingerprint (or your overridden orig_values_identifier)
is added to these values by the #all_collate_fields method. That
field should always be collated, or you will not be able to match
final cleaned values back to original migration data.
An example of when you might want to add additional collate
fields: For authority term cleanup, especially if we are
breaking up subject headings into individual subdivisions,
I like to provide the full subject heading from which the
term was derived, for context. For example, :subdivision
= “History”, :fullheading = “Ghana – History”. If you
also have a row with :subdivision = “Histories”,
:fullheading = “Ghana – Histories”, and the client
corrects “Histories” to “History” in that row, if you
include :fullheading in collate_fields, a subsequently
generated worksheet row with :subdivision = “History”
will have :fullheading = “Ghana – History\\Ghana –
Histories”.
It can also be useful for clients with large cleanup
projects to provide the number of occurrences for each
value in the project. Retain this information through
multiple cleanup iterations by collating the occurrences
field and adding an inline transform to split and sum the
values in a custom cleaned_uniq_post_xforms method. See
Tms::PlacesCleanupInitial for an example.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 231
def collate_fields
  []
end
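The collation described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the code of the cleaned_uniq job: the `collate` helper, the row data, and the "fp1"-style fingerprint values are hypothetical, and the real job is built from transforms defined in Jobs. The delimiter used is the default value of collation_delim.

```ruby
COLLATION_DELIM = "////" # default value of collation_delim

# Hypothetical rows after the client's correction of "Histories" to
# "History" has been merged in
cleaned_rows = [
  {subdivision: "History", fullheading: "Ghana – History", fingerprint: "fp1"},
  {subdivision: "History", fullheading: "Ghana – Histories", fingerprint: "fp2"},
  {subdivision: "Sources", fullheading: "Ghana – Sources", fingerprint: "fp3"}
]

# Deduplicates rows on one field and joins the values of each collate field
# across the rows that were merged together.
def collate(rows, dedupe_on:, collate_fields:, delim:)
  rows.group_by { |row| row[dedupe_on] }.map do |value, group|
    collated = collate_fields.to_h do |field|
      [field, group.map { |row| row[field] }.join(delim)]
    end
    {dedupe_on => value}.merge(collated)
  end
end

deduped = collate(
  cleaned_rows,
  dedupe_on: :subdivision,
  collate_fields: %i[fullheading fingerprint],
  delim: COLLATION_DELIM
)
deduped.each { |row| p row }
```

The two "History" rows collapse into one row whose :fullheading holds both source headings and whose :fingerprint holds both original fingerprints, joined by the delimiter. Keeping the fingerprints is what allows the final cleaned value to be matched back to every original row.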
#collated_orig_values_id_field ⇒ Object
Appends “s” to module’s orig_values_identifier. Used to
manage joining, collating, and splitting/exploding on this
value, while clarifying that any collated field in output
is collated (not expected to be a single value).
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 380
def collated_orig_values_id_field
  :"#{orig_values_identifier}s"
end
#collation_delim ⇒ String
Optional: override in extending module after extending
Delimiting string used to join collated-on-deduplication
values. Should be distinct from normal application
delimiters since the field values being joined/split may
contain the normal application delimiters. DEFAULT VALUE: "////"
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 243
def collation_delim
  "////"
end
#corrections_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the corrections job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 368
def corrections_job_key
  :"#{cleanup_base_name}__corrections"
end
#final_job_key ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 372
def final_job_key
  :"#{cleanup_base_name}__final"
end
#final_lookup_on_field ⇒ Symbol
Optional: override in extending module after extending
Will be used to set the lookup_on field in job registry
hash for cleanup_base_name__final, for merging
cleaned-up data back into the rest of your migration.
DEFAULT VALUE: value of orig_values_identifier
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 288
def final_lookup_on_field
  orig_values_identifier
end
#final_lookup_sources ⇒ Array<Symbol>
Returns job keys of registered jobs to be used as
lookup tables in the cleanup_base_name__final job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 294
def final_lookup_sources = []
#job_tags ⇒ Array<Symbol>
Optional: override in extending module after extending
Tags assigned to all jobs generated by IterativeCleanup for
this module. Tags allow retrieval and running of jobs via
thor jobs:tagged, thor jobs:tagged_or, and thor
jobs:tagged_and commands. DEFAULT VALUE: [] (empty
array)
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 168
def job_tags
  []
end
#orig_values_identifier ⇒ Symbol
Optional: override in extending module after extending
Field in base job that combines/identifies the original
field values entering the cleanup process. This field is
used as a matchpoint for merging cleaned up data back into
the migration, and identifying whether a given value in
subsequent worksheet iterations has been previously
included in a worksheet. DEFAULT VALUE: :fingerprint
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 155
def orig_values_identifier
  :fingerprint
end
#register_cleanup_jobs ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 449
def register_cleanup_jobs
  ns = build_namespace
  Kiba::Extend.registry.import(ns)
end
#returned_compiled_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the returned_compiled job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 360
def returned_compiled_job_key
  :"#{cleanup_base_name}__returned_compiled"
end
#returned_file_jobs ⇒ Array<Symbol>
Do not override
Returns supplied registry entry job keys corresponding to returned cleanup files.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 302
def returned_file_jobs
  returned_files.map.with_index do |filename, idx|
    :"#{cleanup_base_name}__file_returned_#{idx}"
  end
end
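As a quick illustration of the keys this produces, the same mapping can be run over a plain array. The base name and filenames below are hypothetical.

```ruby
cleanup_base_name = "places_cleanup_initial"
# Hypothetical returned worksheets, listed oldest-to-newest
returned_files = ["places_returned_a.csv", "places_returned_b.csv"]

# Same mapping as returned_file_jobs: one job key per file, numbered by the
# file's zero-based position in the list
job_keys = returned_files.map.with_index do |_filename, idx|
  :"#{cleanup_base_name}__file_returned_#{idx}"
end

p job_keys
# prints [:places_cleanup_initial__file_returned_0, :places_cleanup_initial__file_returned_1]
```

The key depends only on the file's position, not its name, so inserting a file anywhere but the end of returned_files changes which key refers to which file.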
#worksheet_add_fields ⇒ Array<Symbol>
Optional: override in extending module after extending
Nil/empty fields to be added to worksheet. Note: values from these
fields are retained from returned cleanup worksheets if these fields
are included in fingerprint_fields. DEFAULT VALUE: []
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 179
def worksheet_add_fields
  []
end
#worksheet_field_order ⇒ Array<Symbol>
Optional: override in extending module after extending
Order of fields (in worksheet output). Will be used to set
destination special options/initial headers on the
worksheet job. DEFAULT VALUE: []
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 190
def worksheet_field_order
  []
end
#worksheet_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the worksheet prep job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 352
def worksheet_job_key
  :"#{cleanup_base_name}__worksheet"
end
#worksheet_sent_not_done? ⇒ Boolean
Do not override
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 319
def worksheet_sent_not_done?
  true if !cleanup_done? && !provided_worksheets.empty?
end
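The two status predicates can be exercised together in a standalone sketch. The versions below take the settings as arguments (an adaptation for illustration; the real methods read provided_worksheets and returned_files from the config module), and the filenames are hypothetical.

```ruby
# Argument-taking copies of the predicates shown in this page
def cleanup_done?(returned_files)
  true unless returned_files.empty?
end

def worksheet_sent_not_done?(provided_worksheets, returned_files)
  true if !cleanup_done?(returned_files) && !provided_worksheets.empty?
end

# Nothing sent, nothing returned
p worksheet_sent_not_done?([], [])
# prints nil

# Worksheet sent, nothing returned yet
p worksheet_sent_not_done?(["ws1.csv"], [])
# prints true

# Worksheet sent and a completed file returned
p worksheet_sent_not_done?(["ws1.csv"], ["ws1_done.csv"])
# prints nil
```

As written, both predicates return true or nil rather than true or false. That works in conditionals, but a check such as `== false` would never match.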