Module: Kiba::Extend::Mixins::IterativeCleanup
Defined in:
- lib/kiba/extend/mixins/iterative_cleanup.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/final.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/worksheet.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/corrections.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/cleaned_uniq.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/base_job_cleaned.rb
- lib/kiba/extend/mixins/iterative_cleanup/jobs/returned_compiled.rb
- lib/kiba/extend/mixins/iterative_cleanup/known_worksheet_values.rb
Overview
Mixin module for setting up iterative cleanup based on a source table.
“Iterative cleanup” means the client may provide the worksheet more than once, or that you may need to produce a fresh worksheet for the client after a new database export is provided.
Your project must follow some setup/configuration conventions in order to use this mixin:
- Each cleanup process must be configured in its own config module.
- A config module is a Ruby module that responds to :config.
Refer to Kiba::Tms::AltNumsForObjTypeCleanup as an example config module extending this mixin module in a simple way. See Kiba::Tms::PlacesCleanupInitial for a more complex usage with default overrides and custom pre/post transforms.
Implementation details
Define before extending this module
These can be defined as Dry::Configurable settings or as public methods. The section below lists the method/setting name the extending module should respond to, each preceded by its YARD signature.
# @return [Symbol] registry entry job key for the job whose output
# will be used as the base for generating the cleanup worksheet.
# Iterations of cleanup will be layered over this output in the
# auto-generated. **NOTE: This job's output should include a field
# which combines/identifies the original values that may be
# affected by the cleanup process. The default expectation is that
# this field is named :fingerprint, but this can be overridden by
# defining a custom `orig_values_identifier` method in the
# extending module after extension. This field is used as a
# matchpoint for merging cleaned up data back into the migration,
# and identifying whether a given value in subsequent worksheet
# iterations has been previously included in a worksheet**
# base_job
#
# @return [Array<Symbol>] fields included in the fingerprint value
# fingerprint_fields
Then, extend this module
extend Kiba::Extend::Mixins::IterativeCleanup
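The ordering matters because Ruby runs a mixin's extended hook at the moment extend is called, and this mixin's hook inspects the extending module (see .extended below). The following is a minimal plain-Ruby sketch of that pattern. DemoCleanupMixin, its REQUIRED list, the ArgumentError, DemoPlacesCleanup, and the :prep__places job key are all hypothetical stand-ins; what the real check_required_settings verifies may differ in detail.

```ruby
# Hypothetical stand-in for the real mixin; it illustrates the hook pattern only.
module DemoCleanupMixin
  REQUIRED = %i[base_job fingerprint_fields].freeze

  # Ruby calls this hook when a module runs `extend DemoCleanupMixin`.
  def self.extended(mod)
    missing = REQUIRED.reject { |meth| mod.respond_to?(meth) }
    return if missing.empty?

    raise ArgumentError, "#{mod} must define: #{missing.join(", ")}"
  end
end

# Hypothetical config module: required methods come first, extend comes last.
module DemoPlacesCleanup
  module_function

  def base_job
    :prep__places
  end

  def fingerprint_fields
    %i[place note]
  end

  extend DemoCleanupMixin
end

# Extending before defining the required methods fails inside the hook.
failure = begin
  Module.new { extend DemoCleanupMixin }
  nil
rescue ArgumentError => e
  e.message
end
```

In this sketch, failure holds the error message for the anonymous module, while DemoPlacesCleanup extends cleanly.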
Methods that can be optionally overridden in extending module
Default values for the following methods are defined in this mixin module. If you want to override the values, define these methods in your config module after extending this module.
- #cleanup_base_name
- #orig_values_identifier
- #job_tags
- #worksheet_add_fields
- #worksheet_field_order
- #collate_fields
- #collation_delim
- #clean_fingerprint_flag_ignore_fields
- #final_lookup_on_field
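A minimal sketch of the override pattern, using a hypothetical DemoDefaults module in place of the real mixin so that it runs without the library. The default values mirror the ones documented below; DemoOverrideConfig and the :clean_fingerprint field name are assumptions for illustration.

```ruby
# Hypothetical stand-in providing a few of the documented defaults.
module DemoDefaults
  def orig_values_identifier
    :fingerprint
  end

  def job_tags
    []
  end

  def collation_delim
    "////"
  end

  def final_lookup_on_field
    orig_values_identifier
  end
end

module DemoOverrideConfig
  module_function

  extend DemoDefaults

  # Overrides defined after extending replace the mixin defaults.
  def job_tags
    %i[places cleanup]
  end

  def orig_values_identifier
    :clean_fingerprint # hypothetical field name
  end
end

tags = DemoOverrideConfig.job_tags
delim = DemoOverrideConfig.collation_delim
lookup = DemoOverrideConfig.final_lookup_on_field
```

Methods that are not overridden (collation_delim here) keep their defaults, and defaults that call other methods (final_lookup_on_field here) pick up the overridden value.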
What extending this module does
Defines settings in the extending config module
These are empty settings with constructors that will use the values in a client-specific project config file to build the data expected for cleanup processing.
- :provided_worksheets - Array of filenames of cleanup worksheets provided to the client. Files should be listed oldest-to-newest. Assumes files are in the to_client subdirectory of the migration base directory. Define actual values in client config file.
- :returned_files - Array of filenames of completed worksheets returned by client. Files should be listed oldest-to-newest. Assumes files are in the supplied subdirectory of the migration base directory. Define actual values in client config file.
Defines methods in the extending config module
See method documentation inline below.
Prepares registry entries for iterative cleanup jobs
When the project application loads, the method that registers the project’s registry entries calls Utils::IterativeCleanupJobRegistrar. This util class calls the #register_cleanup_jobs method of each config module extending this module, adding the cleanup jobs to the registry dynamically.
The jobs themselves (i.e. the sources, lookups, transforms) are defined in Jobs. See that module’s documentation for how to set up custom pre/post transforms to customize specific cleanup routines.
Defined Under Namespace
Modules: Jobs Classes: KnownWorksheetValues
Class Method Summary
- .datadir(mod) ⇒ Object
- .extended(mod) ⇒ Object
Instance Method Summary
- #all_collate_fields ⇒ Array<Symbol>
  Ensures that orig_values_identifier is always included in collated fields.
- #base_job_cleaned_job_key ⇒ Symbol
  The registry entry job key for the base job with cleanup merged in.
- #clean_fingerprint_flag_ignore_fields ⇒ nil, ...
  Field(s) included in the fingerprint_fields setting that should be ignored when identifying changed/corrected values in returned worksheets.
- #cleaned_uniq_job_key ⇒ Symbol
  The registry entry job key for the job that deduplicates the clean base job data.
- #cleanup_base_name ⇒ String
  Used as the namespace for auto-generated registry entries and the base for output file names.
- #cleanup_done? ⇒ Boolean (also: #cleanup_done)
- #collate_fields ⇒ Array<:fingerprint>
  Fields from base_job_cleaned that will be deleted in cleaned_uniq, and then merged back into the deduplicated data of that job from base_job_cleaned.
- #collated_orig_values_id_field ⇒ Object
  Appends “s” to module’s orig_values_identifier.
- #collation_delim ⇒ String
  Delimiting string used to join collated-on-deduplication values.
- #corrections_job_key ⇒ Symbol
  The registry entry job key for the corrections job.
- #final_job_key ⇒ Object
- #final_lookup_on_field ⇒ Symbol
  Will be used to set the lookup_on field in job registry hash for cleanup_base_name__final, for merging cleaned-up data back into the rest of your migration.
- #job_tags ⇒ Array<Symbol>
  Tags assigned to all jobs generated by IterativeCleanup for this module.
- #orig_values_identifier ⇒ Symbol
  Field in base job that combines/identifies the original field values entering the cleanup process.
- #register_cleanup_jobs ⇒ Object
- #returned_compiled_job_key ⇒ Symbol
  The registry entry job key for the returned_compiled job.
- #returned_file_jobs ⇒ Array<Symbol>
  Supplied registry entry job keys corresponding to returned cleanup files.
- #worksheet_add_fields ⇒ Array<Symbol>
  Nil/empty fields to be added to worksheet.
- #worksheet_field_order ⇒ Array<Symbol>
  Order of fields (in worksheet output).
- #worksheet_job_key ⇒ Symbol
  The registry entry job key for the worksheet prep job.
- #worksheet_sent_not_done? ⇒ Boolean
Class Method Details
.datadir(mod) ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 382
def self.datadir(mod)
  dir = nil
  parents = mod.module_parents
  until dir || parents.empty?
    parent = parents.shift
    dir = parent.datadir if parent.respond_to?(:datadir)
  end
  raise Kiba::Extend::ProjectSettingUndefinedError, :datadir unless dir
  dir
end
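The lookup above walks outward through the enclosing namespaces of the config module until it finds one that responds to datadir. module_parents is provided by ActiveSupport, so the sketch below approximates it in plain Ruby to stay self-contained. DemoProject, its nested modules, DemoOrphan, the path, and the ArgumentError are hypothetical; the real method raises Kiba::Extend::ProjectSettingUndefinedError.

```ruby
# Hypothetical project namespace that defines datadir at the top level.
module DemoProject
  def self.datadir
    "/tmp/demo_migration"
  end

  module Places
    module Cleanup; end
  end
end

module DemoOrphan; end

# Plain-Ruby approximation of ActiveSupport's Module#module_parents:
# enclosing namespaces, nearest first (Object itself is omitted here).
def enclosing_modules(mod)
  parts = mod.name.split("::")[0..-2]
  parts.each_index.map { |i| Object.const_get(parts[0..i].join("::")) }.reverse
end

# Return the datadir of the nearest enclosing namespace that defines one.
def find_datadir(mod)
  parent = enclosing_modules(mod).find { |m| m.respond_to?(:datadir) }
  raise ArgumentError, "datadir is not defined in any parent of #{mod}" unless parent

  parent.datadir
end

found = find_datadir(DemoProject::Places::Cleanup)
```

A module with no enclosing namespace that defines datadir, such as DemoOrphan here, triggers the error branch.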
.extended(mod) ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 113
def self.extended(mod)
  check_required_settings(mod)
  unless mod.is_a?(Dry::Configurable)
    mod.extend(Dry::Configurable)
  end
  define_provided_worksheets_setting(mod)
  define_returned_files_setting(mod)
end
Instance Method Details
#all_collate_fields ⇒ Array<Symbol>
Override at your peril
Ensures that orig_values_identifier is always included in collated fields
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 318
def all_collate_fields
  [collate_fields, orig_values_identifier].flatten.uniq
end
#base_job_cleaned_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the base job with cleanup merged in.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 326
def base_job_cleaned_job_key
  "#{cleanup_base_name}__base_job_cleaned".to_sym
end
#clean_fingerprint_flag_ignore_fields ⇒ nil, ...
Optional: override in extending module after extending
Field(s) included in the fingerprint_fields setting that should be ignored when identifying changed/corrected values in returned worksheets. If a Symbol or Array of Symbols is given, these are passed as the value of ignore_fields when Transforms::Fingerprint::FlagChanged is called. DEFAULT VALUE: nil
This is included because of two situations that I’ve run into:
- I accidentally included a field I shouldn’t have in the fingerprint and sent a worksheet to the client. For example, maybe I put :client_cleanup_process_notes in the worksheet_add_fields setting, told the client these notes are for their use only and will not be considered “corrections” or merged into the migration or future cleanup iterations, but I forgot to subtract this field from my fingerprint_fields setting.
- I purposefully included a field (e.g. :rowid) present in my base_job in fingerprint_fields to ensure unique matchpoints, but didn’t want to include that field in the client worksheet. If I don’t ignore this field in flagging changes, :rowid in all returned worksheets is nil, which does not match the fingerprinted :rowid value, and thus every row is a changed row.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 269
def clean_fingerprint_flag_ignore_fields
  nil
end
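The :rowid situation can be illustrated with a simplified simulation. This is not the behavior of Transforms::Fingerprint::FlagChanged itself: changed? is a hypothetical helper that compares field values directly, and the row data is invented for the example.

```ruby
# Simplified stand-in for change flagging: a row counts as changed when any
# compared field differs between the fingerprinted and the returned values.
def changed?(original, returned, fields, ignore_fields: nil)
  compared = fields - Array(ignore_fields)
  compared.any? { |field| original[field] != returned[field] }
end

fingerprint_fields = %i[rowid place]
original = {rowid: "42", place: "Accra"}
returned = {rowid: nil, place: "Accra"} # :rowid was not sent in the worksheet

# Without ignoring :rowid, the untouched row looks changed.
without_ignore = changed?(original, returned, fingerprint_fields)

# Ignoring :rowid restores the expected result: no change.
with_ignore = changed?(original, returned, fingerprint_fields, ignore_fields: :rowid)
```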
#cleaned_uniq_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the job that deduplicates the clean base job data.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 334
def cleaned_uniq_job_key
  "#{cleanup_base_name}__cleaned_uniq".to_sym
end
#cleanup_base_name ⇒ String
Optional: override in extending module after extending
Used as the namespace for auto-generated registry entries and the base for output file names. DEFAULT VALUE: the name of the extending module, converted to snake case.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 131
def cleanup_base_name
  name.split("::")[-1]
    .gsub(/([A-Z])/, '_\1')
    .delete_prefix("_")
    .downcase
end
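As a worked example, the same chain of string operations can be applied to the example module name from the overview. snake_base_name is a hypothetical helper that takes the module name as an argument so the sketch runs outside a config module.

```ruby
# Same string operations as cleanup_base_name, applied to an explicit name.
def snake_base_name(module_name)
  module_name.split("::")[-1]
    .gsub(/([A-Z])/, '_\1')
    .delete_prefix("_")
    .downcase
end

base = snake_base_name("Kiba::Tms::AltNumsForObjTypeCleanup")

# The *_job_key methods append a suffix to this base name.
worksheet_key = "#{base}__worksheet".to_sym

# Every capital letter gets its own underscore, so acronyms are split up.
acronym = snake_base_name("Kiba::Tms::TMSCleanup")
```

Here base is "alt_nums_for_obj_type_cleanup" and acronym is "t_m_s_cleanup"; overriding cleanup_base_name may be preferable when a module name contains an acronym.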
#cleanup_done? ⇒ Boolean Also known as: cleanup_done
Do not override
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 300
def cleanup_done?
  true unless returned_files.empty?
end
#collate_fields ⇒ Array<:fingerprint>
Optional: override in extending module after extending
Fields from base_job_cleaned that will be deleted in cleaned_uniq, and then merged back into the deduplicated data of that job from base_job_cleaned. I.e., fields whose values will be collated into multivalued fields on the deduplicated values. DEFAULT VALUE: []
Note that :fingerprint (or your overridden orig_values_identifier) is added to these values by the #all_collate_fields method. That field should always be collated, or you will not be able to match final cleaned values back to original migration data.
An example of when you might want to add additional collate fields: For authority term cleanup, especially if we are breaking up subject headings into individual subdivisions, I like to provide the full subject heading from which the term was derived, for context. For example, :subdivision = “History”, :fullheading = “Ghana – History”. If you also have a row with :subdivision = “Histories”, :fullheading = “Ghana – Histories”, and the client corrects “Histories” to “History” in that row, if you include :fullheading in collate_fields, a subsequently generated worksheet row with :subdivision = “History” will have :fullheading = “Ghana – History\\Ghana – Histories”.
It can also be useful for clients with large cleanup projects to provide the number of occurrences for each value in the project. Retain this information through multiple cleanup iterations by collating the occurrences field and adding an inline transform to split and sum the values in a custom cleaned_uniq_post_xforms method. See Tms::PlacesCleanupInitial for an example.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 224
def collate_fields
  []
end
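The collation described above can be simulated on plain hashes. This is only a sketch of the idea, not the transforms the cleaned_uniq job actually runs: the rows and fingerprint values are invented, the default collation_delim is assumed, and the collated fields keep their original names for simplicity.

```ruby
# Rows as they might look after the client's correction has been merged in.
rows = [
  {subdivision: "History", fullheading: "Ghana -- History", fingerprint: "fp1"},
  {subdivision: "History", fullheading: "Ghana -- Histories", fingerprint: "fp2"}
]

collate_fields = %i[fullheading]
# Mirrors all_collate_fields: the identifier field is always collated.
all_collate_fields = [collate_fields, :fingerprint].flatten.uniq
collation_delim = "////"

# Deduplicate on the non-collated fields, then join the collated values.
deduplicated = rows
  .group_by { |row| row.reject { |field, _| all_collate_fields.include?(field) } }
  .map do |uniq_values, group|
    collated = all_collate_fields.to_h do |field|
      [field, group.map { |row| row[field] }.uniq.join(collation_delim)]
    end
    uniq_values.merge(collated)
  end
```

The two input rows collapse into one row whose :fullheading and :fingerprint values each carry both originals, joined by the delimiter.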
#collated_orig_values_id_field ⇒ Object
Appends “s” to module’s orig_values_identifier. Used to manage joining, collating, and splitting/exploding on this value, while clarifying that any collated field in output is collated (not expected to be a single value).
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 369
def collated_orig_values_id_field
  "#{orig_values_identifier}s".to_sym
end
#collation_delim ⇒ String
Optional: override in extending module after extending
Delimiting string used to join collated-on-deduplication
values. Should be distinct from normal application
delimiters since the field values being joined/split may
contain the normal application delimiters. DEFAULT VALUE: "////"
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 236
def collation_delim
  "////"
end
#corrections_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the corrections job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 357
def corrections_job_key
  "#{cleanup_base_name}__corrections".to_sym
end
#final_job_key ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 361
def final_job_key
  "#{cleanup_base_name}__final".to_sym
end
#final_lookup_on_field ⇒ Symbol
Optional: override in extending module after extending
Will be used to set the lookup_on field in job registry hash for cleanup_base_name__final, for merging cleaned-up data back into the rest of your migration. DEFAULT VALUE: value of orig_values_identifier
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 281
def final_lookup_on_field
  orig_values_identifier
end
#job_tags ⇒ Array<Symbol>
Optional: override in extending module after extending
Tags assigned to all jobs generated by IterativeCleanup for this module. Tags allow retrieval and running of jobs via thor jobs:tagged, thor jobs:tagged_or, and thor jobs:tagged_and commands. DEFAULT VALUE: [] (empty array)
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 161
def job_tags
  []
end
#orig_values_identifier ⇒ Symbol
Optional: override in extending module after extending
Field in base job that combines/identifies the original
field values entering the cleanup process. This field is
used as a matchpoint for merging cleaned up data back into
the migration, and identifying whether a given value in
subsequent worksheet iterations has been previously
included in a worksheet. DEFAULT VALUE: :fingerprint
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 148
def orig_values_identifier
  :fingerprint
end
#register_cleanup_jobs ⇒ Object
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 438
def register_cleanup_jobs
  ns = build_namespace
  Kiba::Extend.registry.import(ns)
end
#returned_compiled_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the returned_compiled job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 349
def returned_compiled_job_key
  "#{cleanup_base_name}__returned_compiled".to_sym
end
#returned_file_jobs ⇒ Array<Symbol>
Do not override
Returns supplied registry entry job keys corresponding to returned cleanup files.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 291
def returned_file_jobs
  returned_files.map.with_index do |filename, idx|
    "#{cleanup_base_name}__file_returned_#{idx}".to_sym
  end
end
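For illustration, the same mapping can be run on literal values. The base name and file names below are hypothetical.

```ruby
cleanup_base_name = "places_cleanup" # hypothetical base name
returned_files = ["places_returned_1.csv", "places_returned_2.csv"] # hypothetical

# One job key per returned file, numbered by position (oldest file is 0).
job_keys = returned_files.map.with_index do |_filename, idx|
  "#{cleanup_base_name}__file_returned_#{idx}".to_sym
end
```

Because the keys depend only on position, keeping returned_files in oldest-to-newest order keeps each key attached to the same file as new files are appended.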
#worksheet_add_fields ⇒ Array<Symbol>
Optional: override in extending module after extending
Nil/empty fields to be added to worksheet. Note: values from these fields are retained from returned cleanup worksheets if these fields are included in fingerprint_fields. DEFAULT VALUE: []
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 172
def worksheet_add_fields
  []
end
#worksheet_field_order ⇒ Array<Symbol>
Optional: override in extending module after extending
Order of fields (in worksheet output). Will be used to set
destination special options/initial headers on the
worksheet job. DEFAULT VALUE: []
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 183
def worksheet_field_order
  []
end
#worksheet_job_key ⇒ Symbol
Do not override
Returns the registry entry job key for the worksheet prep job.
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 341
def worksheet_job_key
  "#{cleanup_base_name}__worksheet".to_sym
end
#worksheet_sent_not_done? ⇒ Boolean
Do not override
# File 'lib/kiba/extend/mixins/iterative_cleanup.rb', line 308
def worksheet_sent_not_done?
  true if !cleanup_done? && !provided_worksheets.empty?
end
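The two status predicates can be exercised together across the stages of a cleanup. DemoStatus is a hypothetical Struct standing in for a config module; the method bodies are copied from the listings for cleanup_done? and worksheet_sent_not_done?, and the file names are invented.

```ruby
# Hypothetical stand-in holding the two settings the predicates read.
DemoStatus = Struct.new(:provided_worksheets, :returned_files) do
  def cleanup_done?
    true unless returned_files.empty?
  end

  def worksheet_sent_not_done?
    true if !cleanup_done? && !provided_worksheets.empty?
  end
end

not_started = DemoStatus.new([], [])
awaiting_client = DemoStatus.new(["ws1.csv"], [])
files_returned = DemoStatus.new(["ws1.csv"], ["ws1_returned.csv"])

# Collect [cleanup_done?, worksheet_sent_not_done?] for each stage.
statuses = [not_started, awaiting_client, files_returned].map do |status|
  [status.cleanup_done?, status.worksheet_sent_not_done?]
end
```

As written, both predicates return true or nil rather than true or false, so they are best used in conditionals and not compared against false.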