Using the IterativeCleanup mixin

“Iterative cleanup” means the client may provide the worksheet more than once, or that you may need to produce a fresh worksheet for the client after a new database export is provided.

There is no reason you can’t use the pattern for expected one-round cleanup. How often does one round of cleanup turn into more, after all?

IterativeCleanup was added in v4.0.0.

Examples

kiba-extend-project has been updated to reflect usage of the IterativeCleanup mixin. If you have an existing project based off kiba-extend-project, this diff might help identify what you need to add to your project to use IterativeCleanup.

Refer to Kiba::Tms::AltNumsForObjTypeCleanup as an example config module extending IterativeCleanup in a simple way. See Kiba::Tms::PlacesCleanupInitial for a more complex usage with default overrides and custom pre/post transforms.

Project setup assumptions

Your project must follow some setup/configuration conventions in order to use this mixin:

Each cleanup process must be configured in its own config module

A config module is a Ruby module that responds to :config.

Extending Dry::Configurable adds a config method to a module:

module Project::NameCategorization
  module_function
  extend Dry::Configurable
end

Or you can manually define a config class method on the module:

module Project::PersonCleanup
  module_function

  def config
    true
  end
end

Kiba::Extend config_namespaces setting must be set from your project

After your project’s base file has called the project’s loader, it must set the Kiba::Extend.config.config_namespaces setting.

This setting lists the namespace(s) where your config modules live.

In most of my projects, all of my config modules are in one namespace. For example, for the above project, I would add:

Kiba::Extend.config.config_namespaces = [Project]

Note that the setting takes an array, so you can list multiple namespaces if you have organized your project differently and your configs are not all in one namespace. For example, a migration for a Tms client may have client-specific cleanups in the client-specific migration code project (config namespace: TmsClientName). That code project will make use of the kiba-tms application, which also defines cleanup configs in the namespace Kiba::Tms. Such a project would do this at the bottom of lib/tms_client_name.rb:

Kiba::Extend.config.config_namespaces = [Kiba::Tms, TmsClientName]

Add cleanup job registration to your RegistryData registration method

Add the following to RegistryData.register (or whatever method triggers the registration of all your jobs):

Kiba::Extend::Utils::IterativeCleanupJobRegistrar.call

This line should be added before any calls to registry.transform, registry.freeze, or registry.finalize.
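
A minimal sketch of where that call might sit, assuming a RegistryData module shaped roughly like the one in kiba-extend-project (register_files is a hypothetical stand-in for your existing registration code, and whether registry.transform/registry.freeze are called here or elsewhere depends on your project):

module KeProject
  module RegistryData
    module_function

    def register
      register_files # hypothetical: your existing per-namespace job registrations
      Kiba::Extend::Utils::IterativeCleanupJobRegistrar.call
      Kiba::Extend.registry.transform
      Kiba::Extend.registry.freeze
    end
  end
end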

config_namespaces setting is populated before RegistryData registration

Calling RegistryData.register (or whatever method triggers the registration of all your jobs) must be done after the config_namespaces are set.
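
In other words, the bottom of your project’s base file ends up with this ordering (a sketch; the loader line stands in for however your project sets up autoloading):

# lib/ke_project.rb (sketch)
loader.setup # or however your project loads its code
Kiba::Extend.config.config_namespaces = [KeProject]
KeProject::RegistryData.register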

Setup of an individual iterative cleanup process

The following explanation uses the demonstration places cleanup in kiba-extend-project as its main example.

:fingerprint vs. :clean_fingerprint and their functions in an iterative cleanup process

The fingerprint_fields setting required in a cleanup config module is only used inside the iterative cleanup process to add and decode :clean_fingerprint values.

In the iterative cleanup process, the function of :clean_fingerprint is:

  • Represent the original values of the editable fields of the cleanup worksheet, so that we can identify rows where the client made changes
  • Allow multiple rows corrected to the same value to be collapsed to one row for future iterations of review/cleanup

It follows that the IterativeCleanup-related fingerprint_fields used to create :clean_fingerprint should include all fields included in the worksheet that:

  • you expect to be edited
  • combine to uniquely identify a row (for example, if you have an :orig_name column with the original data and a separate, initially blank :corrected_name column, you need to include both fields in fingerprint_fields, since the initially blank :corrected_name values cannot uniquely identify rows on their own)

The file associated with the iterative cleanup process’ base_job is expected to include a :fingerprint field by default. The name of this field can be changed in the cleanup config. The function of this field in the iterative cleanup is:

  • In subsequent iterations, determine if a row in the worksheet has been seen before or needs to be marked for review
  • Each cleaned row keeps track of the :fingerprint values of the original rows that have been collapsed/changed into the clean row

The fact that :fingerprint and :clean_fingerprint serve entirely different functions is why it is possible to override orig_values_identifier in your cleanup config module after extending IterativeCleanup.

The KeProject::PlacesCleanup process could probably work just fine by overriding orig_values_identifier to be :place, and not adding a :fingerprint field in KeProject::Jobs::Places::PrepForCleanup.
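
That alternative would look something like this (a sketch only; the demo project keeps the default :fingerprint field instead):

# in KeProject::PlacesCleanup, after `extend Kiba::Extend::Mixins::IterativeCleanup`
def orig_values_identifier
  :place
end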

places config notes

Defines settings used in the KeProject::Places::PrepForCleanup job (and, presumably, in a real project, other jobs).

Note that the value of KeProject::Places.fingerprint_fields is different from the value of KeProject::PlacesCleanup.fingerprint_fields. This works for the reasons outlined in the :fingerprint vs. :clean_fingerprint section.

places cleanup config notes

Required before extending IterativeCleanup: base_job

This job is created outside the iterative cleanup process, and serves as the base and starting point for a cleanup process.

The full registry entry key (e.g. places__prep_for_cleanup) must be set as the base_job setting in a cleanup config module prior to extending that module with Kiba::Extend::Mixins::IterativeCleanup. See lib/ke_project/places_cleanup.rb.

IMPORTANT: This job’s output must include a field which combines/identifies the original values that may be affected by the cleanup process. The default expectation is that this field is named :fingerprint, but this can be overridden by defining a custom orig_values_identifier method in the extending module after extension. This field is used as a matchpoint for merging cleaned up data back into the migration, and identifying whether a given value in subsequent worksheet iterations has been previously included in a worksheet.

TIP: You can add the :fingerprint field by defining a base_job_cleaned_pre_xforms method in your cleanup config module. I’ve done this in several kiba-tms cleanup configs where the base tables are generated by other complicated automated processes, and it would be a nightmare to work the addition of the fingerprint field into those. See Kiba::Tms::AltNumsForRefmasterTypeCleanup as an example.
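
A hedged sketch of such an override, assuming the usual Kiba.job_segment wrapper and kiba-extend’s Fingerprint::Add transform (the field list is illustrative; check the transform’s documentation for the exact options in your version):

def base_job_cleaned_pre_xforms
  Kiba.job_segment do
    # hash the identifying field(s) into the :fingerprint field the mixin expects
    transform Kiba::Extend::Transforms::Fingerprint::Add,
      target: :fingerprint,
      fields: %i[place]
  end
end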

Required before extending IterativeCleanup: fingerprint_fields

The fields that will be hashed into the :clean_fingerprint value. See the :fingerprint vs. :clean_fingerprint section for more detail.

Usually you will want to include any worksheet_add_fields, plus any other fields that, in combination with the worksheet_add_fields, yield the full corrected value for the row.
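
Putting the two required settings together, a cleanup config module starts out roughly like this before extension (values are illustrative, loosely following the demo project; plain module methods are shown, but anything that makes the module respond to these names works):

module KeProject
  module PlacesCleanup
    module_function

    def base_job
      "places__prep_for_cleanup"
    end

    def fingerprint_fields
      %i[place proximity uncertainty]
    end

    extend Kiba::Extend::Mixins::IterativeCleanup
  end
end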

Optional default method overrides

There are a number of overrideable methods. They are well-documented at Kiba::Extend::Mixins::IterativeCleanup. Look for the list under “Methods that can be optionally overridden in extending module”.

The ones used in the demo config are listed below. Look at the documentation linked above for the full list.

worksheet_add_fields

I want clients to be able to remove things like “near” and “(?)” from these place terms, recording proximity and uncertainty information in separate fields. So I add those fields to the worksheet.
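
In the demo config module that would look roughly like this (field names follow the description above, but are otherwise illustrative):

def worksheet_add_fields
  %i[proximity uncertainty]
end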

job_tags

Allows retrieval and running of jobs via thor jobs:tagged, thor jobs:tagged_or, and thor jobs:tagged_and commands.
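
For example (tag values are illustrative):

def job_tags
  %i[places cleanup]
end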

cleanup_base_name

This is an important one to understand. Our cleanup config module name is PlacesCleanup, so by default, cleanup_base_name will be set to "places_cleanup".

This is used as the namespace for registering the jobs associated with the cleanup process, for example :places_cleanup__worksheet.

You can override this if you want.
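
For instance, to register the jobs under a :place_cleanup namespace instead (illustrative):

def cleanup_base_name
  "place_cleanup"
end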

Custom transforms!

See Kiba::Extend::Mixins::IterativeCleanup::Jobs for documentation, and the kiba-tms PlacesInitialCleanup config module for use examples.

The process

Here is the default iterative cleanup process, represented in a flowchart. There’s also a higher-resolution PDF version, and the raw Mermaid source of the flowchart. The steps and settings are explained textually below the flowchart.

Flowchart

Right now, the best place to step through and check out the processing in a detailed way is to look at the following in the kiba-tms repository:

  • PlacesInitialCleanup and its detailed tests:
  • generation of initial worksheet (i.e. “when no cleanup done”)
  • merge of corrected data and generation of a second worksheet after first round of cleanup is returned (i.e. “when initial cleanup returned”)
  • after a new database export is received after an initial round of cleanup has been done (i.e. “when fresh data after initial cleanup”) - all previous cleanup retained; cleanup rows linked to now-deleted database data no longer appear; any new values in cleanup worksheet generated at this point get flagged “to_review”
  • verification of everything after worksheet based on fresh data is returned, including “final” job (i.e. “when second round of cleanup”)

BaseJobCleaned :cleanup_base_name__base_job_cleaned

If no cleanup worksheets returned

Adds any worksheet_add_fields you have specified.

Adds :clean_fingerprint field.

If any cleanup worksheets returned

Adds any worksheet_add_fields you have specified.

Merges corrections. See Corrections for details on how corrections are prepared for merge back into original data.

Adds :clean_fingerprint field.

CleanedUniq :cleanup_base_name__cleaned_uniq

Starts with BaseJobCleaned output.

Deletes :fingerprint (or your custom orig_values_identifier) and any custom collate_fields you specified.

Deduplicates on :clean_fingerprint field values. Now if four rows for “North Carolina”, “NC”, “N.C.”, and “N. Carolina” have all been changed to “North Carolina”, we only have one row for “North Carolina”.

Re-merges in the collate fields (including the orig_values_identifier/:fingerprint field) as multi-valued fields (separated by collation_delim). This also pluralizes collate field names that don’t already end in “s”. So our one row for “North Carolina” will now have a :fingerprints field containing the 4 fingerprint values from the 4 original rows.
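
As a data-level illustration (values invented; the actual delimiter depends on your collation_delim setting):

# four BaseJobCleaned rows sharing one :clean_fingerprint value...
[
  {state: "North Carolina", fingerprint: "1"},
  {state: "North Carolina", fingerprint: "2"},
  {state: "North Carolina", fingerprint: "3"},
  {state: "North Carolina", fingerprint: "4"}
]
# ...become a single CleanedUniq row with the collated, pluralized field:
{state: "North Carolina", fingerprints: "1|2|3|4"}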

Once all cleanup is done, this might be the appropriate source job for further jobs generating unique authority terms.

Worksheet :cleanup_base_name__worksheet

Starts with CleanedUniq output.

See bottom of KeProject::PlacesCleanup for example of recording provided_worksheets.

If you have recorded no provided_worksheets

Rows are just passed through as-is.

If you have recorded one or more provided_worksheets

Gets a list of known orig_values_identifier/:fingerprint values in provided worksheets. It does this by creating and calling a new Kiba::Extend::Mixins::IterativeCleanup::KnownWorksheetValues instance. This:

  • Reads each provided worksheet file
  • Gets the :fingerprints (or equivalent) field from each row and splits its multiple values apart
  • Compiles and deduplicates all the values

A blank :to_review field is added to the worksheet being prepared.

Now, for each row we are going to output to this worksheet (see the sketch after this list), we:

  • Split the values of the :fingerprints or equivalent field.
  • If all the fingerprint values for this row are in the list of known values, :to_review is left blank.
  • Otherwise, :to_review is set to “y”.
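
In rough Ruby terms, the per-row check amounts to something like this (known_values stands in for the compiled list described above; the field name and delimiter are illustrative):

row_fingerprints = row[:fingerprints].split("|")
row[:to_review] = "y" unless row_fingerprints.all? { |fp| known_values.include?(fp) }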

Worksheet is given to client for completion

At this point, you should record this file in the provided_worksheets setting.

See bottom of KeProject::PlacesCleanup for example of recording provided_worksheets.

Client returns completed (or partially completed) worksheet

At this point, you should record the returned file in the returned_files setting.

See bottom of KeProject::PlacesCleanup for example of recording returned_files.

ReturnedCompiled :cleanup_base_name__returned_compiled

Reads in all rows from returned_files as data source. Note that these must be listed oldest to newest. They are read in as sources in that order, which is important when we get to merging corrections!

Deletes :to_review field if present.

Runs Kiba::Extend::Transforms::Fingerprint::FlagChanged, using :clean_fingerprint. Any custom clean_fingerprint_flag_ignore_fields are ignored. This:

  • Adds the decoded (original) fingerprint field values to new fields prefixed with “fp_”
  • Deletes :clean_fingerprint after it has been decoded
  • Adds a :corrected field.
  • Compares each original/fp_ field with its corresponding field in the returned file. For rows where any of the fingerprint_fields values were changed in the returned worksheet, the names of the fields with changed values are gathered in the :corrected field; for rows with no changes, the :corrected field is left blank (see the illustration after this list).
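
For illustration (values invented), a returned row where the client removed “(?)” from :place might come out of FlagChanged looking like:

{
  place: "Durham",        # value in the returned worksheet
  fp_place: "Durham (?)", # decoded original value from :clean_fingerprint
  corrected: "place"      # field(s) whose values changed
}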

Deletes the “fp_”-prefixed fields derived during the FlagChanged process.

Runs Kiba::Extend::Transforms::Clean::EnsureConsistentFields to ensure all rows have the same fields.

Corrections :cleanup_base_name__corrections

Reads in the output of ReturnedCompiled.

Deletes rows where the :corrected field is blank.

This leaves just rows where changes were made in a returned worksheet, from oldest to newest. Order is important!

Because this is an iterative cleanup process, we need to account for the fact that cleanup done in worksheet #2 may have been done on a single row that resulted from the cleanup of 4 rows in worksheet #1. Recall the “North Carolina” example in CleanedUniq.

For this reason, and because we merge all the corrections, from oldest to newest, back into BaseJobCleaned on the original :fingerprint, we run Kiba::Extend::Transforms::Explode::RowsFromMultivalField on the collated :fingerprints (or equivalent) field, producing one correction row per original :fingerprint value.

So, if, in round 1, the :state field values “NC”, “N.C.”, and “N. Carolina” were all changed to “North Carolina”, we have 3 rows in Corrections output with instructions to merge “North Carolina” into the :state field in rows with matching :fingerprint values. (The 4th “North Carolina” row had no change in round 1.)

Now, in round 2, the client noticed that the row with :state = “Ohio” also has :country = “USA”, and added “USA” as country in the row for “North Carolina”.

The Corrections output is now also going to have 4 rows with instructions to merge “USA” into the :country field in rows with matching :fingerprint values.

So:


| country | state          | corrected | fingerprint |
|---------+----------------+-----------+-------------|
|         | North Carolina | state     |           2 |
|         | North Carolina | state     |           3 |
|         | North Carolina | state     |           4 |
| USA     | North Carolina | country   |           1 |
| USA     | North Carolina | country   |           2 |
| USA     | North Carolina | country   |           3 |
| USA     | North Carolina | country   |           4 |

For lookup/merge back into BaseJobCleaned, those rows are gathered into a hash, with :fingerprint as the key:

{ 2=>[
  {country: nil, state: "North Carolina", corrected: "state", fingerprint: 2},
  {country: "USA", state: "North Carolina", corrected: "country", fingerprint: 2}
 ]
}

When the BaseJobCleaned merge process hits the row with :fingerprint = 2, it carries out the corrections per row, in order.

  • row[:state] = "North Carolina"
  • row[:country] = "USA"

Why does it do this so inefficiently? Why not just take the last cleanup row for each fingerprint and replace the field values?

I can’t tell you the details of why, but at some point I tried something like that and ended up with a mess. That was before I had worked out having two separate fingerprints, and there were other complications with that attempt. But I worked out this process, and it works, generally, across the board, so I’m leaving it for now.

Final :cleanup_base_name__final

Unless you define custom transforms for this job, it just passes through the BaseJobCleaned output, with :fingerprint (or the custom field returned by an overridden final_lookup_on_field method) as its lookup field.

Use this as a lookup to get your cleaned data back into other places in the migration.
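
For example, a downstream job that has registered :places_cleanup__final as a lookup might merge the cleaned value in with something like this (a sketch; field names are illustrative, and the exact Merge::MultiRowLookup options depend on your data):

transform Kiba::Extend::Transforms::Merge::MultiRowLookup,
  lookup: places_cleanup__final,
  keycolumn: :fingerprint,
  fieldmap: {place_cleaned: :place}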