kiba-extend cookbook

A place to share patterns for handling common workflows.

Generate unique subject headings used in multiple fields, across multiple data sources

Client data includes multiple tables. One or more fields from each table will end up mapping to subject. We need a list of all unique subject values for the client to review, and/or to use as the basis for a cleanup worksheet. Sample data:

  • coll1.csv - mainsub, addtlsub fields contain subject values
  • coll2.csv - subject field contains subject values

Assume we have already registered supplied entries for these data sources:

  • :orig__coll1
  • :orig__coll2
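If they are not registered yet, the entries might look something like this sketch (the "orig" subdirectory under Client.datadir is an assumption):

# in clientproject/lib/client/registry_data.rb

Client.registry.namespace("orig") do
  register :coll1, {
    # supplied: true marks this as client-provided data with no creator job
    path: File.join(Client.datadir, "orig", "coll1.csv"),
    supplied: true
  }
  register :coll2, {
    path: File.join(Client.datadir, "orig", "coll2.csv"),
    supplied: true
  }
end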

I’m going to set this up so that it can easily scale.

Create extract job to get unique subject values from each field in a consistent way

First, create a job that accepts parameters. It will isolate each subject field column so it can be used as a data source. The job drops rows with no subject value and deduplicates each individual column, so our eventual combined list is smaller. It also renames each source field to :subject, so the extracted columns line up when combined.

# clientproject/lib/client/jobs/subjects/extract_field_vals.rb

module Client
  module Jobs
    module Subjects
      module ExtractFieldVals
        module_function

        # @param source [Symbol] registry entry key of source, without namespace
        # @param dest [Symbol] registry entry key of destination, without namespace
        # @param field [Symbol] field containing subject values
        def job(source:, dest:, field:)
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: "orig__#{source}".to_sym,
              destination: "subjects__#{dest}".to_sym
            },
            transformer: xforms(field)
          )
        end

        def xforms(field)
          Kiba.job_segment do
            transform Delete::FieldsExcept,
              fields: field
            transform FilterRows::FieldPopulated,
              action: :keep,
              field: field
            transform Rename::Field,
              from: field,
              to: :subject
            # The field is now :subject, since we renamed it above
            transform Deduplicate::Table,
              field: :subject,
              delete_field: false
          end
        end
      end
    end
  end
end
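For a sense of how this will be wired up: with the registry configuration in the next section, the generated invocation for coll1's mainsub field is equivalent to:

Client::Jobs::Subjects::ExtractFieldVals.job(
  source: "coll1", dest: :coll1_mainsub, field: :mainsub
)

This reads :orig__coll1 and writes the deduplicated, single-column result to :subjects__coll1_mainsub.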

TIP: Have multiple values in your subject field?

Put this before you rename the field:

transform Explode::RowsFromMultivalField,
  field: field,
  delim: '|'

TIP: Have multiple values in your subject field AND want the unique subject subdivisions?

Note: This creates a 2-column output for each subject field. You get the whole subject field value, so there’s at least one example of the subdivision used in context.

Put this before you rename the field:

transform Explode::RowsFromMultivalField,
  field: field,
  delim: '|'

Put this after you rename the field:

transform Copy::Field,
  from: :subject,
  to: :subdivision
transform Explode::RowsFromMultivalField,
  field: :subdivision,
  delim: '--'
transform Clean::StripFields, fields: :subdivision

Then, change the field you deduplicate on to :subdivision.
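Putting both tips together, the subdivision variant of xforms would be something like this sketch (the same transforms as above, in the order just described):

def xforms(field)
  Kiba.job_segment do
    transform Delete::FieldsExcept,
      fields: field
    transform FilterRows::FieldPopulated,
      action: :keep,
      field: field
    transform Explode::RowsFromMultivalField,
      field: field,
      delim: '|'
    transform Rename::Field,
      from: field,
      to: :subject
    transform Copy::Field,
      from: :subject,
      to: :subdivision
    transform Explode::RowsFromMultivalField,
      field: :subdivision,
      delim: '--'
    transform Clean::StripFields, fields: :subdivision
    transform Deduplicate::Table,
      field: :subdivision,
      delete_field: false
  end
end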

Set up registry entries to dynamically extract and compile the field data based on config

Set up your registry stuff to call this job dynamically for whatever sources you give it:

Client.registry.namespace("subjects") do
  # Here is where you configure your subject value sources. If you have
  #   followed naming pattern/convention, then you don't have to touch
  #   this again
  src_cfg = [
    ["coll1", :mainsub],
    ["coll1", :addtlsub],
    ["coll2", :subject]
  ]
  # We'll use this as an argument passed to our compilation job. It just
  #   creates the full registry entry keys with namespaces that will be
  #   sources for that job
  subject_srcs = src_cfg.map do |src, field|
    "subjects__#{src}_#{field}".to_sym
  end

  # Dynamically create individual source field extract jobs
  src_cfg.each do |src, field|
    dest = "#{src}_#{field}".to_sym
    register dest, {
      path: File.join(Client.datadir, "working",
                      "subjects_from_#{dest}.csv"),
      creator: {
        callee: Client::Jobs::Subjects::ExtractFieldVals,
        args: {source: src, dest: dest, field: field}
      }
    }
  end
  # This is also called with args, so the sources can be based on
  #   the `src_cfg` you entered above
  register :compile, {
    path: File.join(Client.datadir, "working", "subjects_compiled.csv"),
    creator: {
      callee: Client::Jobs::Subjects::Compile,
      args: {sources: subject_srcs}
    }
  }
end
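With the src_cfg shown, this registers :subjects__coll1_mainsub, :subjects__coll1_addtlsub, and :subjects__coll2_subject as extract jobs, and :subjects__compile with all three as its sources. Adding another subject source later is a one-line change to src_cfg.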

Write the compile job, taking its sources as an argument

And here is the compile job you’ll call to generate the list of all unique values. It reads in all the rows from all the single-column extract jobs and deduplicates the combined values:

module Client
  module Jobs
    module Subjects
      module Compile
        module_function

        # @param sources [Array<Symbol>]
        def job(sources:)
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: sources,
              destination: :subjects__compile
            },
            transformer: xforms
          )
        end

        def xforms
          Kiba.job_segment do
            transform Deduplicate::Table,
              field: :subject,
              delete_field: false
          end
        end
      end
    end
  end
end

Run the compilation job

thor run job subjects__compile

Everything else runs automagically in the background: each source of :subjects__compile has a registered creator, so kiba-extend runs the extract jobs as needed before compiling.

Well, almost. If any of your subject field sources ends up not writing a CSV file because there are no values in the field, this will fail. The easiest fix is to remove that source from your src_cfg array.

If your project is more dynamic, you can register your compilation job like:

  register :compile, {
    path: File.join(Client.datadir, "working", "subjects_compiled.csv"),
    creator: {
      callee: Client::Jobs::Subjects::Compile,
      args: {sources: subject_srcs.select{ |key| Kiba::Extend::Job.output?(key) }}
    }
  }

Produce description field cleanup worksheet for client

This is old. Perhaps you would like to use the IterativeCleanup mixin instead?

Client has >10,000 source records. Their description field is free-text, but they have mostly entered data by cutting and pasting from the data entry guide for their previous system. However, some irregularities have crept in that they would like to clean up. Some of these irregularities may have been copied to many records by cloning records in the source system.

We want to provide an efficient way for the client to review and provide corrections for any irregular values.

We assume there will be other fields they may also want to clean up in a similar way, so we want to set up a pattern for this.

Create a namespace in the registry for cleanup worksheets. Register the output of the job that will create the description cleanup worksheet:

# in clientproject/lib/client/registry_data.rb

Client.registry.namespace("cleanup_wrkshts") do
  register :description, {
    path: File.join(Client.datadir, "to_client", "description_cleanup.csv"),
    creator: Client::Jobs::CleanupWrkshts::Description
  }
end

Here is the job to create the worksheet. The source data is in CSV format, with >10,000 rows and >30 columns:

# clientproject/lib/client/jobs/cleanup_wrkshts/description.rb

module Client
  module Jobs
    module CleanupWrkshts
      module Description
        module_function

        def job
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: :base__data,
              destination: :cleanup_wrkshts__description
            },
            transformer: xforms
          )
        end

        def xforms
          Kiba.job_segment do
            # We only care about the description values in the cleanup worksheet
            transform Delete::FieldsExcept,
              fields: :description
            transform FilterRows::FieldPopulated,
              action: :keep,
              field: :description
            transform Deduplicate::Table,
              field: :description,
              delete_field: false
            # Create the field they'll enter any corrections into
            transform Append::NilFields,
              fields: %i[corrected]
          end
        end
      end
    end
  end
end

This produces a CSV with ~500 rows, with all unique description field values: a much more manageable cleanup project.

The cleanup instructions specify that the client should only edit the corrected field.

If the existing value needs correction, they should enter the correct value in corrected. If the existing value is ok, they should leave corrected blank.
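For example (hypothetical values):

| description        | corrected           |
|--------------------+---------------------|
| Aquired from donor | Acquired from donor |
| Gift of the estate |                     |

Here the first value gets fixed, and the second is confirmed as-is.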

Note: If there is a need for the client to sort/review the existing and corrected values together as part of the cleanup, I will normally generate an Excel workbook from the worksheet CSV and add a review/sort column with a formula like:

=if(isblank(b2),a2,b2)

(This assumes original value in column A, corrected in column B)

Merging client corrections back into migration

This is old. Perhaps you would like to use the IterativeCleanup mixin instead?

This assumes you have set things up as described in the Produce description field cleanup worksheet for client section above.

Client returns the completed worksheet. If it was provided to them as an Excel workbook, convert it back to CSV.

Make a registry entry for this supplied file:

# in clientproject/lib/client/registry_data.rb

Client.registry.namespace("cleanup_done") do
  register :description, {
    path: File.join(Client.datadir, "from_client", "description_cleanup.csv"),
    supplied: true
  }
end

We will also set up a job to keep only the corrections, so our lookup and merge of corrections into the migration can be faster (this gets us negligible gains in most migration projects, but is illustrative…):

# clientproject/lib/client/jobs/cleanup_prep/description.rb

module Client
  module Jobs
    module CleanupPrep
      module Description
        module_function

        def job
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: :cleanup_done__description,
              destination: :cleanup_done__description_prep
            },
            transformer: xforms
          )
        end

        def xforms
          Kiba.job_segment do
            transform FilterRows::FieldPopulated,
              action: :keep,
              field: :corrected
          end
        end
      end
    end
  end
end

Make a registry entry for that job’s output. We are going to want to look up the corrected description value by the original description value, so we use lookup_on to record that, when used as a lookup source, the output of the :description_prep entry should be indexed on the :description field. Our cleanup_done registry namespace now looks like:

# in clientproject/lib/client/registry_data.rb

Client.registry.namespace("cleanup_done") do
  register :description, {
    path: File.join(Client.datadir, "from_client", "description_cleanup.csv"),
    supplied: true
  }
  register :description_prep, {
    path: File.join(Client.datadir, "working", "description_cleanup.csv"),
    creator: Client::Jobs::CleanupPrep::Description,
    lookup_on: :description
  }
end
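Conceptually, when :cleanup_done__description_prep is used as a lookup source, its rows get read into a hash keyed on :description values, each pointing to the matching row(s) — something like this sketch (hypothetical values):

{
  "Aquired from donor" => [
    {description: "Aquired from donor", corrected: "Acquired from donor"}
  ]
}

The merge job below uses this to find a correction by the original value.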

Finally, create and register a job that merges the corrected data:

Client.registry.namespace("cleanup_merge") do
  register :field_cleanup, {
    path: File.join(Client.datadir, "working", "cleaned_field_data_merged.csv"),
    creator: Client::Jobs::CleanupMerge::FieldCleanup
  }
end

# in clientproject/lib/client/jobs/cleanup_merge/field_cleanup.rb

module Client
  module Jobs
    module CleanupMerge
      module FieldCleanup
        module_function

        def job
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: :base__data,
              destination: :cleanup_merge__field_cleanup,
              lookup: %i[
                cleanup_done__description_prep
              ]
            },
            transformer: xforms
          )
        end

        def xforms
          Kiba.job_segment do
            %i[description].each do |field|
              corrfield = "#{field}_corr".to_sym

              # Merge each row's correction (if any) into a new
              #   "#{field}_corr" column
              transform Merge::MultiRowLookup,
                lookup: send("cleanup_done__#{field}_prep".to_sym),
                keycolumn: field,
                fieldmap: {corrfield => :corrected}

              # Overwrite the original value with the correction, when
              #   one was provided
              transform do |row|
                corrval = row[corrfield]
                next row if corrval.blank?

                row[field] = corrval
                row
              end
              transform Delete::Fields, fields: corrfield
            end
          end
        end
      end
    end
  end
end

This looks a little weird, but if we later add a date cleanup that follows the same pattern, we can merge that in by making two minor changes:

lookup: %i[
  cleanup_done__description_prep
]

becomes:

lookup: %i[
  cleanup_done__description_prep
  cleanup_done__date_prep
]

And

%i[description].each do |field|

becomes:

%i[description date].each do |field|

If you have a number of cleanup jobs like this, all following the same pattern, you could also automate the repetitive file registration and write the prep job (and perhaps the worksheet creation job) as jobs that take parameters.
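For example, the prep job could be parameterized along the same lines as ExtractFieldVals above (a sketch; the module name and registry key pattern are assumptions):

# clientproject/lib/client/jobs/cleanup_prep/field_vals.rb

module Client
  module Jobs
    module CleanupPrep
      module FieldVals
        module_function

        # @param field [Symbol] registry entry key of the cleaned field,
        #   without namespace
        def job(field:)
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: "cleanup_done__#{field}".to_sym,
              destination: "cleanup_done__#{field}_prep".to_sym
            },
            transformer: xforms
          )
        end

        def xforms
          Kiba.job_segment do
            # Keep only rows where the client entered a correction
            transform FilterRows::FieldPopulated,
              action: :keep,
              field: :corrected
          end
        end
      end
    end
  end
end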

Split one file into multiple files based on a field value

Use case example

You have provided a client with a list of terms and asked them to categorize terms into different authorities. You end up with something like:

| term                          | authority    |
|-------------------------------+--------------|
| Atlantic Ocean                | place        |
| Atlantis (ship)               | concept      |
| Automat                       | DROP         |
| Aux Galeries Lafayette (sign) | DROP         |
| Avenger (bomber)              | concept      |
| Avion (mobile home)           | concept      |
| Awaia Point                   | place        |
| Azima Odori (dance group)     | organization |
| Azteca Cafe                   | place        |

You need to create separate files for place, concept, and organization terms.

Create a job that takes arguments

module Project
  module Jobs
    module SubjectProper
      module Split
        module_function

        def job(authority:)
          Kiba::Extend::Jobs::Job.new(
            files: {
              source: :subject_proper__categorized,
              destination: "subject_proper__#{authority}".to_sym
            },
            transformer: xforms(authority)
          )
        end

        def xforms(authority)
          Kiba.job_segment do
            transform FilterRows::FieldEqualTo,
              action: :keep,
              field: :authority,
              value: authority
          end
        end
      end
    end
  end
end

Dynamically register the jobs

Project.registry.namespace("subject_proper") do
  ...

  %w[concept era event organization people person place].each do |authority|
    register authority.to_sym, {
      creator: {
        callee: Project::Jobs::SubjectProper::Split,
        args: {authority: authority}
      },
      path: File.join(Project.datadir, "working",
                      "subject_proper_#{authority}.csv"),
      tags: [:subject_proper, :split, "#{authority}s".to_sym]
    }
  end
end
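Note that no job is registered for the DROP authority, so rows the client marked DROP never match any split job and simply fall out of the migration.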

Now you can run the split jobs:

thor jobs tagged_and --tags subject_proper split