Class: Kiba::Extend::Transforms::Deduplicate::Table

Inherits:
Object
Defined in:
lib/kiba/extend/transforms/deduplicate/table.rb

Overview

Note:

This transform holds the rows it keeps in memory until the end of the source, so for very large sources it may take a long time or fail. In that case, use a combination of Flag and FilterRows::FieldEqualTo instead.
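
The trade-off in the note can be illustrated with a plain-Ruby sketch of the flag-then-filter idea. This is not the kiba-extend Flag or FilterRows::FieldEqualTo API: the class name FlagSketch, the :duplicate field, and the "y"/"n" flag values are assumptions made for illustration. The point it shows is that a flagging step only needs to remember the values it has seen, not whole rows, and can pass each row downstream immediately.

```ruby
require "set"

# Hypothetical stand-in for a streaming "flag duplicates" step.
# It remembers only the values already seen, never the rows themselves.
class FlagSketch
  def initialize(on_field:, in_field:)
    @on_field = on_field
    @in_field = in_field
    @seen = Set.new
  end

  # Marks the row "n" on first sight of a value and "y" on later sights
  # (assumed flag values), then returns the row right away.
  def process(row)
    # Set#add? returns nil when the value is already in the set
    row[@in_field] = @seen.add?(row[@on_field]) ? "n" : "y"
    row
  end
end

rows = [
  {foo: "a", bar: "b", baz: "f", combined: "a b"},
  {foo: "c", bar: "d", baz: "g", combined: "c d"},
  {foo: "c", bar: "e", baz: "h", combined: "c e"},
  {foo: "c", bar: "d", baz: "i", combined: "c d"},
  {foo: "c", bar: "d", baz: "j", combined: "c d"}
]

flagger = FlagSketch.new(on_field: :combined, in_field: :duplicate)

# Flag each row as it streams past, then reject the flagged duplicates.
kept = rows.lazy
  .map { |row| flagger.process(row) }
  .reject { |row| row[:duplicate] == "y" }
  .to_a

kept.map { |row| row[:baz] } # => ["f", "g", "h"]
```

Because each first row is passed on before any later duplicate has been seen, a streaming approach like this cannot attach example values or occurrence counts to the kept row, which is what the buffering in this transform makes possible.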

Given a field on which to deduplicate, removes duplicate rows from the table.

Keeps the first row in which each value of the deduplicating field appears.

Tip: Use CombineValues::FromFieldsWithDelimiter or CombineValues::FullRecord to create a combined field on which to deduplicate.

Input table:

| foo | bar | baz |  combined |
|-----------------------------|
| a   | b   | f   | a b       |
| c   | d   | g   | c d       |
| c   | e   | h   | c e       |
| c   | d   | i   | c d       |
| c   | d   | j   | c d       |

Used in pipeline as:

transform Deduplicate::Table, field: :combined, delete_field: true

Results in:

| foo | bar | baz |
|-----------------|
| a   | b   | f   |
| c   | d   | g   |
| c   | e   | h   |

Used in pipeline as:

transform Deduplicate::Table, field: :combined, delete_field: true,
  example_source_field: :baz, max_examples: 2,
  example_target_field: :ex, example_delim: ";"

Results in:

| foo | bar | baz | ex |
|-----------------|----|
| a   | b   | f   | f  |
| c   | d   | g   | g;i|
| c   | e   | h   | h  |

Used in pipeline as:

transform Deduplicate::Table, field: :combined, delete_field: true,
  example_source_field: :baz, max_examples: 2,
  example_target_field: :ex, example_delim: ";", include_occs: true

Results in:

| foo | bar | baz | ex | occurrences |
|-----------------|----|-------------|
| a   | b   | f   | f  | 1           |
| c   | d   | g   | g;i| 3           |
| c   | e   | h   | h  | 1           |

Since:

  • 2.2.0

Instance Method Summary

Constructor Details

#initialize(field:, delete_field: false, example_source_field: nil, max_examples: 10, example_target_field: :examples, example_delim: Kiba::Extend.delim, include_occs: false, occs_target_field: :occurrences) ⇒ Table

Returns a new instance of Table.

Parameters:

  • field (Symbol)

    name of field on which to deduplicate

  • delete_field (Boolean) (defaults to: false)

    whether to delete the deduplication field after doing deduplication

  • example_source_field (nil, Symbol) (defaults to: nil)

    field containing values to be compiled as examples

  • max_examples (Integer) (defaults to: 10)

    maximum number of example values to return

  • example_target_field (Symbol) (defaults to: :examples)

    name of field in which to report example values

  • example_delim (String) (defaults to: Kiba::Extend.delim)

    used to join multiple example values

  • include_occs (Boolean) (defaults to: false)

    whether to report number of occurrences of each field value being deduplicated on

  • occs_target_field (Symbol) (defaults to: :occurrences)

    name of field in which to report occurrences

Since:

  • 2.2.0



# File 'lib/kiba/extend/transforms/deduplicate/table.rb', line 104

def initialize(field:, delete_field: false, example_source_field: nil,
  max_examples: 10, example_target_field: :examples,
  example_delim: Kiba::Extend.delim,
  include_occs: false, occs_target_field: :occurrences)
  @field = field
  @deduper = {}
  @delete = delete_field
  @example = example_source_field
  @max_examples = max_examples
  @ex_target = example_target_field
  @delim = example_delim
  @occs = include_occs
  @occ_target = occs_target_field
end

Instance Method Details

#close ⇒ Object

Since:

  • 2.2.0



# File 'lib/kiba/extend/transforms/deduplicate/table.rb', line 130

def close
  deduper.values.each do |hash|
    row = hash[:row]
    add_example_field(row, hash) if example
    row[occ_target] = hash[:occs] if occs
    row.delete(field) if delete
    yield row
  end
end

#process(row) ⇒ Object

Parameters:

  • row (Hash{ Symbol => String, nil })

Since:

  • 2.2.0



# File 'lib/kiba/extend/transforms/deduplicate/table.rb', line 120

def process(row)
  field_val = row.fetch(field, nil)
  return if field_val.blank?

  get_row(field_val, row)
  get_occ(field_val, row) if occs
  get_example(field_val, row) if example
  nil
end
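
The interplay of #process and #close shown above can be sketched in plain Ruby, without Kiba. In the source, #process always returns nil, so no row leaves the transform until #close yields the kept rows; rows whose deduplication field is blank are skipped altogether. The helper methods get_row, get_occ, get_example, and add_example_field are not shown on this page, so the bookkeeping below is an assumption about what they might do, not the library's implementation. The class name TableDedupSketch and the "|" default delimiter (standing in for Kiba::Extend.delim) are likewise hypothetical.

```ruby
# Plain-Ruby sketch of the buffer-then-emit pattern; not the kiba-extend source.
class TableDedupSketch
  def initialize(field:, delete_field: false, example_source_field: nil,
    max_examples: 10, example_target_field: :examples,
    example_delim: "|", include_occs: false,
    occs_target_field: :occurrences)
    @field = field
    @delete = delete_field
    @example = example_source_field
    @max_examples = max_examples
    @ex_target = example_target_field
    @delim = example_delim # "|" is an assumed stand-in for Kiba::Extend.delim
    @occs = include_occs
    @occ_target = occs_target_field
    @deduper = {} # field value => {row:, occs:, examples:}
  end

  # Buffers the row and returns nil, so nothing is emitted at this stage.
  def process(row)
    val = row[@field]
    # Plain-Ruby approximation of ActiveSupport's blank?
    return if val.nil? || val.to_s.strip.empty?

    # The first row seen for a value is the one that is kept (assumed get_row)
    entry = (@deduper[val] ||= {row: row, occs: 0, examples: []})
    entry[:occs] += 1                             # assumed get_occ
    entry[:examples] << row[@example] if @example # assumed get_example
    nil
  end

  # Emits each kept row once, after all rows have been processed.
  def close
    @deduper.each_value do |entry|
      row = entry[:row]
      if @example
        # Assumed add_example_field: cap the examples, then join them
        row[@ex_target] = entry[:examples].first(@max_examples).join(@delim)
      end
      row[@occ_target] = entry[:occs] if @occs
      row.delete(@field) if @delete
      yield row
    end
  end
end

rows = [
  {foo: "a", bar: "b", baz: "f", combined: "a b"},
  {foo: "c", bar: "d", baz: "g", combined: "c d"},
  {foo: "c", bar: "e", baz: "h", combined: "c e"},
  {foo: "c", bar: "d", baz: "i", combined: "c d"},
  {foo: "c", bar: "d", baz: "j", combined: "c d"}
]

xform = TableDedupSketch.new(
  field: :combined, delete_field: true,
  example_source_field: :baz, max_examples: 2,
  example_target_field: :ex, example_delim: ";", include_occs: true
)

rows.each { |row| xform.process(row) } # every call returns nil
result = []
xform.close { |row| result << row }

result.map { |row| [row[:baz], row[:ex], row[:occurrences]] }
# => [["f", "f", 1], ["g", "g;i", 3], ["h", "h", 1]]
```

Run against the input table from the Overview, the sketch reproduces the third documented result. Whether the real helpers deduplicate or drop blank example values before joining is not visible from this page, so the sketch does neither.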