Class: Kiba::Extend::Transforms::Deduplicate::Table

Inherits:
Object
Defined in:
lib/kiba/extend/transforms/deduplicate/table.rb

Overview

Note:

This transform runs in memory, so it may take a long time or fail on very large sources. In that case, use a combination of Flag and FilterRows::FieldEqualTo instead.
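As a rough illustration of why a row-by-row approach can be lighter on memory, the sketch below remembers only the values it has already seen and passes each surviving row on immediately. It is plain Ruby and does not show how Flag or FilterRows::FieldEqualTo are implemented; `StreamingDeduper` is a hypothetical name used only here.

```ruby
require "set"

# Hypothetical streaming deduplicator (not part of kiba-extend).
# It stores only the values already seen, never whole rows.
class StreamingDeduper
  def initialize(field)
    @field = field
    @seen = Set.new
  end

  # Returns the row the first time its value appears, nil afterwards.
  def process(row)
    value = row[@field]
    # Set#add? returns nil when the value is already in the set.
    @seen.add?(value) ? row : nil
  end
end

deduper = StreamingDeduper.new(:combined)
rows = [{combined: "a b"}, {combined: "c d"}, {combined: "a b"}]
kept = rows.filter_map { |row| deduper.process(row) }
puts kept.length
```

The trade-off in this sketch is that the set of seen values still grows with the number of distinct values, so it reduces memory use rather than eliminating it.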

Given a field on which to deduplicate, removes duplicate rows from the table.

Keeps the first row in which each value of the deduplicating field appears. Rows in which the deduplicating field is blank are dropped.

Tip: Use CombineValues::FromFieldsWithDelimiter or CombineValues::FullRecord to create a combined field on which to deduplicate.

Input table:

| foo | bar | baz |  combined |
|-----------------------------|
| a   | b   | f   | a b       |
| c   | d   | g   | c d       |
| c   | e   | h   | c e       |
| c   | d   | i   | c d       |

Used in pipeline as:

transform Deduplicate::Table, field: :combined, delete_field: true

Results in:

| foo | bar | baz |
|-----------------|
| a   | b   | f   |
| c   | d   | g   |
| c   | e   | h   |
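The example above can be approximated in plain Ruby to check the expected result by hand. This is only a sketch: it does not call kiba-extend, the helper names `with_combined` and `dedupe_on` are hypothetical, and it ignores the transform's handling of blank values. It relies on `Array#uniq` with a block, which keeps the first element for each distinct block value.

```ruby
# Rows from the input table, before the combined field is added.
EXAMPLE_ROWS = [
  {foo: "a", bar: "b", baz: "f"},
  {foo: "c", bar: "d", baz: "g"},
  {foo: "c", bar: "e", baz: "h"},
  {foo: "c", bar: "d", baz: "i"}
].freeze

# Hypothetical helper: adds a :combined field built from the given fields.
def with_combined(rows, fields, delimiter: " ")
  rows.map { |row| row.merge(combined: row.values_at(*fields).join(delimiter)) }
end

# Hypothetical helper: keeps the first row per value of `field`,
# then drops `field` from each surviving row.
def dedupe_on(rows, field)
  rows.uniq { |row| row[field] }
      .map { |row| row.reject { |key, _value| key == field } }
end

combined = with_combined(EXAMPLE_ROWS, %i[foo bar])
deduped = dedupe_on(combined, :combined)
deduped.each { |row| puts row.values.join(" | ") }
# prints:
# a | b | f
# c | d | g
# c | e | h
```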

Since:

  • 2.2.0


Constructor Details

#initialize(field:, delete_field: false) ⇒ Table

Returns a new instance of Table.

Parameters:

  • field (Symbol)

    name of field on which to deduplicate

  • delete_field (Boolean) (defaults to: false)

    whether to delete the deduplication field after doing deduplication

Since:

  • 2.2.0



# File 'lib/kiba/extend/transforms/deduplicate/table.rb', line 50

def initialize(field:, delete_field: false)
  @field = field
  @deduper = {}
  @delete = delete_field
end

Instance Method Details

#close ⇒ Object

Since:

  • 2.2.0



# File 'lib/kiba/extend/transforms/deduplicate/table.rb', line 66

def close
  @deduper.values.each do |row|
    row.delete(@field) if @delete
    yield row
  end
end

#process(row) ⇒ Object

Parameters:

  • row (Hash{ Symbol => String, nil })

Since:

  • 2.2.0



# File 'lib/kiba/extend/transforms/deduplicate/table.rb', line 57

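def process(row)
  field_val = row.fetch(@field, nil)
  # Rows with a blank deduplication value are dropped
  return if field_val.blank?
  # Only the first row seen for each value is kept
  return if @deduper.key?(field_val)

  @deduper[field_val] = row
  # Nothing is passed downstream here; rows are emitted in #close
  nil
end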
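Because #process always returns nil, no row leaves the transform until #close yields the buffered rows. The sketch below mimics that interaction in plain Ruby, under two assumptions: `BufferingDeduper` is a hypothetical stand-in for the class documented here, with `blank?` replaced by a plain-Ruby check so the example runs without ActiveSupport, and `run_transform` is a hypothetical driver standing in for the pipeline runner.

```ruby
# Hypothetical stand-in for the documented transform.
class BufferingDeduper
  def initialize(field:, delete_field: false)
    @field = field
    @deduper = {}
    @delete = delete_field
  end

  # Buffers the first row per value; never returns a row.
  def process(row)
    value = row.fetch(@field, nil)
    # Plain-Ruby approximation of `blank?` for nil and string values.
    return if value.nil? || value.to_s.strip.empty?
    return if @deduper.key?(value)

    @deduper[value] = row
    nil
  end

  # Emits the buffered rows once all input has been processed.
  def close
    @deduper.each_value do |row|
      row.delete(@field) if @delete
      yield row
    end
  end
end

# Hypothetical driver: feeds rows through `process`, then collects
# whatever `close` yields.
def run_transform(transform, rows)
  output = []
  rows.each do |row|
    result = transform.process(row)
    output << result if result
  end
  transform.close { |row| output << row }
  output
end

rows = [
  {foo: "a", combined: "a b"},
  {foo: "x", combined: ""},
  {foo: "c", combined: "a b"}
]
result = run_transform(BufferingDeduper.new(field: :combined, delete_field: true), rows)
puts result.length
```

Note that, as in the documented code, the sketch deletes the field from the stored row itself, so the original row hashes are modified in place.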