Data Linking and Deduplication with ARC

Getting started

Starting a linking project with ARC is simple: load your data into a pair of Spark DataFrames and pass them to the Autolinker object in a list.
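A minimal sketch of that flow is below. The import path and the auto_link method name are assumptions rather than a verified API, and df_a and df_b stand for Spark DataFrames you have already loaded; check your installed version of ARC for the exact call signature.

```python
# Illustrative sketch only: the import path and method name below are
# assumptions, not verified against a specific ARC release.
from arc.autolinker import AutoLinker

autolinker = AutoLinker()

# Pass a list of two DataFrames to link them; pass a single DataFrame
# (data=df) to deduplicate it instead.
autolinker.auto_link(
    data=[df_a, df_b],   # df_a, df_b: pre-loaded Spark DataFrames
    unique_id="id",      # optional: name of an existing unique ID column
)
```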

However, ARC has many other optional arguments which can be used to fine tune its behaviour. These are detailed below, along with an explanation of when you would use them.

data: typing.Union[pyspark.sql.dataframe.DataFrame, list]

The input data to ARC. A single DataFrame will perform deduplication; a list of two DataFrames will perform linking between the two datasets. If linking, for best performance ensure schemas are standardised prior to linking (i.e. the two input tables should share an identical schema). This is not required: ARC can handle mismatched schemas, but you will get better performance with identical schemas.
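The single-table versus two-table behaviour can be mirrored with a small stand-in function (linkage_mode is a hypothetical helper for illustration, not part of ARC):

```python
def linkage_mode(data):
    """Mirror the dispatch described above: a single table means
    deduplication, a list of two tables means linking."""
    if isinstance(data, list):
        if len(data) != 2:
            raise ValueError("linking requires exactly two tables")
        return "link"
    return "dedupe"

print(linkage_mode("customers"))                  # dedupe
print(linkage_mode(["customers", "prospects"]))   # link
```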

attribute_columns:list=None

Which columns should be used to evaluate whether records are connected. In the case of linking two datasets, this only works with identical schemas. If not provided, ARC will use all columns (except a unique ID column, if provided) as input attributes to the model.

When to use

  • if you have very wide tables with a lot of extraneous information that will not help the model determine whether records represent the same thing.

  • if you want to link at a coarser granularity than the data are recorded at. For example, if you have a table of people and addresses and your aim is to link addresses, you could either drop the columns containing the people information, or use the attribute_columns arg to specify only the address columns.
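The address example can be sketched with plain Python records (the column names below are made up for illustration):

```python
# A person+address table, but the goal is to link at address granularity.
rows = [
    {"first_name": "Ada",  "surname": "Lovelace",
     "street": "12 High St", "city": "London", "postcode": "N1 9GU"},
    {"first_name": "Alan", "surname": "Turing",
     "street": "12 High Street", "city": "London", "postcode": "N1 9GU"},
]

# Equivalent in spirit to attribute_columns=["street", "city", "postcode"]:
# the person columns are never shown to the model.
address_columns = ["street", "city", "postcode"]
projected = [{k: r[k] for k in address_columns} for r in rows]
print(projected[0])
```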

unique_id:str=None

Which column contains a unique per-record ID. ARC will ignore this column for the purposes of linking. If None, ARC will append a new column to the dataset called unique_id.

When to use

  • if your data has a unique ID column. If it is not declared here, ARC will evaluate this column for linking, which at best will not help produce a good model.
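The fallback behaviour (appending a unique_id column when none is declared) amounts to something like the following; the sequential-integer scheme is an assumption for illustration, not necessarily what ARC does internally:

```python
rows = [{"name": "Ada"}, {"name": "Alan"}]

# Append a unique_id to every record; sequential integers are the
# simplest possible scheme (ARC's actual scheme may differ).
rows_with_id = [dict(r, unique_id=i) for i, r in enumerate(rows)]
print(rows_with_id)
```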

comparison_size_limit:int=100000

The maximum number of pairs of records a blocking rule will be allowed to generate. Blocking rules are heuristics used by Splink to control which records are considered potential duplicates. ARC auto-generates blocking rules for you, and uses this parameter to decide which rules to keep by counting the number of candidate pairs each rule generates.

When to use

  • if you want to speed up the model training process, try setting a lower value, e.g. 50,000.

  • this parameter directly impacts the recall of the model: if it is too low, you risk missing pairs. It is better to err on the side of too big than too small.

  • it is generally unnecessary to change this value.
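The arithmetic behind the limit is simple: a blocking rule that groups records on some key yields n*(n-1)/2 candidate pairs for each group of size n, and a rule whose total exceeds the limit is dropped. A toy version (the helper function and the tiny limit are illustrative, not ARC internals):

```python
from collections import Counter

def candidate_pairs(records, blocking_key):
    """Count the candidate pairs produced by blocking on one column:
    each group of n records contributes n*(n-1)/2 pairs."""
    group_sizes = Counter(r[blocking_key] for r in records)
    return sum(n * (n - 1) // 2 for n in group_sizes.values())

records = [
    {"city": "London", "postcode": "N1"},
    {"city": "London", "postcode": "N2"},
    {"city": "London", "postcode": "N1"},
    {"city": "Leeds",  "postcode": "LS1"},
]

limit = 2  # unrealistically small, just to show a rule being rejected
for key in ("city", "postcode"):
    n_pairs = candidate_pairs(records, key)
    print(f"block on {key}: {n_pairs} pairs, kept={n_pairs <= limit}")
```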

max_evals:int=5

The maximum number of evaluations ARC will perform during its hyperparameter optimisation search. The default of 5 will give a taste of how ARC works, but internal testing showed good linking results from runs of at least 100 evaluations.

When to use

  • set to 100 or more when training a model for proper evaluation. This will take a long time, but it is reflective of the size of the search space across which HyperOpt needs to explore to find the best set of arguments.

  • leave this at the default only during initial testing and evaluation.
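Conceptually, each evaluation is one trial of the optimiser: sample a candidate configuration, train and score a model, keep the best seen so far. A toy analogue (the score function is a stand-in for training one model, not ARC's real objective):

```python
import random

def score(config):
    """Stand-in for training and scoring one model: the closer the
    sampled threshold is to an arbitrary optimum, the better."""
    return 1.0 - abs(config["threshold"] - 0.72)

def optimise(max_evals, seed=42):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_evals):
        config = {"threshold": rng.uniform(0.5, 1.0)}
        s = score(config)
        if s > best_score:
            best_config, best_score = config, s
    return best_config, best_score

# More evaluations cover more of the search space:
_, best_after_5 = optimise(max_evals=5)
_, best_after_100 = optimise(max_evals=100)
print(f"best score after 5 evals:   {best_after_5:.3f}")
print(f"best score after 100 evals: {best_after_100:.3f}")
```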

cleaning:str="all"

Accepts "all" or "none". If set to "all", ARC will lowercase all string columns and remove non-alphanumeric characters from them.

When to use

  • set to "none" if you don't want ARC to do string cleaning for you, e.g. because your data is already clean.
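The "all" behaviour (lowercasing and stripping non-alphanumeric characters) is roughly equivalent to the regex below; whether ARC preserves spaces is an assumption here:

```python
import re

def clean(value: str) -> str:
    """Lowercase, then drop every character that is not a lowercase
    letter, digit, or space (space handling is an assumption)."""
    return re.sub(r"[^a-z0-9 ]", "", value.lower())

print(clean("O'Brien & Sons, Ltd."))
```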

threshold:float=0.9

The probability threshold above which a pair is considered a match. This is used in the optimisation process. Requires a value between 0 and 1.

When to use

  • use this to balance precision and recall: a higher value puts more emphasis on precision and less on recall; a lower value puts more emphasis on recall and less on precision.
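The trade-off is easy to see on a handful of scored pairs (the probabilities and labels below are invented for illustration):

```python
# (match probability, is actually a match) -- invented data
scored_pairs = [
    (0.98, True), (0.95, True), (0.92, True), (0.85, False),
    (0.82, True), (0.75, False), (0.60, False),
]

def precision_recall(pairs, threshold):
    """Treat every pair at or above the threshold as a match."""
    predicted = [(p, y) for p, y in pairs if p >= threshold]
    tp = sum(1 for _, y in predicted if y)
    total_true = sum(1 for _, y in pairs if y)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / total_true
    return precision, recall

# Raising the threshold trades recall for precision, and vice versa.
for t in (0.9, 0.8):
    p, r = precision_recall(scored_pairs, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```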

true_label:str=None

The column name which contains the true record IDs. If provided, ARC will automatically score its model against the known true labels.

When to use

  • if you have an already deduplicated / linked set of data that you want to use as a benchmark to assess ARC’s performance.
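Conceptually, scoring against a trusted label reduces to comparing the model's predicted links with the pairs implied by the labels. A simplified pairwise version (ARC's built-in scoring metric may differ):

```python
from itertools import combinations

# Each record carries a true_label identifying its real-world entity.
records = [
    {"id": 1, "true_label": "A"},
    {"id": 2, "true_label": "A"},
    {"id": 3, "true_label": "B"},
    {"id": 4, "true_label": "B"},
]

# Ground-truth pairs: every pair of records sharing a true_label.
labels = {r["id"]: r["true_label"] for r in records}
true_pairs = {
    (a, b) for a, b in combinations(sorted(labels), 2)
    if labels[a] == labels[b]
}

# Suppose the model predicted these links (invented for illustration):
predicted_pairs = {(1, 2), (2, 3)}

tp = len(predicted_pairs & true_pairs)
precision = tp / len(predicted_pairs)
recall = tp / len(true_pairs)
print(f"pairwise precision={precision:.2f}, recall={recall:.2f}")
```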

random_seed:int=42

Random seed for controlling the pseudo-RNG. Set for reproducibility between runs.

metric:str="information_gain_power_ratio"

Which metric ARC should use for its optimisation process. This will be deprecated in future releases and should not be changed.

sample_for_blocking_rules:bool=True

Downsample larger datasets to speed up blocking rule estimation.