Package 'r2dii.match' reference manual

Title:	Tools to Match Corporate Lending Portfolios with Climate Data
Description:	These tools implement in R a fundamental part of the software 'PACTA' (Paris Agreement Capital Transition Assessment), which is a free tool that calculates the alignment between financial portfolios and climate scenarios (<https://www.transitionmonitor.com/>). Financial institutions use 'PACTA' to study how their capital allocation decisions align with climate change mitigation goals. This package matches data from corporate lending portfolios to asset level data from market-intelligence databases (e.g. power plant capacities, emission factors, etc.). This is the first step to assess if a financial portfolio aligns with climate goals.
Authors:	Jacob Kastl [aut, cre, ctr], Alex Axthelm [aut, ctr], Jackson Hoffart [aut, ctr], Mauro Lepore [aut, ctr], Klaus Hagedorn [aut], Florence Palandri [aut], Evgeny Petrovsky [aut], RMI [cph, fnd]
Maintainer:	Jacob Kastl <[email protected]>
License:	MIT + file LICENSE
Version:	0.4.0.9000
Built:	2025-02-17 11:19:14 UTC
Source:	https://github.com/rmi-pacta/r2dii.match

Crucial `loanbook` columns for `match_name()`

Description

This helper returns the minimum set of loanbook columns you need to run match_name(). Passing more columns than necessary may cost extra time and memory.

Usage

crucial_lbk()

Value

A character vector.

Examples

crucial_lbk()
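
As a rough sketch of how this helper might be used, the snippet below trims a slice of the demo loanbook to the crucial columns before matching. It assumes that dplyr and r2dii.data are installed and that loanbook_demo contains every column returned by crucial_lbk(); the object names loanbook and slim_loanbook are illustrative only.

```r
library(dplyr)
library(r2dii.data)
library(r2dii.match)

# Small data for the example
loanbook <- head(loanbook_demo, 50)

# Keep only the columns that match_name() needs
# (assumption: all of them exist in loanbook_demo)
slim_loanbook <- loanbook %>%
  select(all_of(crucial_lbk()))

names(slim_loanbook)

# Match the trimmed loanbook against a small slice of the demo abcd data
match_name(slim_loanbook, head(abcd_demo, 50))
```

Note that trimming to the crucial columns also drops optional columns such as intermediate-parent names, so this is a trade-off between speed and the number of levels that can be matched.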

Data Dictionary

Description

A table of column names and descriptions of data frames used or exported by the functions in this package.

Usage

data_dictionary

Format

`data_dictionary`

dataset: Name of the dataset
column: Name of the column
typeof: Type of the column
definition: Definition of the column

Examples

data_dictionary

Match a loanbook to asset-based company data (abcd) by the `name_*` columns

Description

match_name() scores the match between names in a loanbook dataset (columns name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) and names in an asset-based company dataset (abcd; column name_company). The raw names are first transformed internally and assigned aliases. The similarity between the loanbook aliases and the abcd aliases is then scored using stringdist::stringsim().
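
To give a feel for the scoring step, the sketch below calls stringdist::stringsim() directly on two made-up company names. It is only an illustration: match_name() compares transformed aliases rather than raw names, and the variables a, b, jw and jaro are hypothetical.

```r
library(stringdist)

# Two illustrative names that differ only in punctuation and abbreviation
a <- "alpine knits india pvt. limited"
b <- "alpine knits india pvt ltd"

# Jaro-Winkler similarity with the prefix factor that match_name() uses by default
jw <- stringsim(a, b, method = "jw", p = 0.1)

# Plain Jaro similarity (prefix factor switched off)
jaro <- stringsim(a, b, method = "jw", p = 0)

# A pair would pass the default threshold if its score is at least min_score
jw >= 0.8
```

Because both names share a long common prefix, the Jaro-Winkler score is never lower than the plain Jaro score; identical strings always score 1.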

Usage

match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  join_id = NULL,
  sector_classification = default_sector_classification(),
  ...
)

Arguments

`loanbook`, `abcd`	data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.
`by_sector`	Should names only be compared if companies belong to the same `sector`?
`min_score`	A number between 0 and 1 that sets the minimum `score` threshold. A `score` of 1 is a perfect match.
`method`	Method for distance calculation. One of `c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")`. See stringdist::stringdist-metrics.
`p`	Prefix factor for Jaro-Winkler distance. The valid range for `p` is `0 <= p <= 0.25`. If `p = 0`, the Jaro distance is returned; the default here is `p = 0.1`. Applies only to `method = 'jw'`.
`overwrite`	A data frame used to overwrite the `sector` and/or `name` columns of a particular direct loantaker or ultimate parent. To overwrite only `sector`, the value in the `name` column should be `NA` and vice-versa. This file can be used to manually match loanbook companies to abcd.
`join_id`	A join specification passed to `dplyr::inner_join()`. If a character string, it assumes identical join columns between `loanbook` and `abcd`. If a named character vector, it uses the name as the join column of `loanbook` and the value as the join column of `abcd`.
`sector_classification`	A data frame containing sector classifications in the same format as `r2dii.data::sector_classifications`. The default value is `r2dii.data::sector_classifications`.
`...`	Arguments passed on to `stringdist::stringsim()`.

Value

A data frame with the same groups (if any) and columns as loanbook, and the additional columns:

id_2dii - an id used internally by match_name() to distinguish companies
level - the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)
sector - the sector of the loanbook company
sector_abcd - the sector of the abcd company
name - the name of the loanbook company
name_abcd - the name of the abcd company
score - the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)
source - the source of the match (equal to loanbook unless the match is from overwrite)

The returned rows depend on the argument min_score and on the values of the column score for each loan:

If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.
If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.
If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
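
The row rule above can be sketched in base R as follows. This is not the package's implementation: pick_rows() is a hypothetical helper, the data are made up, and "for each loan" is read here as grouping by an assumed id_loan column.

```r
# Hypothetical helper that mimics the documented row rule
pick_rows <- function(data, min_score = 0.8) {
  per_loan <- split(data, data$id_loan)
  kept <- lapply(per_loan, function(x) {
    if (any(x$score == 1)) {
      # Perfect matches exist: keep only those
      x[x$score == 1, , drop = FALSE]
    } else {
      # No perfect match: keep everything at or above the threshold
      x[x$score >= min_score, , drop = FALSE]
    }
  })
  out <- do.call(rbind, kept)
  rownames(out) <- NULL
  out
}

scored <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  score = c(1, 0.9, 0.85, 0.5, 0.3)
)

# L1 keeps its perfect match, L2 keeps the 0.85 row, L3 drops out
picked <- pick_rows(scored)
picked
```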

Assigning aliases

The transformation process used to compare names between loanbook and abcd datasets applies best practices commonly used in name matching algorithms:

Remove special characters.
Replace language specific characters.
Abbreviate certain names to reduce their importance in the matching.
Spell out numbers to increase their importance.
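
The steps above might look roughly like the following base R sketch. The function to_alias_sketch(), its character mapping, and its abbreviation list are all hypothetical and far smaller than anything a real matcher would need; the package's internal transformation may differ in every detail.

```r
# Hypothetical helper for illustration; not the function r2dii.match uses
to_alias_sketch <- function(x) {
  # Replace language-specific characters (tiny, lowercase-only mapping)
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00e9", "\u00f1", "\u00e7")
  to <- c("a", "o", "u", "e", "n", "c")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x <- tolower(x)

  # Spell out single digits so they carry more weight in the comparison
  words <- c("zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine")
  for (i in seq_along(words)) {
    x <- gsub(as.character(i - 1), words[i], x, fixed = TRUE)
  }

  # Abbreviate common terms so they carry less weight
  x <- gsub("\\blimited\\b", "ltd", x, perl = TRUE)
  x <- gsub("\\bcorporation\\b", "corp", x, perl = TRUE)

  # Remove special characters, including spaces and punctuation
  gsub("[^a-z]", "", x)
}

to_alias_sketch("Soci\u00e9t\u00e9 3 Limited")
```

In this sketch special characters are removed last rather than first, so that the accent mapping and the word-boundary patterns still see the original separators.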

Handling grouped data

This function ignores but preserves existing groups.

Examples

## Not run: 
library(r2dii.data)
library(tibble)

# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)

match_name(loanbook, abcd)

match_name(loanbook, abcd, min_score = 0.9)

# match on LEI
loanbook <- tibble(
  sector_classification_system = "NACE",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Won't fuzzy match",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Won't fuzzy match",
  lei_direct_loantaker = "LEI123"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)

match_name(loanbook, abcd, join_id = c(lei_direct_loantaker = "lei"))

# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "D35.11",
  code_system = "XYZ"
)

loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)

match_name(loanbook, abcd, sector_classification = your_classifications)

## End(Not run)

Pick rows where `score` is 1 and `level` per loan is of highest `priority`

Description

When multiple perfect matches are found per loan (e.g. a match at direct_loantaker level and ultimate_parent level), we must prioritize the desired match. By default, the highest priority is the most granular match (i.e. direct_loantaker).
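
As an illustration of the idea, the base R sketch below keeps one perfect match per loan according to a given priority vector. pick_by_priority() is a hypothetical helper written for this example, not the code behind prioritize(), and the data are made up.

```r
# Hypothetical helper: keep the highest-priority perfect match per loan
pick_by_priority <- function(data, priority) {
  perfect <- data[data$score == 1, , drop = FALSE]
  # Lower rank means higher priority
  rank <- match(perfect$level, priority)
  perfect <- perfect[order(perfect$id_loan, rank), , drop = FALSE]
  # After sorting, the first row of each loan is the preferred one
  out <- perfect[!duplicated(perfect$id_loan), , drop = FALSE]
  rownames(out) <- NULL
  out
}

matched <- data.frame(
  id_loan = c("aa", "aa", "bb", "bb"),
  level = c("ultimate_parent", "direct_loantaker",
            "intermediate_parent", "ultimate_parent"),
  score = 1
)

# Most granular level first
priority <- c("direct_loantaker", "intermediate_parent", "ultimate_parent")
picked <- pick_by_priority(matched, priority)
picked
```

With this priority, loan "aa" resolves to direct_loantaker and loan "bb" to intermediate_parent.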

Usage

prioritize(data, priority = NULL)

Arguments

data

A data frame like the validated output of match_name(). See Details on how to validate data.

priority

One of:

NULL: defaults to the default level priority as returned by prioritize_level().
A character vector giving a custom priority.
A function to apply to the output of prioritize_level(), e.g. rev.
A quosure-style lambda function, e.g. ~ rev(.x).

Details

How to validate data

Write the output of match_name() into a .csv file with:

# Writing to current working directory
matched %>%
  readr::write_csv("matched.csv")

Compare, edit, and save the data manually:

Open matched.csv with any spreadsheet editor (Excel, Google Sheets, etc.).
Compare the columns name and name_abcd manually to determine if the match is valid. Other information (sector, internal knowledge of the company structure, etc.) can be used alongside the names to confirm that the two entities match.
Edit the data:
- If you are happy with the match, set the score value to 1.
- Otherwise, set the score to any value other than 1, or leave it as is if it already differs from 1.
Save the edited file as, say, valid_matches.csv.

Re-read the edited file (validated) with:

# Reading from current working directory
valid_matches <- readr::read_csv("valid_matches.csv")
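
The whole round trip can also be rehearsed in a script. The sketch below swaps readr for base R's write.csv() and read.csv(), writes to a temporary file instead of the working directory, and uses made-up rows; the manual spreadsheet edit is simulated by assigning score in code.

```r
# Made-up matches standing in for the output of match_name()
matched <- data.frame(
  name = c("alpha power", "beta steel"),
  name_abcd = c("alpha power ltd", "gamma steel"),
  score = c(0.93, 0.85)
)

# Write the matches to a temporary .csv file
path <- tempfile(fileext = ".csv")
write.csv(matched, path, row.names = FALSE)

# Simulate the manual step: accept the first match by setting its score to 1
edited <- read.csv(path)
edited$score[edited$name == "alpha power"] <- 1
write.csv(edited, path, row.names = FALSE)

# Re-read the validated file and keep only the accepted matches
valid_matches <- read.csv(path)
accepted <- valid_matches[valid_matches$score == 1, , drop = FALSE]
accepted
```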

Value

A data frame with a single row per loan, where score is 1 and priority level is highest.

Handling grouped data

This function ignores but preserves existing groups.

Examples

library(dplyr)

# styler: off
matched <- tribble(
  ~sector, ~sector_abcd,  ~score, ~id_loan,                ~level,
   "coal",      "coal",       1,     "aa",     "ultimate_parent",
   "coal",      "coal",       1,     "aa",    "direct_loantaker",
   "coal",      "coal",       1,     "bb", "intermediate_parent",
   "coal",      "coal",       1,     "bb",     "ultimate_parent",
)
# styler: on

prioritize_level(matched)

# Using default priority
prioritize(matched)

# Using the reverse of the default priority
prioritize(matched, priority = rev)

# Same
prioritize(matched, priority = ~ rev(.x))

# Using a custom priority
bad_idea <- c("intermediate_parent", "ultimate_parent", "direct_loantaker")

prioritize(matched, priority = bad_idea)

Arrange unique `level` values in default order of `priority`

Description

Arrange unique level values in default order of priority

Usage

prioritize_level(data)

Arguments

data

A data frame, commonly the output of match_name().

Value

A character vector of the default level priority per loan.

Examples

matched <- tibble::tibble(
  level = c(
    "intermediate_parent_1",
    "direct_loantaker",
    "direct_loantaker",
    "direct_loantaker",
    "ultimate_parent",
    "intermediate_parent_2"
  )
)
prioritize_level(matched)
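
For intuition only, one plausible ordering can be sketched in base R. The documentation confirms that direct_loantaker has the highest default priority; placing sorted intermediate parents next and ultimate_parent last is an assumption of this sketch, and default_order_sketch() is a hypothetical name.

```r
# Hypothetical ordering: direct loantaker, then intermediate parents
# (sorted by suffix), then ultimate parent
default_order_sketch <- function(level) {
  u <- sort(unique(level))
  c(
    u[grepl("direct", u)],
    u[grepl("intermediate", u)],
    u[grepl("ultimate", u)]
  )
}

level <- c(
  "intermediate_parent_1",
  "direct_loantaker",
  "direct_loantaker",
  "direct_loantaker",
  "ultimate_parent",
  "intermediate_parent_2"
)

ordered_levels <- default_order_sketch(level)
ordered_levels
```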

