Refactor link checker to enable exclusion by shoulder by sfisher · Pull Request #884 · CDLUC3/ezid

sfisher · 2025-05-27T20:46:28Z

This allows exclusion by regex, which is equivalent to exclusion by shoulder when the regex is anchored at the beginning of the identifier string, as in ^ark:/13030/c8. It gives us flexibility if we need more involved exclusion later, but works for this use case.
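As a rough illustration of the anchored-regex idea, here is a minimal sketch. The pattern list and the `is_excluded_by_regex` helper are hypothetical names chosen for this example and are not taken from the PR's code.

```python
import re

# Hypothetical patterns, as they might appear one per line in an exclusion file.
# Each pattern carries its own ^ anchor so it only matches at the start of the ID.
patterns = [r"^ark:/13030/c8", r"^doi:10\.5062/F4"]

# Combine the patterns into a single alternation so each ID is scanned once.
exclusion_regex = re.compile("|".join(f"(?:{p})" for p in patterns))


def is_excluded_by_regex(identifier: str) -> bool:
    """Return True when the identifier matches any exclusion pattern."""
    return exclusion_regex.search(identifier) is not None
```

Because of the anchor, an identifier that merely contains `ark:/13030/c8` somewhere in the middle is not excluded.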

I tried to follow the patterns already being used in these files.

I'm not sure about how to test this out, so I'll mark as a draft PR until we can discuss and I can run through that testing.

I'm also not sure about reporting from the link checker. I think how reporting should be affected is an unknown right now, and it may belong in its own task if we're overhauling it (some users are already excluded and I'm not sure how that works with reporting).

jsjiang · 2025-05-27T23:53:32Z

+    LINKCHECKER_ID_EXCLUSION_ENABLED = True
+    LINKCHECKER_ID_EXCLUSION_FILE = 'path/to/id_exclusion_file.txt'
+
+The id exclusion file should contain regular expression patterns, one per line, that


I feel a shoulder list is good enough. We may want to get input from Adam regarding the use of regular expressions in the shoulder exclusion file.

Another option is to use Python's built-in string methods such as startswith(), which supports matching multiple prefixes when given a tuple of strings.
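A small sketch of that suggestion follows. The `SHOULDERS` tuple and `is_excluded_by_prefix` are illustrative names only; note that `str.startswith` accepts a tuple of prefixes but not a list.

```python
# Hypothetical shoulder prefixes to exclude; must be a tuple for str.startswith.
SHOULDERS = ("ark:/13030/c8", "ark:/13030/c9")


def is_excluded_by_prefix(identifier: str, prefixes: tuple = SHOULDERS) -> bool:
    """Return True when the identifier starts with any of the given prefixes."""
    return identifier.startswith(prefixes)
```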

Changed to use a tuple instead of a regex, which gives us a bit less functionality but likely does the job just fine.

sfisher · 2025-05-29T17:07:58Z

Please refer to this code as a speed test example. If anything, a regular expression executes faster. I put this in a file and ran it.

import shortuuid
import re
import time

# generate 99,999 random strings; one known ARK is appended below for 100,000 total
strings_to_check = [str(shortuuid.uuid()) for _ in range(99999)]

strings_to_check.append('ark:/13030/c9f488d74')

shoulder_exclusions = ['ark:/13030/c8', 'ark:/13030/c9', 'ark:/13030/c7', 'ark:/13030/c6']

shoulder_regexes = [f"(?:^{x.strip()})" for x in shoulder_exclusions]
combined = "|".join(shoulder_regexes)
id_exclusion_regex = re.compile(combined, re.IGNORECASE)

output_list1 = []
output_list2 = []

start = time.perf_counter()

for string in strings_to_check:
    if id_exclusion_regex.search(string):
        print(f"Excluding: {string}")
    else:
        output_list1.append(string)

print(len(output_list1), "strings after exclusion using regex")

end = time.perf_counter()
print(f"Elapsed time: {end - start:.4f} seconds when using regex")


start = time.perf_counter()

for string in strings_to_check:
    # reset the flag once per string, before checking any shoulder
    ok_shoulder = True
    for shoulder in shoulder_exclusions:
        if string.startswith(shoulder):
            print(f"Excluding: {string}")
            ok_shoulder = False
            break

    if ok_shoulder:
        output_list2.append(string)

# report the startswith result list (output_list2), not the regex one
print(len(output_list2), "strings after exclusion using startwith")

end = time.perf_counter()
print(f"Elapsed time: {end - start:.4f} seconds when using startswith")

Output:

Excluding: ark:/13030/c9f488d74
99999 strings after exclusion using regex
Elapsed time: 0.0106 seconds when using regex
Excluding: ark:/13030/c9f488d74
99999 strings after exclusion using startwith
Elapsed time: 0.0417 seconds when using startswith
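The comparison above loops over the shoulders one at a time in Python. For reference, a variant that passes the whole tuple to startswith() could be timed along these lines. This is only a sketch under stated assumptions: it substitutes the standard library's uuid for shortuuid, the function names are made up for the example, and no timing result is claimed here since it will vary by machine and Python version.

```python
import re
import timeit
import uuid

# 99,999 random hex strings plus one known ARK that should be excluded.
ids = [uuid.uuid4().hex for _ in range(99_999)]
ids.append("ark:/13030/c9f488d74")

shoulders = ("ark:/13030/c8", "ark:/13030/c9", "ark:/13030/c7", "ark:/13030/c6")

# Regex variant: escape each shoulder and rely on match() to anchor at the start.
pattern = re.compile("|".join(f"(?:{re.escape(s)})" for s in shoulders), re.IGNORECASE)

# startswith variant: lowercase both sides to mirror re.IGNORECASE.
lowered_shoulders = tuple(s.lower() for s in shoulders)


def keep_with_regex():
    return [i for i in ids if not pattern.match(i)]


def keep_with_startswith():
    return [i for i in ids if not i.lower().startswith(lowered_shoulders)]


if __name__ == "__main__":
    for fn in (keep_with_regex, keep_with_startswith):
        seconds = timeit.timeit(fn, number=5)
        print(f"{fn.__name__}: {seconds:.4f} seconds for 5 runs")
```

Both functions should keep the same identifiers, so the choice between them comes down to speed and readability rather than behavior.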

sfisher · 2025-05-29T18:23:48Z

It sounds like there are a number of changes:

Move the caret ^ out of the regular expressions in the file and into the code
Put in debugging print output to be sure items are being excluded (and create a file with shoulders to exclude)
Add an example file for people to refer to
uc3-ops-puppet-modules may need a file added for the exclusions that we have, in the uc3-ops-puppet-modules/modules/uc3_script_management/files/uc3-ezid-ui folder. See my notes here: https://github.com/CDLUC3/ezid/issues/867#issuecomment-2847970990

sfisher · 2025-06-04T18:17:00Z

I've deployed this on dev and tested it, and it is working. (I had to change the log line to info level temporarily on dev to see the output.)

My exclude file contents for testing:

$ cat ~/var/data/linkchecker_id_exclusion_list.txt
# put the list of ID prefixes (startswith, ie shoulders) here that should be excluded from link checking
ark:/12345/fk
ark:/13030/c7
doi:10.15697/FK
ark:/47881/m6
doi:10.5062/F4
ark:/99999/fk
doi:10.5060/D2
ark:/81983/s9
ark:/87278/s6
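A file in this format could be read into a tuple of prefixes roughly as follows. This is a sketch, not the PR's loader: `parse_id_exclusions` and `load_id_exclusions` are hypothetical names, and skipping blank lines and lines starting with # is an assumption based on the comment line in the sample file.

```python
from pathlib import Path


def parse_id_exclusions(text: str) -> tuple:
    """Turn exclusion-file text into a tuple of ID prefixes.

    Assumption: blank lines and lines beginning with '#' are ignored.
    """
    prefixes = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        prefixes.append(entry)
    return tuple(prefixes)


def load_id_exclusions(path: str) -> tuple:
    """Read the exclusion file at `path` and parse it into prefixes."""
    return parse_id_exclusions(Path(path).read_text(encoding="utf-8"))
```

Returning a tuple means the result can be handed straight to str.startswith().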

Sample of the output from the link checker:

INFO ezidapp.management.commands.proc-link-checker 139967038674752: Link checker exclusion enabled with file: /apps/ezid/var/data/link_check_exclusion_list.txt
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Link checker ID exclusion enabled with file: /apps/ezid/var/data/linkchecker_id_exclusion_list.txt
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: exclusion file successfully loaded
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: id exclusion file successfully loaded
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: begin update table
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1234 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1235 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk3 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk8mq0c due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1234 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1235 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk3 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk8mq0c due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c72z12p53 due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c76m3337d due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c7833mx7t due to ID shoulder exclusion
    INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c7bc3sx0w due to ID shoulder exclusion

So this is working correctly for exclusions which I put in for testing in the puppet repo.

I'm going to go back and remove all the testing exclusions I put into that file, since Adam said he didn't have any real exclusions to put in yet.

jsjiang

Changes look good. Making string comparison case insensitive is a good approach as we may have DOIs in uppercase in the database.

Thank you Scott!

Jing
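The case-insensitive comparison mentioned in this review can be sketched as below. The `filter_excluded` helper is a hypothetical name, and the debug message simply echoes the wording seen in the log sample in this thread; the actual implementation in the PR may differ.

```python
import logging

log = logging.getLogger(__name__)


def filter_excluded(identifiers, excluded_prefixes):
    """Return the identifiers that do not start with any excluded prefix.

    Both sides are lowercased so 'doi:10.5062/F4' also matches 'doi:10.5062/f4...'.
    """
    lowered = tuple(p.lower() for p in excluded_prefixes)
    kept = []
    for identifier in identifiers:
        if identifier.lower().startswith(lowered):
            log.debug("Skipping identifier %s due to ID shoulder exclusion", identifier)
            continue
        kept.append(identifier)
    return kept
```

With an empty tuple of prefixes, startswith() returns False for every identifier, so nothing is skipped.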

sfisher added 2 commits May 27, 2025 13:27

This creates configuration for ID exclusion based on REGEX patterns which can be shoulders if anchoring at the first of the string with something like '^ark:/13030/c8' or can give additional flexibility if other excludes are needed in the future.

cfc1ca6

Add to documentation for ID exclusion

fe4e8a0

sfisher requested a review from jsjiang May 27, 2025 20:46

sfisher linked an issue May 27, 2025 that may be closed by this pull request

Refactor link checker to enable exclusion by shoulder #875

Closed

sfisher marked this pull request as draft May 27, 2025 20:46

jsjiang reviewed May 28, 2025

sfisher added 4 commits June 2, 2025 16:48

Moving Caret to anchor at first of string out of file

7be75b4

Change combined regex to ID exclusion tuple for using with startswith.

fccaae8

change comments to reflect code changes.

a97833f

Make the matching case-insensitive and also add log.debug message for skipped identifiers for testing

10f536d

sfisher marked this pull request as ready for review June 4, 2025 18:25

jsjiang approved these changes Jun 5, 2025

sfisher changed the base branch from main to develop June 12, 2025 16:05

sfisher merged commit 8fa0aaa into develop Jun 12, 2025
1 check passed

sfisher merged 6 commits into develop from 875-refactor-link-checker-to-enable-exclusion-by-shoulder
