Refactor link checker to enable exclusion by shoulder #884

Merged
sfisher merged 6 commits into develop from 875-refactor-link-checker-to-enable-exclusion-by-shoulder
Jun 12, 2025

Conversation

Contributor

sfisher commented May 27, 2025

This allows exclusion by regex, which would be the same as shoulder if using a regex like ^ark:/13030/c8 which anchors at the beginning of the identifier string. It allows us flexibility if we need to do more involved exclusion later but works for this use case.
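
As a minimal illustration of the anchoring point (only the `^ark:/13030/c8` pattern comes from the description above; the identifiers are made up for the example):

```python
import re

# Pattern from the description; the leading ^ anchors it at the start of the string.
shoulder_pattern = re.compile(r"^ark:/13030/c8")

# Hypothetical identifiers, for illustration only.
identifiers = [
    "ark:/13030/c8abc123",        # starts with the shoulder
    "ark:/13030/c9abc123",        # different shoulder
    "doi:10.1234/ark:/13030/c8",  # contains the text, but not at the start
]

# Only identifiers that begin with the shoulder are treated as excluded.
excluded = [i for i in identifiers if shoulder_pattern.search(i)]
print(excluded)  # ['ark:/13030/c8abc123']
```

Without the caret, the third identifier would match as well, which is why the anchor is what makes a regex behave like a shoulder prefix.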

I tried to follow the patterns already being used in these files.

I'm not sure about how to test this out, so I'll mark as a draft PR until we can discuss and I can run through that testing.

I'm also not sure about reporting from the link checker. I think how reporting should be affected is an unknown right now, and it may belong in its own task if we're overhauling it (some users are already excluded and I'm not sure how that works with reporting).

sfisher added 2 commits May 27, 2025 13:27
…hich can be shoulders

if anchoring at the first of the string with something like '^ark:/13030/c8' or can
give additional flexibility if other excludes are needed in the future.
@sfisher sfisher requested a review from jsjiang May 27, 2025 20:46
@sfisher sfisher linked an issue May 27, 2025 that may be closed by this pull request
@sfisher sfisher marked this pull request as draft May 27, 2025 20:46
LINKCHECKER_ID_EXCLUSION_ENABLED = True
LINKCHECKER_ID_EXCLUSION_FILE = 'path/to/id_exclusion_file.txt'

The id exclusion file should contain regular expression patterns, one per line, that
Contributor

I feel a shoulder list is good enough. We may want to get input from Adam on using regular expressions in the shoulder exclusion file.

Contributor

Another option is to use Python's built-in string methods such as startswith(). It supports matching multiple prefixes using a tuple of strings.
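
A short sketch of that suggestion, with hypothetical shoulder values and a hypothetical helper name (`is_excluded`); note that `str.startswith` requires a tuple here and raises `TypeError` if given a list:

```python
# Hypothetical shoulder prefixes, for illustration only.
shoulders = ("ark:/13030/c8", "ark:/13030/c9")

def is_excluded(identifier: str) -> bool:
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return identifier.startswith(shoulders)

print(is_excluded("ark:/13030/c9f488d74"))  # True
print(is_excluded("ark:/99999/fk4xyz"))     # False
```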

Contributor Author

👌

Contributor Author

Changed to use a tuple instead of a regex, which gives us a bit less flexibility but likely does the job just fine.

Comment thread ezidapp/management/commands/proc-link-checker.py Outdated
Contributor Author

sfisher commented May 29, 2025

Please refer to this code as a test example for the speed. If anything, a regular expression will execute faster. I put this in a file and ran it.

import shortuuid
import re
import time

# generate 99,999 random strings (one known ARK is appended below, for 100,000 total)
strings_to_check = [str(shortuuid.uuid()) for _ in range(99999)]

strings_to_check.append('ark:/13030/c9f488d74')

shoulder_exclusions = ['ark:/13030/c8', 'ark:/13030/c9', 'ark:/13030/c7', 'ark:/13030/c6']

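# re.escape keeps characters such as '.' in a shoulder from acting as regex metacharacters
shoulder_regexes = [f"(?:^{re.escape(x.strip())})" for x in shoulder_exclusions]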
combined = "|".join(shoulder_regexes)
id_exclusion_regex = re.compile(combined, re.IGNORECASE)

output_list1 = []
output_list2 = []

start = time.perf_counter()

for string in strings_to_check:
    if id_exclusion_regex.search(string):
        print(f"Excluding: {string}")
    else:
        output_list1.append(string)

print(len(output_list1), "strings after exclusion using regex")

end = time.perf_counter()
print(f"Elapsed time: {end - start:.4f} seconds when using regex")


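# time the same filtering using a Python-level loop over str.startswith
start = time.perf_counter()

for string in strings_to_check:
    # assume the string is kept until one of the shoulders matches
    ok_shoulder = True
    for shoulder in shoulder_exclusions:
        if string.startswith(shoulder):
            print(f"Excluding: {string}")
            ok_shoulder = False
            break

    if ok_shoulder:
        output_list2.append(string)

# report the size of the list built by the startswith loop
print(len(output_list2), "strings after exclusion using startwith")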

end = time.perf_counter()
print(f"Elapsed time: {end - start:.4f} seconds when using startswith")

Output:

Excluding: ark:/13030/c9f488d74
99999 strings after exclusion using regex
Elapsed time: 0.0106 seconds when using regex
Excluding: ark:/13030/c9f488d74
99999 strings after exclusion using startwith
Elapsed time: 0.0417 seconds when using startswith
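
A rough sketch of how the comparison could be made closer to like-for-like, under a few assumptions: the standard-library `uuid` module stands in for `shortuuid`, both branches are case-sensitive, the `startswith` branch uses the tuple form rather than a Python-level loop, and the function names are hypothetical. No timings are claimed here; the numbers depend on the machine and the Python version.

```python
import re
import timeit
import uuid

# Same shoulders as the script above.
shoulder_exclusions = ['ark:/13030/c8', 'ark:/13030/c9', 'ark:/13030/c7', 'ark:/13030/c6']

# Stand-in test data: uuid4 strings replace the shortuuid strings used above.
strings_to_check = [str(uuid.uuid4()) for _ in range(99_999)]
strings_to_check.append('ark:/13030/c9f488d74')

# re.escape keeps characters such as '.' literal; match() anchors at the start.
pattern = re.compile("|".join(re.escape(s) for s in shoulder_exclusions))
prefixes = tuple(shoulder_exclusions)

def keep_with_regex():
    return [s for s in strings_to_check if not pattern.match(s)]

def keep_with_startswith():
    return [s for s in strings_to_check if not s.startswith(prefixes)]

# Both approaches must agree before their timings are worth comparing.
assert keep_with_regex() == keep_with_startswith()

for func in (keep_with_regex, keep_with_startswith):
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(func, number=5, repeat=3))
    print(f"{func.__name__}: {best:.4f} seconds for 5 passes")
```

Keeping the `print` calls out of the timed functions means only the matching itself is measured.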

Contributor Author

sfisher commented May 29, 2025

It sounds like there are a number of changes:

  • Move the caret ^ out of the regular expressions in the file and into the code
  • Put in debugging print output to be sure items are being excluded (and create a file with shoulders to exclude)
  • Add an example file for people to refer to
  • uc3-ops-puppet-modules may need a file added for the exclusions that we have, in the uc3-ops-puppet-modules/modules/uc3_script_management/files/uc3-ezid-ui folder. See my notes here: https://github.com/CDLUC3/ezid/issues/867#issuecomment-2847970990

Contributor Author

sfisher commented Jun 4, 2025

I've deployed this on dev and tested it, and it is working. (I had to change the log line to info level temporarily on dev to see the output.)

My exclude file contents for testing:

$ cat ~/var/data/linkchecker_id_exclusion_list.txt
# put the list of ID prefixes (startswith, ie shoulders) here that should be excluded from link checking
ark:/12345/fk
ark:/13030/c7
doi:10.15697/FK
ark:/47881/m6
doi:10.5062/F4
ark:/99999/fk
doi:10.5060/D2
ark:/81983/s9
ark:/87278/s6
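
For readers unfamiliar with the file format, here is a minimal sketch of how a file like this could be read and applied. It is not the code from this PR: the function names `load_id_exclusions` and `is_excluded` are hypothetical, and skipping `#` comment lines and blank lines, as well as lowercasing both sides for a case-insensitive comparison, are assumptions.

```python
from pathlib import Path
import tempfile

def load_id_exclusions(path):
    """Return a tuple of lowercased prefixes, skipping blank lines and # comments."""
    prefixes = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            prefixes.append(line.lower())
    return tuple(prefixes)

def is_excluded(identifier, prefixes):
    # An empty tuple never matches, so an empty file excludes nothing.
    return identifier.lower().startswith(prefixes)

# Demo with a temporary file holding two of the lines shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "linkchecker_id_exclusion_list.txt"
    demo_file.write_text(
        "# test shoulders\nark:/12345/fk\n\ndoi:10.15697/FK\n", encoding="utf-8"
    )
    prefixes = load_id_exclusions(demo_file)

print(prefixes)                                       # ('ark:/12345/fk', 'doi:10.15697/fk')
print(is_excluded("ark:/12345/fk8mq0c", prefixes))    # True
print(is_excluded("doi:10.15697/fk1234", prefixes))   # True
print(is_excluded("ark:/13030/c72z12p53", prefixes))  # False
```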

Sample of the output from the link checker:

INFO ezidapp.management.commands.proc-link-checker 139967038674752: Link checker exclusion enabled with file: /apps/ezid/var/data/link_check_exclusion_list.txt
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Link checker ID exclusion enabled with file: /apps/ezid/var/data/linkchecker_id_exclusion_list.txt
INFO ezidapp.management.commands.proc-link-checker 139967038674752: exclusion file successfully loaded
INFO ezidapp.management.commands.proc-link-checker 139967038674752: id exclusion file successfully loaded
INFO ezidapp.management.commands.proc-link-checker 139967038674752: begin update table
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1234 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1235 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk3 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk8mq0c due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1234 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk1235 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk3 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/12345/fk8mq0c due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c72z12p53 due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c76m3337d due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c7833mx7t due to ID shoulder exclusion
INFO ezidapp.management.commands.proc-link-checker 139967038674752: Skipping identifier ark:/13030/c7bc3sx0w due to ID shoulder exclusion

So this is working correctly for exclusions which I put in for testing in the puppet repo.

I'm going to go back and remove all the testing exclusions I put into that file, since Adam said he didn't have any real exclusions to put in yet.

@sfisher sfisher marked this pull request as ready for review June 4, 2025 18:25
Contributor

@jsjiang jsjiang left a comment

Changes look good. Making string comparison case insensitive is a good approach as we may have DOIs in uppercase in the database.

Thank you Scott!

Jing

@sfisher sfisher changed the base branch from main to develop June 12, 2025 16:05
@sfisher sfisher merged commit 8fa0aaa into develop Jun 12, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Refactor link checker to enable exclusion by shoulder

2 participants