Refactor link checker to enable exclusion by shoulder#884
Conversation
…hich can be shoulders if anchoring at the first of the string with something like '^ark:/13030/c8' or can give additional flexibility if other excludes are needed in the future.
| LINKCHECKER_ID_EXCLUSION_ENABLED = True | ||
| LINKCHECKER_ID_EXCLUSION_FILE = 'path/to/id_exclusion_file.txt' | ||
|
|
||
| The id exclusion file should contain regular expression patterns, one per line, that |
There was a problem hiding this comment.
I feel a shoulder list is good enough. We may want to get inputs from Adam regarding on using regular expression in the shoulder exclusion file.
There was a problem hiding this comment.
Another option is to use the Python build-in string functions such as startwith(). It supports matching multiple patterns using a tuple of strings.
There was a problem hiding this comment.
Changed to use tuple instead of regex which gives us a bit less possible functionality, but likely does the job just fine.
|
Please refer to this code for the speed as a test example. If anything a regular expression will execute faster. I put this in a file and ran it. import shortuuid
import re
import time
# generate 100,000 random strings
strings_to_check = [str(shortuuid.uuid()) for _ in range(99999)]
strings_to_check.append('ark:/13030/c9f488d74')
shoulder_exclusions = ['ark:/13030/c8', 'ark:/13030/c9', 'ark:/13030/c7', 'ark:/13030/c6']
shoulder_regexes = [f"(?:^{x.strip()})" for x in shoulder_exclusions]
combined = "|".join(shoulder_regexes)
id_exclusion_regex = re.compile(combined, re.IGNORECASE)
output_list1 = []
output_list2 = []
start = time.perf_counter()
for string in strings_to_check:
if id_exclusion_regex.search(string):
print(f"Excluding: {string}")
else:
output_list1.append(string)
print(len(output_list1), "strings after exclusion using regex")
end = time.perf_counter()
print(f"Elapsed time: {end - start:.4f} seconds when using regex")
start = time.perf_counter()
for string in strings_to_check:
for shoulder in shoulder_exclusions:
ok_shoulder = True
if string.startswith(shoulder):
print(f"Excluding: {string}")
ok_shoulder = False
break
if ok_shoulder:
output_list2.append(string)
print(len(output_list1), "strings after exclusion using startwith")
end = time.perf_counter()
print(f"Elapsed time: {end - start:.4f} seconds when using startswith")Output: |
|
It sounds like there are a number of changes:
|
… skipped identifiers for testing
|
I've deployed this on dev and tested and it is working. (I had to change the log line to My exclude file contents for testing: Sample of the output from the link checker: So this is working correctly for exclusions which I put in for testing in the puppet repo. I'm going to go back and remove all the testing exclusions I put in to that file since Adam said he didn't have any real exclusions to put in yet. |
jsjiang
left a comment
There was a problem hiding this comment.
Changes look good. Making string comparison case insensitive is a good approach as we may have DOIs in uppercase in the database.
Thank you Scott!
Jing
This allows exclusion by regex, which would be the same as shoulder if using a regex like
^ark:/13030/c8which anchors at the beginning of the identifier string. It allows us flexibility if we need to do more involved exclusion later but works for this use case.I tried to follow the patterns already being used in these files.
I'm not sure about how to test this out, so I'll mark as a draft PR until we can discuss and I can run through that testing.
I'm also not sure about reporting from the link checker. I think this is an unknown task right now (about how reporting should be affected) and my belong in it's own task if we're overhauling it (and some users are already excluded and I'm not sure how that works with reporting).