Skip to content

1 rewrite of scan batch dir#2

Open
bdgregg wants to merge 22 commits intomasterfrom
1-rewrite-of-scan-batch-dir
Open

1 rewrite of scan batch dir#2
bdgregg wants to merge 22 commits intomasterfrom
1-rewrite-of-scan-batch-dir

Conversation

@bdgregg
Copy link
Contributor

@bdgregg bdgregg commented Feb 26, 2026

This PR addresses issue #1.

This is a full rewrite of the original scan-batch-dir script to allow it to be more modular so that changes needed in the future should be easier to implement. This also added the ability to use PDF files as newspaper issues.

@bdgregg bdgregg self-assigned this Feb 26, 2026
@bdgregg bdgregg linked an issue Feb 26, 2026 that may be closed by this pull request
Copy link
Member

@ctgraham ctgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial readthrough with some questions and suggestions.

@@ -1,4 +1,5 @@
#!/usr/bin/python3
#!/usr/bin/python3.12

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly depending on 3.12 should be fine as long as the machine running the script has 3.12 (in the future, some machines may have versions >3.12). It might help to add a warning like if sys.version_info() < (3, 12): logger.warn("unsupported python version")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to not use the 3.6 default, but also not use an explicitly defined version would be to use #!/usr/bin/env python and then run the script within a virtual environment generated by python 3.12

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choosing of what version of python being run has been an ongoing issue for a while. There are various ways this can be achieved. None seem to be the best way. RHEL 8 ships with 3.6, RHEL 9 ships with 3.9, and the Ansible version used to build our systems requires 3.12. It is recommended by RedHat to not remove the shipped version as it impacts their code. But 3.6 is too old for some of the python modules we need to use. How best to not have to worry about the required version? Your test of sys.version_info() will probably help but only if the OS can expose the correct version to the script at runtime even if 3.12 is installed. RHEL has the 'alternatives' command but again changing the default version of python may impact OS required scripts is not recommended. This is why I hardcoded the 3.12 version. Additionally, I don't believe logger.warn will work as the logger hasn't been setup at the time of the check which I would do very early in the script.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the default version of python for a single script does seem extreme. The issue here would be that if python 3.12 is not on a system, then the script does not work (it would work if it is run as python scan-batch-dir but not via ./scan-batch-dir. I think a reasonable way to resolve this would be to add "requires python 3.12" somewhere and optionally add what to do if 3.12 not found

print(f"Invalid content in YAML file: {path}.")
raise
except Exception as e:
print(f"An unexpected error occurred while reading the YAML file: {str(e)}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log yaml error so it can be seen in log file if script is executed from non cli environment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ojas-uls-dev at the time of executing read_yaml_file the log hasn't been created yet for the function to write to the log. This is a catch 22 if we move the creation of the log file before the read_yaml_file as the config has the default log file settings. I'm not sure what the best approach is to resolve this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the fact that the log_file is a yaml config argument, and thought that this was just an instance of something that was overlooked. print seems reasonable in this case. It might be better to print to stderr instead of stdout though. The syntax for that would be print("some str", file=sys.stderr)

headers["Authorization"] = f"Bearer {auth_token}"

try:
response = requests.get(url, headers=headers, params=params)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.raise_for_status() can also be used to handle 403s and 404s in same try except block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ojas-uls-dev, are you suggesting halting the program if a 403/404 is encountered? And yes, we can add stanzas to handle 403/404 as well. Just wasn't sure you were indicating a failure should trigger a stop or just handle the 403/404 differently.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code is sufficient, but a common pattern is to cluster 4xx and 5xx http response codes with other errors that can be raised by requests.get. In this code, this would mean having 4xx and 5xx codes be handled by logic starting at line 750 rather than the logic starting at line 761.

Here is a code example clarifying my first statement

try:
    r = requests.get("https://www.example.com/") # can error for several reasons, for eg no internet connection
    r.raise_for_status() # errors if response code is 4xx or 5xx
except Exception as e:
    # 4xx and 5xx response code is handled by same code path as failure to get url
    print(f"failed to get url due to {e}")

A practical difference in this codebase would be that if a bearer token is invalid, a 403 would be raised, and error would be silently logged to log file with no feedback to user with current code. If 4xx codes are clustered with the requests.get errors, then in that case "Unexpected error" would be printed to have some visual feedback for the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rewrite of scan-batch-dir

3 participants