Commit 4e66475

DhanashreePetare authored
Feat: Add on-the-fly compression conversion during download (Issue #18) (#45)
* Add on-the-fly compression conversion during download (Issue #18)
* lint fixes
* Fix PR #45 review feedback: Addresses all feedback from @Integer-Ctrl on PR #45
* Fix checksum validation order - validate BEFORE conversion

---------

Co-authored-by: DhanashreePetare <dhanashreepetare8125@gmail.com>
1 parent 54c7ff1 commit 4e66475
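
The last bullet of the commit message fixes the order of checksum validation and conversion. The sketch below illustrates why that order matters: recompressing a file changes its bytes, so a checksum published for the original download can only match before conversion. It assumes a SHA-256 checksum of the file as served, which this page does not confirm, and the names `sha256_hex` and `verify_then_convert` are hypothetical, not taken from the client code.

```python
import bz2
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def verify_then_convert(downloaded: bytes, expected_sha256: str) -> bytes:
    """Validate the downloaded bytes first, then recompress bz2 -> gzip.

    Hypothetical helper: the checksum is assumed to describe the file
    exactly as it was served, so it is compared before any conversion.
    """
    actual = sha256_hex(downloaded)
    if actual != expected_sha256:
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    # Only after the check passes are the bytes decompressed and recompressed.
    return gzip.compress(bz2.decompress(downloaded))


# Stand-in for a downloaded .bz2 file and its published checksum.
original = bz2.compress(b"example payload\n")
converted = verify_then_convert(original, sha256_hex(original))
```

Validating after conversion would compare the published checksum against the recompressed bytes, which differ from the original even when the download was intact.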

6 files changed

Lines changed: 463 additions & 26 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
    @@ -1,5 +1,6 @@
     # project-specific
     tmp/
    +test-download/
     vault-token.dat

     # Byte-compiled / optimized / DLL files

README.md

Lines changed: 41 additions & 15 deletions
    @@ -174,6 +174,10 @@ docker run --rm -v $(pwd):/data dbpedia/databus-python-client download $DOWNLOAD
     Note: Vault tokens are only required for certain protected Databus hosts (for example: `data.dbpedia.io`, `data.dev.dbpedia.link`). The client now detects those hosts and will fail early with a clear message if a token is required but not provided. Do not pass `--vault-token` for public downloads.
     - `--databus-key`
       - If the databus is protected and needs API key authentication, you can provide the API key with `--databus-key YOUR_API_KEY`.
    +- `--convert-to`
    +  - Enables on-the-fly compression format conversion during download. Supported formats: `bz2`, `gz`, `xz`. Downloaded files will be automatically decompressed and recompressed to the target format. Example: `--convert-to gz` converts all downloaded compressed files to gzip format.
    +- `--convert-from`
    +  - Optional filter to specify which source compression format should be converted. Use with `--convert-to` to convert only files with a specific compression format. Example: `--convert-to gz --convert-from bz2` converts only `.bz2` files to `.gz`, leaving other formats unchanged.

     **Help and further information on download command:**
     ```bash
    @@ -186,23 +190,33 @@ docker run --rm -v $(pwd):/data dbpedia/databus-python-client download --help
     Usage: databusclient download [OPTIONS] DATABUSURIS...

       Download datasets from databus, optionally using vault access if vault
    -  options are provided.
    +  options are provided. Supports on-the-fly compression format conversion
    +  using --convert-to and --convert-from options.

     Options:
    -  --localdir TEXT     Local databus folder (if not given, databus folder
    -                      structure is created in current working directory)
    -  --databus TEXT      Databus URL (if not given, inferred from databusuri,
    -                      e.g. https://databus.dbpedia.org/sparql)
    -  --vault-token TEXT  Path to Vault refresh token file
    -  --databus-key TEXT  Databus API key to download from protected databus
    -  --all-versions      When downloading artifacts, download all versions
    -                      instead of only the latest
    -  --authurl TEXT      Keycloak token endpoint URL [default:
    -                      https://auth.dbpedia.org/realms/dbpedia/protocol/openid-
    -                      connect/token]
    -  --clientid TEXT     Client ID for token exchange [default: vault-token-
    -                      exchange]
    -  --help              Show this message and exit.
    +  --localdir TEXT             Local databus folder (if not given, databus
    +                              folder structure is created in current working
    +                              directory)
    +  --databus TEXT              Databus URL (if not given, inferred from
    +                              databusuri, e.g.
    +                              https://databus.dbpedia.org/sparql)
    +  --vault-token TEXT          Path to Vault refresh token file
    +  --databus-key TEXT          Databus API key to download from protected
    +                              databus
    +  --all-versions              When downloading artifacts, download all
    +                              versions instead of only the latest
    +  --authurl TEXT              Keycloak token endpoint URL [default:
    +                              https://auth.dbpedia.org/realms/dbpedia/protocol
    +                              /openid-connect/token]
    +  --clientid TEXT             Client ID for token exchange [default: vault-
    +                              token-exchange]
    +  --convert-to [bz2|gz|xz]    Target compression format for on-the-fly
    +                              conversion during download (supported: bz2, gz,
    +                              xz)
    +  --convert-from [bz2|gz|xz]  Source compression format to convert from
    +                              (optional filter). Only files with this
    +                              compression will be converted.
    +  --help                      Show this message and exit.
     ```

     #### Examples of using the download command
    @@ -255,6 +269,18 @@ databusclient download 'PREFIX dcat: <http://www.w3.org/ns/dcat#> SELECT ?x WHER
     docker run --rm -v $(pwd):/data dbpedia/databus-python-client download 'PREFIX dcat: <http://www.w3.org/ns/dcat#> SELECT ?x WHERE { ?sub dcat:downloadURL ?x . } LIMIT 10' --databus https://databus.dbpedia.org/sparql
     ```

    +**Download with Compression Conversion**: download files and convert them to a different compression format on-the-fly
    +```bash
    +# Convert all compressed files to gzip format
    +databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01 --convert-to gz
    +
    +# Convert only bz2 files to xz format, leaving other compressions unchanged
    +databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals --convert-to xz --convert-from bz2
    +
    +# Download a collection and unify all files to bz2 format
    +databusclient download https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-12 --convert-to bz2
    +```
    +
     <a id="cli-deploy"></a>
     ### Deploy
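
The README changes above describe `--convert-to` as decompressing and recompressing each file, with `--convert-from` as an optional source filter. A minimal Python sketch of that behavior, using only the standard library, might look like the following. The function `convert_file` and the `OPENERS` table are hypothetical and are not the client's actual implementation. The sketch also assumes that files already in the target format are skipped and that the original file is removed after conversion, neither of which this page confirms.

```python
import bz2
import gzip
import lzma
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical mapping from the documented option values to stdlib openers.
OPENERS = {"bz2": bz2.open, "gz": gzip.open, "xz": lzma.open}


def convert_file(
    path: Path, convert_to: str, convert_from: Optional[str] = None
) -> Path:
    """Recompress `path` into `convert_to` and return the resulting path.

    The file is returned unchanged when its extension is not a supported
    compression, when it is already in the target format (assumption), or
    when `convert_from` is given and does not match the source format.
    """
    source = path.suffix.lstrip(".")
    if source not in OPENERS or source == convert_to:
        return path
    if convert_from is not None and source != convert_from:
        return path

    # data.ttl.bz2 -> data.ttl.gz: only the last suffix is swapped.
    target = path.with_suffix("." + convert_to)
    with OPENERS[source](path, "rb") as src, OPENERS[convert_to](target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copy in chunks, not all at once
    path.unlink()  # assumption: the original compressed file is not kept
    return target
```

With these assumptions, `convert_file(Path("data.ttl.bz2"), "xz", convert_from="bz2")` would write `data.ttl.xz`, while the same call on a `.gz` file would return it untouched, mirroring the second example in the diff.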
