[SDXL ControlNet Training] Follow-up fixes#4188
Conversation
The diff under discussion:

```python
# fingerprint used by the cache for the other processes to load the result
# details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
new_fingerprint = Hasher.hash(args)
```
Are the args actually going to be good enough to create a hash? Multiple runs of the script might have the same args. Is that okay? I'm not sure; I haven't thought it through well enough.
Ideally, we can get the PID of the parent process if the parent process is accelerate and hash that. If the parent process is not accelerate, we don't have to pass any additional fingerprint.
> Multiple runs of the script might have the same args. Is that okay? I'm not sure; I haven't thought it through well enough.
If that is the case, we would want to avoid executing the map fn and instead load from the cache, no? Or would there be undesired consequences to that?
In any case, coming to your suggestion:
> Ideally, we can get the PID of the parent process if the parent process is accelerate and hash that. If the parent process is not accelerate, we don't have to pass any additional fingerprint.
Are you thinking of something like:
```python
with accelerator.main_process_first():
    from datasets.fingerprint import Hasher

    # fingerprint used by the cache for the other processes to load the result
    # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
    if accelerator.is_main_process:
        pid = os.getpid()
        new_fingerprint = Hasher.hash(pid)
    train_dataset = train_dataset.map(
        compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint
    )
```
Up to you whether it's okay to reload from the cache when calling map; I'm not as familiar with the script :) If it is okay, a comment in the code would be nice explaining under what circumstances that happens and why.
I'm not familiar with accelerate's forking model, i.e. whether one of the scripts themselves ends up being the parent process, or whether there's a separate accelerate launcher that forks into the children. If you need the parent PID (again, depending on who actually gets forked), you would call `os.getppid`.
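The parent-PID idea can be sketched with just the standard library. This is a hypothetical illustration, not the PR's code: the script itself uses datasets' `Hasher`, and whether `os.getppid()` actually returns the accelerate launcher depends on how the worker processes are spawned.

```python
import hashlib
import os


def run_fingerprint() -> str:
    """Derive a per-run fingerprint from the parent PID.

    Hypothetical sketch: if all ranks are forked by the same launcher
    process, os.getppid() is identical across ranks within one run, so
    every process computes the same fingerprint and resolves to the
    same cache entry, while a fresh run gets a fresh fingerprint.
    """
    parent_pid = os.getppid()
    return hashlib.sha1(str(parent_pid).encode()).hexdigest()[:16]


fp = run_fingerprint()
```

Within a single process the fingerprint is stable; a new launcher invocation would normally yield a different parent PID and hence a different fingerprint.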
@lhoestq WDYT? IMO, it should be okay to reload from the cache here.
> If that is the case, we would want to avoid executing the map fn and instead load from the cache, no? Or would there be undesired consequences to that?
> If that is the case, we would want to avoid executing the map fn and instead load from the cache, no? Or would there be undesired consequences to that?

Since the args are the same, it will reload from the cache in subsequent runs instead of reprocessing the data :)
The resulting dataset doesn't depend on whether accelerate is used, no?
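A minimal, stdlib-only illustration of why identical args lead to a cache hit. This is an analogy, not the library's implementation: `datasets.fingerprint.Hasher` uses its own hashing scheme, but the property that matters is the same — a deterministic function of the args.

```python
import hashlib
import json


def fingerprint_from_args(args: dict) -> str:
    # Deterministic serialization: equal args always produce the same
    # fingerprint, so a rerun with identical args resolves to the same
    # cached map() result instead of reprocessing the data.
    blob = json.dumps(args, sort_keys=True).encode()
    return hashlib.sha1(blob).hexdigest()[:16]


a = fingerprint_from_args({"resolution": 1024, "proportion_empty_prompts": 0.0})
b = fingerprint_from_args({"proportion_empty_prompts": 0.0, "resolution": 1024})
assert a == b  # key order doesn't matter; the fingerprint is stable across runs
```

This is also why hashing the PID was proposed: it injects a per-run value into the fingerprint when you *don't* want two runs with identical args to share a cache entry.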
* hash computation, thanks to @lhoestq
* disable dtype casting
* remove comments
Fixes the issues from #4038.
Related to: #4089