[BEAM-57] Makes FileBasedSink use a temporary directory#1050
[BEAM-57] Makes FileBasedSink use a temporary directory#1050jkff wants to merge 2 commits intoapache:masterfrom
Conversation
|
R: @dhalperi |
|
R: @lukecwik |
dhalperi
left a comment
There was a problem hiding this comment.
My 2c: this seems an overly invasive change for what it's trying to do. I would just write the temp folders to a temp directory inside of the path.
Other comments:
- Don't use Path.resolve-like functions -- may not work on all filesystems, which is why we have IOChannelFactory.resolve
- Sibling resolution seems wrong and/or not guaranteed to work. Subdirectory seems simpler and more reliable.
Would like Luke's eyes to get some of the finer details, (e.g., does all this work on Windows?) and given that I've continued to be too randomized for this in the last 9 days Eugene could use the sharding.
| .apply(MapElements | ||
| .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()) | ||
| .withOutputType(TypeDescriptors.strings())) | ||
| .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX")); |
|
Both subdirectories and siblings have issues but subdirectory seems to be the lesser of two evils as the fallback. There is also PipelineOptions.tempLocation but that may not work at all since the IOChannelFactory may not be equivalent or there will be other underlying storage differences like the GCS bucket may be part of a different GCP project. In the referenced example of the person reading the pipeline in later, paths that are ** will still encompass the temp files. Sibling fails when the output location is a GCS bucket since you can't just create arbitrary siblings at that level or in a local file system the path is "/". Does it make sense for the IOChannelFactory to own creating temp file locations? |
|
|
Ping. |
|
I think this is something we should bring back to the dev list since there isn't a clean solution either way and both have drawbacks. |
|
Discussion happening on https://www.mail-archive.com/dev@beam.incubator.apache.org/msg01726.html |
| "temp-beam-" | ||
| + baseOutputPath.getFileName() | ||
| + "-" | ||
| + Instant.now().toString(DateTimeFormat.forPattern("yyyy-MM-DD_HH-mm-ss"))) |
There was a problem hiding this comment.
This is highlighting a general confusion (on my part) between Path and URI. It seems like we make this mistake all over the place, so this need not block this PR.
cc: @peihe
Original thought, can be ignored: maybe this should be toURI().toString() ?
When writing to /path/to/foo, temporary files would be written to /path/too/foo-temp-$uid (or something like that), i.e. as siblings of the final output. That could lead to issues like http://stackoverflow.com/q/39822859/278042 Now, temporary files are written to a path like: /path/too/temp-beam-foo-$date/$uid.
1c34acd to
fa3b125
Compare
|
Rebased; per thread on beam-dev@ and per offline discussion with @dhalperi the approach the PR is taking is okay. |
|
Failures look flaky (connection errors with apache repos). |
| throw new UnsupportedOperationException(); | ||
| GcsPath parent = getParent(); | ||
| return (parent == null) ? fromUri(other) : parent.resolve(other); | ||
| } |
There was a problem hiding this comment.
unit tests of 3 new implementations?
|
LGTM, merged. |
This is a weaker but backward-compatible version of respective Beam changes: apache/beam#1050 apache/beam#1278
This is a weaker but backward-compatible version of respective Beam changes: apache/beam#1050 apache/beam#1278
When writing to
/path/to/foo, temporary files would be written to/path/to/foo-temp-$uid(or something like that), i.e. as siblings of the final output. That could lead to issues like http://stackoverflow.com/q/39822859/278042Now, temporary files are written to a path like:
/path/to/temp-beam-foo-$date/$uid. This way, the temporary files won't match the same glob as the final output files (even though they may still fail to be deleted due to eventual consistency issues).