Recently several people (myself included) have had Python multiprocessing timeout errors when processing large amounts of data (multiple visits, CCDs, etc.) with some -j option. I have tracked this behavior down to how we are setting the timeout value within CommandLineTask. Essentially, if a timeout is not specified to the argument parser, a default value of 9999 seconds is set. The processing pool then goes about its job of processing the data and starts a method that will eventually fetch the pool's results. This method is supplied the timeout value and begins counting down as soon as it is started. If the timeout is reached before the pool's results are available, a timeout error is raised (this all happens within the multiprocessing module).
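For reference, here is a minimal sketch of the mechanism using plain multiprocessing rather than CommandLineTask itself; the worker function and the timings are made up purely for illustration:

```python
import multiprocessing
import time


def process_one(seconds):
    """Stand-in for per-visit/CCD processing; just sleeps to simulate work."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    result = pool.map_async(process_one, [1, 2, 3, 4])
    pool.close()
    try:
        # The countdown starts as soon as get() is called; if the pool has not
        # returned all results within `timeout` seconds, TimeoutError is raised.
        values = result.get(timeout=9999)
        print(values)
    except multiprocessing.TimeoutError:
        print("pool did not finish before the timeout expired")
```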
The problem in our case arises from the fact that some processing jobs may take longer than the default 9999 s. We have two options to address this issue. One option is to scale the input timeout by the number of elements to process divided by the number of processors available. This ensures there is still a timeout, but one that scales with the workload. If we go this route, the documentation should be updated to indicate that the user-supplied timeout applies to a single unit of processing and will be scaled by the workload. The other option is to not set a default timeout value at all: when None is passed as the timeout, the process is allowed to run forever. This has the benefit of never underestimating the time a workload will take to complete, but has the drawback that if something does go wrong, the process will never indicate to the user that it should be killed. If we go with unlimited runtime, we should still scale any user-supplied timeout by the amount of work the user has requested.
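A rough sketch of what the scaled-timeout option could look like; the scaled_timeout helper and the surrounding driver code are hypothetical, not existing CommandLineTask code:

```python
import math
import multiprocessing
import time


def process_one(item):
    time.sleep(0.1)  # stand-in for real per-item processing
    return item


def scaled_timeout(per_item_timeout, num_items, num_processes):
    """Scale a per-item timeout by the number of batches the pool must run."""
    return per_item_timeout * math.ceil(num_items / num_processes)


if __name__ == "__main__":
    items = list(range(20))
    num_processes = 4
    per_item_timeout = 5.0  # what the user supplies: time allowed per item

    pool = multiprocessing.Pool(processes=num_processes)
    result = pool.map_async(process_one, items)
    pool.close()

    # Option 1: scale the user-supplied per-item timeout by the workload.
    values = result.get(
        timeout=scaled_timeout(per_item_timeout, len(items), num_processes))

    # Option 2 would instead pass timeout=None, which makes get() block
    # until every item has finished, however long that takes.
    print(values)
```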
Are there any preferences on which route we go with? Or are there any alternative ideas?