The massive computation power and storage capacity of cloud computing systems allow scientists to deploy computation- and data-intensive applications without infrastructure investment, and large application datasets can be stored in the cloud. However, under the pay-as-you-go model, datasets should be stored strategically in order to reduce the overall application cost. In this paper, we utilise the Data Dependency Graph (DDG) derived from data provenance in scientific applications, with which deleted datasets can be regenerated, to develop a novel cost-effective dataset storage strategy that automatically stores appropriate datasets in the cloud. This strategy achieves a localised optimal trade-off between computation and storage while also taking users' tolerance of data access delay into consideration. Simulations conducted on general (random) datasets and on a specific astrophysics pulsar searching application with Amazon's cost model show that our strategy can reduce the application cost significantly.
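The trade-off between computation and storage described above can be illustrated with a minimal sketch. It is not the strategy developed in the paper: it assumes a purely linear DDG, makes a greedy per-dataset decision, and ignores users' tolerance of access delay. The `Dataset` fields, the `plan_storage` function, and all prices are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    gen_cost: float      # assumed cost ($) to compute it from its direct predecessor
    storage_rate: float  # assumed cost ($ per day) to keep it stored
    uses_per_day: float  # assumed average access frequency


def plan_storage(chain):
    """Greedy store/delete decision for a linear chain of datasets.

    A deleted dataset must be regenerated from the nearest stored
    predecessor, so its regeneration cost includes the generation cost
    of every consecutive deleted dataset upstream of it.
    """
    decisions = {}
    total_rate = 0.0             # total cost per day of the resulting plan
    upstream_deleted_cost = 0.0  # generation cost of deleted predecessors

    for d in chain:
        regen_cost = upstream_deleted_cost + d.gen_cost
        regen_rate = regen_cost * d.uses_per_day  # $ per day if deleted
        if d.storage_rate <= regen_rate:
            decisions[d.name] = "store"
            total_rate += d.storage_rate
            upstream_deleted_cost = 0.0
        else:
            decisions[d.name] = "delete"
            total_rate += regen_rate
            upstream_deleted_cost = regen_cost
    return decisions, total_rate


# Illustrative values only; they are not taken from the paper's simulations.
chain = [
    Dataset("d1", gen_cost=10.0, storage_rate=0.5, uses_per_day=0.01),
    Dataset("d2", gen_cost=2.0, storage_rate=0.05, uses_per_day=0.1),
    Dataset("d3", gen_cost=1.0, storage_rate=1.0, uses_per_day=0.02),
]
decisions, total_rate = plan_storage(chain)
print(decisions, round(total_rate, 4))
```

In this sketch a dataset is stored only when keeping it is cheaper per day than regenerating it on demand, which is the intuition behind storing "appropriate" datasets under a pay-as-you-go model.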
Funding
An Integrated Geophysical Study of the Southern Oklahoma Aulacogen