Making a few strategic architecture decisions up front will save a lot of headaches and money down the road. Begin by asking yourselves a few questions to decide what types of data you need to collect to achieve your objectives.
For example, does the data need to be collected and available in “real time” (where timeliness is essential), or are the observations less time-bound? Should the data be streamed, or is sampling enough? Is the data set so large that it must be sampled? Do you need all of the details contained in the data or just the metadata? The answers to these questions will lead you to a decision on what type of cloud to adopt.
If you’re sampling, using one of the public cloud vendors is probably fine. Smaller volumes of data with an implied lack of urgency lend themselves well to applications housed in one of the “big three” providers. By contrast, if you need to stream data in real time (large data sets, time-bound applications), you should find a local private cloud provider in a true Tier III data center to host the data for a number of reasons.
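As a rough illustration, the rule of thumb above can be written down as a small decision helper. This is only a sketch: the names `DataProfile` and `suggest_hosting` are hypothetical, and the three yes/no fields are a simplification of the questions asked earlier, not a complete assessment.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Answers to the up-front questions (hypothetical structure)."""
    real_time: bool     # is timeliness essential?
    streamed: bool      # continuous stream (True) or periodic samples (False)
    large_volume: bool  # is the data set large?


def suggest_hosting(profile: DataProfile) -> str:
    """Encode the article's rule of thumb; anything in between needs a human."""
    if profile.streamed and profile.real_time:
        # Time-bound streaming: the article suggests a local private cloud.
        return "private cloud"
    if not (profile.streamed or profile.real_time or profile.large_volume):
        # Small, sampled, non-urgent data: a public cloud is probably fine.
        return "public cloud"
    # Mixed answers are not covered by the rule of thumb.
    return "needs review"


if __name__ == "__main__":
    print(suggest_hosting(DataProfile(real_time=True, streamed=True, large_volume=True)))
    print(suggest_hosting(DataProfile(real_time=False, streamed=False, large_volume=False)))
```

Mixed cases, such as large sampled data sets, deliberately fall through to "needs review" because the rule of thumb does not settle them.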
Streaming data needs to be rapidly available and, more importantly, delivered without interruption, so application uptime is critical for data integrity. Latency (the time it takes the data to arrive at the destination database) matters for the same reason, so if the source of your data is in Idaho and your database is in Atlanta, that’s probably not a good thing. Streaming data also invariably creates a very large database, which will ultimately become too costly and unwieldy to maintain in the public cloud and will probably exceed the capabilities of the tools used to extract value from the data.
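To put a rough number on the distance problem, the sketch below estimates the best-case propagation delay between two sites. The assumptions are mine, not the article's: Boise stands in for "Idaho", signals travel through fiber at roughly 200,000 km/s (about two thirds of the speed of light), and the cable follows the great-circle path. Real routes are longer and add switching and queuing delay, so treat the result as a lower bound.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumed signal speed in optical fiber, roughly two thirds of c.
FIBER_SPEED_KM_PER_S = 200_000.0


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_delay_ms(distance_km: float) -> float:
    """Best-case one-way propagation delay over fiber, in milliseconds."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1000.0


if __name__ == "__main__":
    # Approximate city-centre coordinates (assumed for illustration).
    boise = (43.615, -116.202)
    atlanta = (33.749, -84.388)
    distance = great_circle_km(*boise, *atlanta)
    print(f"distance: {distance:.0f} km")
    print(f"one-way delay (lower bound): {one_way_delay_ms(distance):.1f} ms")
    print(f"round trip (lower bound): {2 * one_way_delay_ms(distance):.1f} ms")
```

Every acknowledged write pays at least the round-trip figure, which is why keeping the database close to the data source matters for time-bound applications.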
It is advisable to use a free, unrestricted (libre) open source database. Otherwise things will get very expensive (licensing fees can be horrific), you may not be able to migrate the database later on, and you could potentially lose the rights to that data. A local, private cloud operator will be able to make recommendations and help design a platform that is cost-effective, vendor-neutral and application-appropriate.
The decisions regarding which cloud and which database to use are foundational to the application for other, equally strategic reasons. Starting to collect data in one particular public cloud will tie you to that cloud going forward. You might be comfortable with that philosophically, but if the tools within that cloud are changed, discontinued or outgrown, you have a real problem because your data isn’t portable, and you may lose your hard-won data entirely. Similarly, the cost of the bandwidth needed could exceed the available budget.
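One practical way to test portability is to confirm early on that you can dump your data into a vendor-neutral format. The sketch below uses SQLite and CSV purely because both ship with Python's standard library; it is not a recommendation of SQLite for streaming workloads, and the `readings` table and the `export_table_to_csv` helper are hypothetical.

```python
import csv
import io
import sqlite3
from typing import TextIO


def export_table_to_csv(conn: sqlite3.Connection, table: str, out: TextIO) -> int:
    """Write one table to CSV (header row first) and return the row count."""
    # The table name is interpolated into SQL, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, ts INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [("s1", 1, 20.5), ("s2", 2, 21.0)],
    )
    buffer = io.StringIO()
    exported = export_table_to_csv(conn, "readings", buffer)
    print(f"exported {exported} rows")
    print(buffer.getvalue())
```

If an export like this is impossible or prohibitively slow on the platform you are evaluating, that is an early warning that the data will be hard to move later.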
Finally, if you don’t have access to a network specialist, be sure to find one for your project. Ensuring a secure, reliable data flow means understanding what sort of on-site collection device you will use, how much bandwidth you need and how to design in redundancy, all based on the application.
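A first-pass bandwidth estimate is simple arithmetic and is worth doing before talking to a specialist. The sketch below is a back-of-the-envelope calculation only: the function names, the device counts, the message sizes and the default headroom factor of 2 are all hypothetical placeholders, and protocol overhead is ignored.

```python
def required_bandwidth_mbps(
    devices: int,
    messages_per_second: float,
    bytes_per_message: int,
    headroom: float = 2.0,
) -> float:
    """Sustained payload bandwidth in megabits per second, times a safety factor."""
    bytes_per_second = devices * messages_per_second * bytes_per_message
    return bytes_per_second * 8 / 1_000_000 * headroom


def monthly_volume_tb(
    devices: int,
    messages_per_second: float,
    bytes_per_message: int,
    days: int = 30,
) -> float:
    """Raw payload collected over a month, in terabytes (10**12 bytes)."""
    bytes_per_second = devices * messages_per_second * bytes_per_message
    return bytes_per_second * 86_400 * days / 1e12


if __name__ == "__main__":
    # Hypothetical deployment: 1,000 devices, 10 messages/s each, 200 bytes/message.
    print(f"bandwidth: {required_bandwidth_mbps(1000, 10, 200):.1f} Mbit/s")
    print(f"monthly volume: {monthly_volume_tb(1000, 10, 200):.3f} TB")
```

The monthly figure feeds directly into the storage and bandwidth cost questions raised earlier, while the per-second figure is the starting point for sizing links and any redundant paths.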