Here are the guidelines on how best to approach VM sizing for memory:
* During ad hoc analysis, the user sees green/yellow/red icons that indicate memory utilization on the backend. When the icon is yellow or red, clicking it lists temporary tables that can be deleted to free up memory. You can, of course, also delete any other tables you will not reference going forward; tables can always be recreated later if needed.
* Ad hoc work may be analysis or modeling. We cannot predict what resources the user will consume (that is a function of the complexity of the operations performed) or how they will reference previous tables; this is the very nature of interactive analysis and design. Hence, memory management during ad hoc work is left to the user.
* Xcalar Design provides hints to help the user manage memory.
* Running out of memory does not crash the system. Xcalar is robust; it simply prompts you to clean up your tables.
* When a dataflow is run in batch fashion (OLAP style), we can handle all memory management ourselves. Because we know the entire query/model/dataflow end-to-end, we can optimize the graph and manage resources much as an optimizer in Oracle or SQL Server would.
* Memory consumption is a function of the complexity of the operations performed on your initial datasets. Modeling is done entirely in memory for any number of rows, up to a trillion; it does not use disk or flash because that would hurt interactivity.
* Batch jobs use the full memory hierarchy (DRAM, flash, disk) and need far less memory than modeling. With less than the optimal amount of memory, a job will page more to disk but will still complete.
* We recommend provisioning 4-8X the size of the dataset being processed as memory for ad hoc modeling work; for batch jobs, 3X should suffice (see the sizing sketch after this list).
* Whether doing ad hoc work or running batch jobs, doubling the memory should speed most jobs up by close to 2X, because Xcalar demonstrates near-linear scaling, including for relational compute. This is a good way to decide how many VMs you need for a given dataset size, the complexity of the processing, and the level of interactivity you want. The more complex the processing (say, matrix multiplication), the greater the demands on dynamic memory.
* Xcalar scales with metadata, not data. The primary factor determining memory consumption is the complexity of the algorithms: you might double your dataset yet need only 20% more memory, depending on what you are doing. This means that with Xcalar you use far fewer resources and get results faster and more efficiently.
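
To make the rules of thumb above concrete, here is a minimal sizing sketch in Python. The 4-8X ad hoc and 3X batch multipliers come from the guidelines above; the per-VM memory and the example dataset size are illustrative assumptions, not Xcalar defaults.

```python
import math

# Rule-of-thumb memory multipliers from the guidelines above.
# Ad hoc modeling: 4-8X the dataset size; batch jobs: ~3X.
MULTIPLIERS = {
    "adhoc_low": 4,   # lighter interactive work
    "adhoc_high": 8,  # complex interactive work (e.g. matrix math)
    "batch": 3,       # optimized end-to-end dataflows
}

def cluster_memory_gb(dataset_gb, workload):
    """Recommended total cluster memory for a dataset and workload type."""
    return dataset_gb * MULTIPLIERS[workload]

def vm_count(dataset_gb, workload, memory_per_vm_gb):
    """Number of VMs needed, given per-VM memory (an assumed parameter)."""
    return math.ceil(cluster_memory_gb(dataset_gb, workload) / memory_per_vm_gb)

if __name__ == "__main__":
    dataset_gb = 500         # illustrative dataset size
    memory_per_vm_gb = 256   # illustrative VM size, not an Xcalar default
    for workload in ("adhoc_low", "adhoc_high", "batch"):
        total = cluster_memory_gb(dataset_gb, workload)
        vms = vm_count(dataset_gb, workload, memory_per_vm_gb)
        print(f"{workload}: {total} GB total -> {vms} x {memory_per_vm_gb} GB VMs")
```

Because scaling is close to linear, doubling the VM count (and hence the memory) should roughly halve runtime for most jobs, so you can also trade per-VM size against VM count depending on the interactivity you want.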