Great question! We commonly have customers modeling in the 10-30 GB dataset size range, but that's not close to any sort of "limit". Xcalar Design was architected to handle much larger dataset sizes.
The answer is a little more nuanced than that: a few admin configurations in Xcalar govern this, and with extremely large datasets you may also want to add more nodes to your cluster.
Depending on your expected dataset size, there are some cluster configuration changes you may need your Xcalar Admin to make:
MaxInteractiveDatasetSize – This setting silently caps the size of your dataset; anything larger is sampled down to this limit. Best practice is to increase it from the default of 10 GB to a practical limit based on your group's workloads. When our customers operationalize their models by applying them to multi-TB datasets, it is common to work with a sample larger than 10 GB.
DatasetPercentOfXdb – This setting governs the maximum percentage of your cluster's RAM that can be used for datasets versus modeling tables. When Xcalar is the only app on your cluster nodes, usually 70% of total memory is earmarked for the combination of modeling and datasets; DatasetPercentOfXdb sets the maximum share of that memory that can go to datasets, and it defaults to 70%. Since 70% of 70% is 49%, if your dataset is larger than 49% of your cluster's total memory, you will at a minimum need to increase this value. In general, though, we would recommend scaling out your cluster instead: best practice is to have 4-8x as much memory as your dataset size (see the sketch below).
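To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The 70% Xcalar memory share and the 70% DatasetPercentOfXdb default are the values described above; the cluster size in the example is hypothetical.

```python
# Back-of-the-envelope sizing check based on the percentages above.
XCALAR_MEMORY_SHARE = 0.70      # portion of node RAM earmarked for Xcalar (from this post)
DATASET_PERCENT_OF_XDB = 0.70   # DatasetPercentOfXdb default (from this post)

def max_in_memory_dataset_gb(total_cluster_ram_gb: float) -> float:
    """Ceiling on dataset size before DatasetPercentOfXdb must be raised."""
    return total_cluster_ram_gb * XCALAR_MEMORY_SHARE * DATASET_PERCENT_OF_XDB

def recommended_cluster_ram_gb(dataset_gb: float):
    """Best-practice 4-8x memory-to-dataset range described above."""
    return (4 * dataset_gb, 8 * dataset_gb)

# Example: a hypothetical 4-node cluster with 128 GB per node.
total_ram = 4 * 128  # 512 GB
print(max_in_memory_dataset_gb(total_ram))   # 250.88 GB, i.e. 49% of 512 GB
print(recommended_cluster_ram_gb(100))       # (400, 800) GB for a 100 GB dataset
```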
We also recommend you split your dataset into multiple files before import, to take full advantage of Xcalar Design's parallelized loading.
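As an illustration, here is a minimal Python sketch of pre-splitting a large CSV into numbered chunks before import. The file names, chunk size, and CSV format are hypothetical; any equivalent split tool works just as well.

```python
# Split one large CSV into numbered chunks, repeating the header in each,
# so the files can be loaded in parallel.
import csv

ROWS_PER_CHUNK = 1_000_000  # hypothetical chunk size; tune for your data

def split_csv(src_path: str, dest_prefix: str) -> None:
    """Write src_path out as dest_prefix_0000.csv, dest_prefix_0001.csv, ..."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk_idx, rows = 0, []
        for row in reader:
            rows.append(row)
            if len(rows) >= ROWS_PER_CHUNK:
                _write_chunk(dest_prefix, chunk_idx, header, rows)
                chunk_idx, rows = chunk_idx + 1, []
        if rows:  # flush the final partial chunk
            _write_chunk(dest_prefix, chunk_idx, header, rows)

def _write_chunk(prefix, idx, header, rows):
    with open(f"{prefix}_{idx:04d}.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

split_csv("transactions.csv", "transactions_part")  # hypothetical file names
```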
Did that answer your question?