To import data evenly and in parallel across all cores:
- Create a new UDF by entering the following code in the UDF editor:
from faker import Faker
import json

# Faker is the fake data generator used in this example;
# we only need to instantiate it once.
fake = Faker()

def importUdf(inp, ins):
    # For the Generated data target:
    # - inp is the number of rows you want to generate; it is set to the
    #   number you enter in the Data Source Path
    # - ins is a stringified JSON with two fields: numRows and startRow
    inpStruct = json.loads(ins.read())
    # Number of rows that this XPU needs to generate
    numRows = inpStruct["numRows"]
    # Starting id for this particular XPU
    offset = inpStruct["startRow"]
    for rid in range(numRows):
        rowNum = offset + rid
        yield {"uniqueId": rowNum,
               "address": fake.address()}
This UDF generates fake rows as an example; you can instead write your own UDF that reads your data file and yields one row at a time from that file, as in the sketch below.
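For instance, here is a minimal sketch of such a UDF for a CSV file. It assumes the same (inp, ins) signature as above and that ins.read() returns the file contents, as it does for the Generated data target; the function name importCsvUdf and the decoding details are illustrative assumptions, so adapt them to the actual stream contract of your data target.

import csv
import io

def importCsvUdf(inp, ins):
    # Assumption: ins.read() returns the file contents, as in the
    # generated-data example above; decode if the target returns bytes.
    data = ins.read()
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    # Parse the CSV text and yield one dictionary per record;
    # field names come from the CSV header row.
    for rowNum, row in enumerate(csv.DictReader(io.StringIO(data))):
        row["uniqueId"] = rowNum
        yield row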
- Create a new Data Target by selecting the Generated data target type.
- Import the data:
a. Click Import Data Source.
b. Select the newly created Generated data target and enter the number of rows you want to import. Click Next. You will see a table showing the starting row and the number of rows that each core in your Xcalar cluster will import (the sketch after these steps illustrates how such a split could be computed).
c. Enter a dataset/table name. Select Custom Format, the name of the UDF module you just created, and the UDF function name. Click Create Dataset/Table. This starts the import process by invoking the UDF module once on each core of each node.
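The table shown in step b reflects how the total row count is divided across cores. As a rough illustration only (Xcalar computes this split internally, and its actual algorithm may differ), the per-core startRow/numRows pairs could be derived with a hypothetical helper like split_rows below:

# Illustrative sketch (not Xcalar's actual code) of splitting a total row
# count evenly across cores, producing the startRow/numRows pairs that each
# core's UDF invocation receives via the ins JSON.
def split_rows(totalRows, numCores):
    base, extra = divmod(totalRows, numCores)
    plan = []
    start = 0
    for core in range(numCores):
        # The first `extra` cores take one additional row so the split stays even.
        numRows = base + (1 if core < extra else 0)
        plan.append({"startRow": start, "numRows": numRows})
        start += numRows
    return plan

# Example: 10 rows across 4 cores ->
# [{'startRow': 0, 'numRows': 3}, {'startRow': 3, 'numRows': 3},
#  {'startRow': 6, 'numRows': 2}, {'startRow': 8, 'numRows': 2}]
print(split_rows(10, 4))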