Hi again @JaneW,
You ask a tough question, primarily because you didn't give me a specific unstructured data example to start from. Working with unstructured data isn't as cut and dried as working with structured or semi-structured data. Because there's no set structure to the data, we need to:
- Refine unstructured data sources to pull out analyzable chunks.
- Apply machine learning or other situation-specific algorithms to create data.
- Incorporate those new pieces of data into your other analyses.
Let's say that your unstructured data source is a small text file.
First, you can refine unstructured data sources in Xcalar Design (XD). During the data source import process, Xcalar Design recognizes that the file is a text format and guesses at the appropriate import parameters. You can also choose custom field and record delimiters. What gets imported as a unit depends on the delimiter you pick and on the content of the text file:
- space – Each word or character string.
- \n – Each line or paragraph.
- \t – Each paragraph.
- :, |, – Presumably not prose.
Depending on the file's contents, taking an educated guess at the delimiter can pay off big. Once the data is in Xcalar Design, you can split the content further by using the Map operation to apply additional delimiters through the split function.
Note: If you are unsure whether your data source file uses quote characters in a disciplined fashion, do not apply a quote character.
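To make the effect of the delimiter choice concrete outside of Xcalar Design, here is a minimal plain-Python sketch. It is not Xcalar's import code or its split function; `chunk_text` and the sample string are hypothetical names made up for illustration. It only shows how the same text yields different chunks depending on the record and field delimiters.

```python
def chunk_text(text, record_delim="\n", field_delim=" "):
    """Split raw text into records, then split each record into fields.

    Hypothetical helper for illustration only; not an Xcalar API.
    Empty records and empty fields are dropped.
    """
    records = [r for r in text.split(record_delim) if r.strip()]
    return [[f for f in r.split(field_delim) if f] for r in records]


sample = "the quick brown fox\njumps over\n\nthe lazy dog"

# Newline as record delimiter, space as field delimiter: one row per line,
# one field per word.
rows_by_word = chunk_text(sample)

# Same records, but with a field delimiter that never occurs in the text:
# each line stays in one piece.
rows_by_line = chunk_text(sample, field_delim="|")
```

With prose like the sample, a space delimiter gives you word-level chunks, while a delimiter that never occurs in the text leaves each line whole.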
Once the content is properly "chunked up", the data is either semi-structured already, or ready to be processed by ML or by another off-the-shelf or custom algorithm. Usually, a customer will apply one or more user-defined functions (UDFs) to perform custom analysis. For text, this could be sentiment analysis, NLP to identify atoms of data within the text, or a variety of other algorithms.
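As a rough idea of what such a custom function might look like, here is a deliberately naive, lexicon-based sentiment scorer in plain Python. Treat it as a sketch under assumptions: the word lists and the name `sentiment_score` are invented for this example, it is not a real sentiment model, and how you register a function like this as a UDF depends on your Xcalar setup.

```python
import re

# Tiny, made-up word lists for illustration; a real analysis would use a
# proper lexicon or a trained model.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment_score(text):
    """Return (# positive words) - (# negative words) for one text chunk."""
    if not text:
        return 0
    words = re.findall(r"[a-z']+", text.lower())
    # Each word contributes +1 if positive, -1 if negative, otherwise 0.
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


score_pos = sentiment_score("I love this, it is great")
score_neg = sentiment_score("Terrible service, bad food")
score_neutral = sentiment_score("The parcel arrived")
```

Applied to each chunk, a function like this turns free text into a numeric column that you can filter, sort, or aggregate on.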
The resulting data is semi-structured or structured data. Modeling will allow you to pull out facets of the data, which can then be used to refine other structured data, or can be refined further by your structured data.
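To picture how derived facets can feed back into structured data, here is a small plain-Python sketch that attaches facets to existing records by a shared key. Everything in it is hypothetical (the `orders` rows, the `facets` mapping, and the `enrich` helper); in Xcalar Design you would typically do the equivalent with a join rather than hand-written code.

```python
# Hypothetical structured records, e.g. rows from an orders table.
orders = [
    {"order_id": 1, "customer": "A", "total": 40.0},
    {"order_id": 2, "customer": "B", "total": 15.5},
]

# Hypothetical facets derived from unstructured text, keyed by order_id.
facets = {
    1: {"sentiment": 2, "topic": "delivery"},
}


def enrich(rows, facets_by_key, key):
    """Return copies of rows with any matching facets merged in.

    Rows without a matching facet entry are returned unchanged.
    """
    return [{**row, **facets_by_key.get(row[key], {})} for row in rows]


enriched = enrich(orders, facets, "order_id")
```

Rows with matching text pick up the new columns; rows without any simply keep their original fields, which behaves much like a left join.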
I'd be glad to get feedback on this post, and I welcome folks to post their own challenges working with unstructured data!