This article will highlight best practices for data ingestion. Data can be ingested into the Stratifyd platform in a variety of ways:
- Local file upload
- Data Connector(s)
- API Request
Flattened Data
With any of these methods, Stratifyd will automatically flatten the data structure so that you can analyze the information. Not all sources will contain the same level of information. The “flattening” process is our attempt to produce all fields for a given level of data.
For example, consider the Zendesk data connector:
- Our Zendesk connector pulls Zendesk ticket data:
- Top-level = Ticket (which contains the following levels)
- Sub-level: Comments (there can be multiple comments per ticket)
- Sub-level: Account (there can be account information for the related ticket)
- Sub-level: Satisfaction scoring (there can be survey results based on the ticket)
- When all of the relevant sub-level information is pulled in, you may see different column names:
- Example: “comments.body” - this is the body of the verbatim for the comment sub-level.
- Example: “id” - this is the ID for the ticket top-level (notice there is no “.” separating field names).
The updated JSON parser will condense the data to two levels, so the data underneath the following will be captured (see the sketch after this list):
- Ticket.Comments.X
- Ticket.Account.X
- Ticket.Satisfaction.X
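To illustrate the idea, here is a minimal Python sketch of two-level flattening. It is not Stratifyd's actual parser; the sample ticket structure and field names are hypothetical, modeled on the Zendesk example above.

```python
# Minimal sketch of two-level flattening (not Stratifyd's actual parser).
# The ticket below is a hypothetical example based on the Zendesk
# ticket -> comments/account/satisfaction layout described above.

def flatten_two_levels(record):
    """Flatten a nested dict into dotted column names, keeping two levels."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Sub-level object: prefix its fields with the parent key.
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            # Sub-level list (e.g. multiple comments): flatten each item.
            flat[key] = [
                {f"{key}.{k}": v for k, v in item.items()} if isinstance(item, dict) else item
                for item in value
            ]
        else:
            # Top-level scalar, e.g. the ticket "id" -- no "." in the column name.
            flat[key] = value
    return flat


ticket = {
    "id": 12345,
    "comments": [{"body": "First reply"}, {"body": "Second reply"}],
    "account": {"name": "Acme Co."},
    "satisfaction_rating": {"score": "good"},
}

print(flatten_two_levels(ticket))
# {'id': 12345,
#  'comments': [{'comments.body': 'First reply'}, {'comments.body': 'Second reply'}],
#  'account.name': 'Acme Co.',
#  'satisfaction_rating.score': 'good'}
```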
If malformed data structures exist, it’s necessary to clean them out before ingesting into our platform. Connectors will clean the data automatically, but for flat files, sub-fields containing different levels of blank data can cause issues.
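One rough way to spot this in a flat file before upload is to scan for dotted sub-field columns that are blank in some rows but populated in others. This is a hedged sketch, not an official pre-flight check; the file name and the dotted-column convention are assumptions based on the flattening example above.

```python
import csv
from collections import Counter

# Rough pre-upload check for a flat CSV export (file name is hypothetical).
# Counts how often each dotted sub-field column is blank so you can spot
# columns that are only sporadically populated before ingesting.
with open("tickets_export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

blank_counts = Counter()
for row in rows:
    for column, value in row.items():
        if "." in column and (value is None or value.strip() == ""):
            blank_counts[column] += 1

for column, blanks in blank_counts.most_common():
    print(f"{column}: blank in {blanks} of {len(rows)} rows")
```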
Caching
In addition to data flattening, our system will cache data during the ingestion process. This means we will store the data in memory so our platform can call on the data quickly. The caching process will take time to complete and is dependent on the following:
- Width of the data (how many columns)
- Length of the data (how many records)
- Size of the verbatims
- Size of the metadata
- Number of analyses per stream
- Analysis type:
- NLU is the most computationally expensive
- For Taxonomies, the more levels and patterns there are, the more computationally expensive the analysis becomes
Data Processing
Take the amount and complexity of data into consideration to estimate how long it will take to process.
Example:
- Speech data (large verbatims + many columns) @ 10,000 records with NLU and a Taxonomy could take ~45 minutes to reprocess
- Product reviews (small verbatims + few columns) @ 100,000 records with NLU and a Taxonomy could take ~45 minutes to process
Tips
- When selecting data connectors, make sure you have a good understanding of their functionality:
- Not all connectors will return data at the same rate
- When using the file upload:
- DO NOT exceed 200 columns and DO NOT exceed the requested stream record count in the table from the previous section
- Excel data has no concept of parent structure; everything is flat, unlike JSON data.
- Inspect docs before importing into our platform
- You may wish to preview your data in a JSON Beautifier by pasting in the text file, or inspect it locally with the sketch after this list.
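If you prefer to preview and sanity-check a file locally rather than pasting it into an online tool, a short script can pretty-print the JSON and count columns and records before upload. This is a hedged sketch, not a Stratifyd utility; the file names are hypothetical, the 200-column check reflects the guidance above, and any record-count limit should come from the table referenced in the previous section.

```python
import csv
import json

# Hedged pre-upload sketch: preview a JSON export and check a flat CSV
# against the 200-column guidance above. File names are hypothetical.
MAX_COLUMNS = 200

# Pretty-print nested JSON locally (an alternative to a JSON Beautifier).
with open("tickets_export.json", encoding="utf-8") as f:
    data = json.load(f)
preview = data[:3] if isinstance(data, list) else data
print(json.dumps(preview, indent=2)[:2000])

# Check the flat file's dimensions before uploading.
with open("tickets_export.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    record_count = sum(1 for _ in reader)

print(f"{len(header)} columns, {record_count} records")
if len(header) > MAX_COLUMNS:
    print("Warning: more than 200 columns -- trim the file before uploading.")
```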
Further questions?
We're here to help! Don't hesitate to reach out.