Design principles and guiding forces for achieving unified analytics across distributed data sources can vary. I thought it would be a good idea to write down some of my thoughts on where it makes sense to bring those sources together and what the trade-offs are in doing so. In this multi-part series we will explore some approaches and then analyse which parameters need to be measured in order to pick one.
The common goal we are always pursuing is some combination of the following:
- Optimize Query performance
- Common Query Language
- Central data model for business analysts
- Fast access to data insights
Analyst Use Cases
While there are many ways of looking at analytics, I generally classify things into two buckets to keep it simple.
Report or Dashboard View
Non-Dashboard View
- Combine data to generate insights, or to do data science work that drives marketing behavior
- Process for pulling data together in the lake to analyze and create marketing pipelines.
- Online / offline share, share on complaints and sample orders
- Combining touchpoints: is a customer searching on the site, then contacting customer care, and then placing an order via e-mail? If so, for what kind of product?
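The touchpoint question can be sketched in a few lines of Python. The event records, the channel names and the `journeys` and `products_for_path` helpers are all hypothetical, made up for illustration; in practice these rows would come from site search logs, the customer care system and the order mailbox once they sit in one store.

```python
from collections import defaultdict

# Hypothetical touchpoint events collected from three separate source systems.
events = [
    {"customer": "c1", "ts": 3, "channel": "email_order", "product": "sample kit"},
    {"customer": "c1", "ts": 1, "channel": "site_search", "product": None},
    {"customer": "c2", "ts": 1, "channel": "site_search", "product": None},
    {"customer": "c1", "ts": 2, "channel": "customer_care", "product": None},
]

# The journey we are looking for, in order.
TARGET_PATH = ["site_search", "customer_care", "email_order"]


def journeys(rows):
    """Group events per customer, ordered by timestamp."""
    by_customer = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["ts"]):
        by_customer[row["customer"]].append(row)
    return dict(by_customer)


def products_for_path(rows, path=TARGET_PATH):
    """Return the ordered product for customers whose journey matches the path exactly."""
    result = {}
    for customer, steps in journeys(rows).items():
        if [step["channel"] for step in steps] == path:
            result[customer] = steps[-1]["product"]
    return result


print(products_for_path(events))  # {'c1': 'sample kit'}
```

The exact-match rule is a deliberate simplification; a real analysis would probably allow extra steps in between and a time window.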
Single Physical Data Store Approach
This approach requires you to house all your data in one place and follows the paradigm of building analytics on top of a single warehouse technology.
Data Ingestion Approach
The data ingestion approach relies on either custom connectors or an ELT architecture to get the data into your central data store.
Custom Connectors
This is a very traditional approach in which in-house developers write batch-mode extractors in whatever programming language they prefer, so building and deploying extractors is highly developer-centric. It lacks a standard extraction architecture and does not follow a template-based connector paradigm. Of the options discussed here, it offers the least flexibility and agility for getting data into your store. Custom connectors essentially serve as hand-built ELT pipelines and need continuous upkeep.
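To make this concrete, a hand-rolled batch extractor often looks something like the sketch below. Everything in it is hypothetical (the `fetch_orders` source stub, the watermark file name, the CSV layout); the point is only that each such script invents its own conventions for state, schema and output, which is where the upkeep comes from.

```python
import csv
import json
import os

WATERMARK_FILE = "orders_watermark.json"  # hypothetical, script-specific state file


def fetch_orders(since_id):
    """Stand-in for a source API call; returns rows with an id greater than since_id."""
    source = [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": 25.5},
        {"id": 3, "amount": 7.0},
    ]
    return [row for row in source if row["id"] > since_id]


def load_watermark(path=WATERMARK_FILE):
    """Read the last extracted id, or 0 on the first run."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["last_id"]


def run_batch(out_path, watermark_path=WATERMARK_FILE):
    """Extract new rows, append them to a CSV file, and advance the watermark."""
    rows = fetch_orders(load_watermark(watermark_path))
    if not rows:
        return 0
    write_header = not os.path.exists(out_path)
    with open(out_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "amount"])
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    with open(watermark_path, "w") as fh:
        json.dump({"last_id": rows[-1]["id"]}, fh)
    return len(rows)
```

Calling `run_batch("orders.csv")` twice in a row extracts three rows the first time and none the second, because the watermark has moved on.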
Standard ELT Pipelines
The industry offers many standard ELT pipelines; these platforms standardize the architecture and provide a wide variety of connectors. Two of the most popular ELT platforms are Stitch and Fivetran.
There are other options as well; I am not contesting that, only trying to convey the pipeline flow and what can be achieved with it.
Stitch
It has been on the market for quite some time and has now been acquired by Talend. It is a blend of the following traits:
- Provides standard connectors certified by Stitch, 90+ of them
- Provides a standard Tap-Target architecture, which is open source. Read more about it at singer.io
- Offers schematization in standard as well as open architecture development
- Has limited coverage of Google-related connectors and meta information
- You can control historical import of information
- Fosters open source development
- Great community support
- Has a good user experience
- It is now backed by a world leader in data pipeline architecture
Skillset requirements
As a Data Analyst you can work with Stitch easily. If the connector you need does not exist, you can develop one in Python using the standards-based approach offered by Singer and get it certified by Stitch. Using the Stitch Import API in conjunction with Lambda functions also allows you to send data to Stitch.
Stitch Approaches Summarized
Using Stitch’s Standard Connectors
Stitch supports more than 90 integrations
Using Stitch’s Import API
If building and maintaining a script to retrieve and push data from Google Search Console is feasible for you or your team, Stitch’s Import API integration can be used as a receiving point for JSON or TRANSIT posts that would then be loaded to the destination warehouse connected to your Stitch account.
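A minimal sketch of what such a script could prepare is shown below. The endpoint URL and the body fields (`table_name`, `schema`, `key_names`, `messages` with `action`, `sequence` and `data`) follow my reading of the Import API batch format and should be checked against the current Stitch documentation before use. The table name, the rows and the `build_batch` and `build_request` helpers are hypothetical.

```python
import json
import time
import urllib.request

# Assumed endpoint; confirm against the current Stitch Import API docs.
BATCH_URL = "https://api.stitchdata.com/v2/import/batch"


def build_batch(table_name, rows, key_names, schema_properties):
    """Build one batch body in which every row becomes an upsert message."""
    base_sequence = int(time.time() * 1000)
    return {
        "table_name": table_name,
        "schema": {"properties": schema_properties},
        "key_names": key_names,
        "messages": [
            # Increasing sequence numbers tell the receiver which version is newest.
            {"action": "upsert", "sequence": base_sequence + i, "data": row}
            for i, row in enumerate(rows)
        ],
    }


def build_request(batch, token):
    """Prepare (but do not send) the HTTP POST that carries the batch."""
    return urllib.request.Request(
        BATCH_URL,
        data=json.dumps(batch).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical Search Console style rows.
rows = [{"page": "/home", "clicks": 12}, {"page": "/pricing", "clicks": 4}]
batch = build_batch(
    "search_console_pages",
    rows,
    key_names=["page"],
    schema_properties={"page": {"type": "string"}, "clicks": {"type": "integer"}},
)
request = build_request(batch, token="YOUR_IMPORT_API_TOKEN")
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
```

The same two functions could run inside a Lambda function on a schedule, which is the combination mentioned under skillset requirements.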
Singer Development
Stitch also works with integrations built as part of its open source project, Singer. Singer includes a specification that allows integrations to be built by anyone, and the Stitch team supports members of the community who want to contribute code to the project via the Singer Slack group.
Any community-built integration can also be submitted by its creator to the Stitch team for review and potential inclusion in Stitch as a Community Integration. Otherwise, integrations can be run within your own infrastructure with the Stitch target and directed toward an Import API integration within your account.
If you or a member of your team is interested in building a Singer integration, for Google Search Console or otherwise, the Singer getting started guide is a good place to begin, and development-focused questions can be taken to the Singer Slack group.
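The core of the Singer specification is small: a tap writes JSON messages, one per line, to standard output, and a target reads them. The sketch below emits the three message types (`SCHEMA`, `RECORD`, `STATE`) for a hypothetical `pages` stream. The stream name, the fields and the shape of the state value are made up for illustration, and a real tap would also handle configuration and incoming state.

```python
import io
import json
import sys


def emit(message, out=sys.stdout):
    """Write one Singer message as a single JSON line."""
    out.write(json.dumps(message) + "\n")


def run_tap(rows, out=sys.stdout):
    """Emit a SCHEMA message, one RECORD per row, then a STATE bookmark."""
    emit(
        {
            "type": "SCHEMA",
            "stream": "pages",
            "schema": {
                "properties": {
                    "page": {"type": "string"},
                    "clicks": {"type": "integer"},
                }
            },
            "key_properties": ["page"],
        },
        out,
    )
    for row in rows:
        emit({"type": "RECORD", "stream": "pages", "record": row}, out)
    # The state value is free-form; this bookmark shape is an assumption.
    emit({"type": "STATE", "value": {"pages_synced": len(rows)}}, out)


# Capture the output in memory instead of stdout so it can be inspected.
buffer = io.StringIO()
run_tap([{"page": "/home", "clicks": 12}, {"page": "/pricing", "clicks": 4}], buffer)
messages = [json.loads(line) for line in buffer.getvalue().splitlines()]
print([m["type"] for m in messages])  # ['SCHEMA', 'RECORD', 'RECORD', 'STATE']
```

In a real setup the tap writes to stdout and is piped into a target, for example the Stitch target mentioned above.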
Sponsored Development
If this is especially time-sensitive or building an in-house solution isn’t feasible for your team, Stitch’s Enterprise plan can include arrangements for the development of custom integrations that are built to ensure your requirements are met.
Typical Architecture
Fivetran
This is a new-age ELT pipeline platform focused on bringing rich schematization and a large connector base to its users. It is a blend of the following traits:
- Provides very large connector base that covers almost all tools available
- Is continuously evolving
- Offers rich Schematization
- Boasts of handling very large datasets efficiently
- Is highly optimized for Snowflake
- Comes with a multiple-destination architecture
- Provides an event stream ingestion architecture
- API-driven access is available but still evolving
- Has in-depth coverage of Google-related data stores
Skillset requirements
As a Data Analyst you can work with Fivetran easily. It is touted as a data-analyst-friendly platform. Developers can get involved through its cloud functions architecture, but this is not an open source standard; you need to define and architect your functions according to the needs of the Fivetran platform.
Using Fivetran’s Standard Connectors
Leverage 100+ connectors from Fivetran.
Using Fivetran’s Cloud Function Architecture
Here you write a cloud function (for example on AWS Lambda, Google Cloud Functions or Azure Functions), register it with Fivetran, and Fivetran then invokes it as a connector on each sync.
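A function-based connector of this kind is, as far as I understand it, a plain handler that receives the previously saved state and returns new rows together with an updated state. The response keys used below (`state`, `insert`, `schema`, `hasMore`) reflect my reading of Fivetran's function connector format and should be verified against the current documentation; the `pages` table, the source rows and the `last_id` cursor are hypothetical.

```python
# Hypothetical source rows; a real function would call the source API here.
SOURCE_ROWS = [
    {"id": 1, "page": "/home", "clicks": 12},
    {"id": 2, "page": "/pricing", "clicks": 4},
    {"id": 3, "page": "/docs", "clicks": 9},
]


def handler(request, context=None):
    """Return rows newer than the cursor in the incoming state, plus the new state."""
    cursor = (request.get("state") or {}).get("last_id", 0)
    new_rows = [row for row in SOURCE_ROWS if row["id"] > cursor]
    next_cursor = new_rows[-1]["id"] if new_rows else cursor
    return {
        "state": {"last_id": next_cursor},             # saved and sent back on the next sync
        "insert": {"pages": new_rows},                 # rows to upsert, keyed by table name
        "schema": {"pages": {"primary_key": ["id"]}},  # primary key per table
        "hasMore": False,                              # no further pages in this sync
    }


# Simulate two consecutive syncs: the second one starts from the saved state.
first = handler({"state": {}, "secrets": {}})
second = handler({"state": first["state"], "secrets": {}})
print(len(first["insert"]["pages"]), len(second["insert"]["pages"]))  # 3 0
```

The contrast with Singer is that this contract is specific to one vendor, which is the trade-off noted under skillset requirements.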
Sponsored Development
This is possible under an enterprise contract.
Typical Architecture
Summary
I have explained the data centralization approaches above; in the next part I will talk about virtual data warehouse architecture and the benefits it might bring.