
Scaling Analytics With Agility Part II

This is the last post in a two-part series in which I have tried to explain approaches to achieving agility with data. If you have not already gone through Part I, then follow this link.

As a reminder, what we are trying to achieve by adopting any one approach, or a hybrid of both, is as follows:

  • Optimize Query performance 
  • Common Query Language 
  • Central data model for business analysts
  • Fast access to data insights 

Part I of this series helped us understand the Single Physical Data Store approach, and now we are going to talk about the Logical Data Store approach.

Logical Data Store Approach

In this approach we do not load the data into a single store; instead, we hand off more directly to data analysts, who construct logical views or data models across the various data sources without lifting and shifting the data. Logical data models still need to be built, but to a large extent this removes the need for developers to get involved up front in the process.
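To make this concrete, below is a minimal sketch using Trino as one example of a federation engine (my choice for illustration, not something this approach prescribes; the host, catalog, schema, and table names are all hypothetical). The join spans two physically separate sources, yet no data is copied beforehand:

```python
# Hedged sketch: one logical model joining two physically separate sources
# in place -- an operational Postgres database and raw events in a data
# lake (Hive) -- via a Trino federation layer. No data is moved beforehand.
import trino

conn = trino.dbapi.connect(
    host="trino.internal",  # hypothetical federation endpoint
    port=8080,
    user="analyst",
)
cur = conn.cursor()

cur.execute("""
    SELECT o.customer_id, o.total, e.last_seen
    FROM postgres.sales.orders AS o
    JOIN hive.events.web_sessions AS e
      ON o.customer_id = e.customer_id
""")
print(cur.fetchall())
```

An analyst can publish such a query as a view in the federation layer, which is exactly the "centralize the model, not the data" idea discussed below.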


At the end of the day, the Single Physical Data Store architecture does place some constraints on agility, and this is what the logical data warehouse architecture is looking to address.

Typical Architecture

The main theme here is that we are centralizing the data models as opposed to the data itself.

Let us now summarize the approaches across both major themes to achieve agility:

Considerations for the Single Physical Data Store Approach

Pros

  • Brings data to one place and then uses the store to do transformations

  • Takes an approach where the lake contains all relevant information in its raw state, ingested on a continuous basis, to cater to multiple personas

  • If used in conjunction with an ELT architecture, it provides a fine balance between the developer and analyst communities. The schematization of raw data is helpful and allows analysts to create logical data models post-transformation within the store

  • The extent of development required depends on the choice of ELT infrastructure adopted

  • It is not a hard choice or decision for the CTO's organization, and in essence you may still achieve quite a lot with fewer engineering resources

Cons

  • It is dependent on the architecture the teams followed in bringing data to a single store, implying that if a custom connector architecture or an ETL approach has been adopted with the wrong choices, the friction in getting data into the store will remain very high
  • Storage of data and connectivity to the DWH will determine the price of bringing it all together, along with other investments to standardize the ingestion pipeline architecture

Considerations for the Logical Data Store Approach

Pros

  • It centralizes the data modelling and not the actual raw data store
  • It centralizes the modeled data for BI exposure
  • It provides for a more self-service BI architecture

Cons

  • Requires organizational maturity and the right kind of skill set to operate this kind of infrastructure
  • At what size of organization should this be recommended?
  • How much help would be required for multiple businesses to become self-serve on this model?
  • The CTO organization can make this choice, but would need DataOps to work alongside BI to create and enable the data models that let you operate and leverage its power, or else this can get reduced to being just another ELT infra that may not justify its deployment

Summary

Through this mini-series, one should get a general idea of the various methods by which agility can be achieved to unlock the golden joins (as I call them) that drive maximum value for the organization and provide data when it is needed most.

In my view, in order to make a choice, try to introspect and define a maturity index for the following five parameters:

  • Analyst Org
  • Engineering Org
  • Current DWH infrastructure
  • Budget 
  • Data set sizes 

In addition to this, also be reminded that a hybrid approach will always be an option if the organization is quite large and centralization to serve all the personas might not fit through one working model or the other.

 

Scaling Analytics With Agility Part I

Design principles and guiding forces for achieving unified analytics in a world of distributed data sources can vary. I thought it might be a good idea to digitize some of my thoughts on where it makes sense to bring the data together and what the trade-offs are in doing so. In this multi-part series we will explore some approaches and then analyse what parameters need to be measured in order to pick one.

The common goal we are always searching for is geared towards any one, or a combination, of the following:

  • Optimize Query performance 
  • Common Query Language 
  • Central data model for business analysts
  • Fast access to data insights 

Analyst Use Cases

While you can have many ways of looking at analytics, I generally tend to classify things into two buckets to keep it simple.

Report or Dashboard View


Non-Dashboard View

  • Combine data to generate insights, or to do data-science activities that drive marketing behavior
  • Processes for pulling data together in the lake to analyze and create marketing pipelines
  • Online/offline share, share of complaints and sample orders
  • Combining touchpoints => is the customer searching on site, then contacting customer care and placing an order via e-mail => for what kind of product?

Single Physical Data Store Approach

This approach requires you to house all your data in one place and follows the paradigm of building analytics on top of a single warehouse technology.

Data Ingestion Approach

The data ingestion approach is driven by adopting either custom connectors or an ELT architecture that allows you to get the data into your central data store.

Custom Connectors

This is a very traditional approach where developers work internally, using any programming language, to run batch-mode extractors; it is a highly developer-centric approach to developing and deploying extractors. It lacks extraction architecture standards and does not follow a template-based connector architecture paradigm. This approach comes with the least flexibility and agility in ingesting data into your store. The custom connectors basically serve as ELT pipelines and are prone to continuous upkeep.
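To illustrate why upkeep becomes a burden, here is a rough sketch of what such a hand-rolled batch extractor typically looks like (the API endpoint and staging path are hypothetical); every script like this is one more pipeline the developers must maintain:

```python
# A minimal, hypothetical batch extractor: pull records from a REST API
# and stage them as CSV for a separate warehouse COPY/LOAD step.
import csv
import requests

API_URL = "https://api.example.com/v1/orders"   # hypothetical source endpoint
STAGING_FILE = "/tmp/orders.csv"                # hypothetical staging location

def extract(api_key):
    """Page through the source API and collect raw records."""
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            params={"page": page},
            headers={"Authorization": f"Bearer {api_key}"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return records
        records.extend(batch)
        page += 1

def stage(records):
    """Write records to CSV; a downstream job would COPY this into the DWH."""
    if not records:
        return
    with open(STAGING_FILE, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(records[0]))
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    stage(extract(api_key="..."))
```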

Standard ELT Pipelines

The industry offers many standard ELT pipelines, and these platforms take a standardized architectural approach to providing a wide variety of connectors. The two most popular ELT platforms are Stitch and FiveTran.

There would be more and other ways; I am not contesting that, but simply trying to convey the pipeline flow and what can be achieved.

Stitch

It has been around in the market for quite some time and has now been acquired by Talend. It is a blend of the following traits:

  • Provides standard connectors certified by Stitch; there are around 90+ of these
  • Provides a standard tap-target architecture which is open source; read more about it at singer.io
  • Offers schematization in both the standard and open-architecture development paths
  • Has limited exposure to Google-related connectors and meta information
  • You can control the historical import of information
  • Fosters open-source development
  • Great community support
  • Has a good user experience
  • It is now backed by a world leader in data pipeline architecture

Skillset Requirements

As a data analyst you can deal with Stitch easily, while if you do not have a connector you can develop one with your Python skillset, using the standards-based approach offered by Singer, and get it certified by Stitch. Using the Stitch Import API in conjunction with Lambda functions also allows you to send data to Stitch.
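For a feel of the Singer side, here is a minimal tap sketch using the singer-python library; the stream name, schema, and record are made up for illustration:

```python
# Minimal Singer tap sketch: emit a SCHEMA message followed by RECORD
# messages on stdout; a target (e.g. target-stitch) consumes this stream.
import singer

# Hypothetical stream definition for illustration.
schema = {
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "updated_at": {"type": "string", "format": "date-time"},
    }
}

singer.write_schema(stream_name="users", schema=schema, key_properties=["id"])
singer.write_records("users", [
    {"id": 1, "email": "ada@example.com", "updated_at": "2020-01-01T00:00:00Z"},
])
```

In practice a tap is piped into a target on the command line (for example, a tap piped into target-stitch with a config file), which is also how integrations can run inside your own infrastructure.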

Stitch Approaches Summarized
Using Stitch’s Standard Connectors

Stitch supports more than 90 integrations

Using Stitch’s Import API

If building and maintaining a script to retrieve and push data from Google Search Console is feasible for you or your team, Stitch's Import API integration can be used as a receiving point for JSON or Transit posts, which are then loaded into the destination warehouse connected to your Stitch account.
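As a hedged sketch of that push (the endpoint and payload shape below reflect my reading of the Import API's batch endpoint; verify against the current documentation before relying on it), sending one batch of records looks roughly like this:

```python
# Hedged sketch: push one batch of records to the Stitch Import API.
import requests

STITCH_TOKEN = "..."  # your Import API access token
BATCH_URL = "https://api.stitchdata.com/v2/import/batch"  # assumed endpoint

payload = {
    "table_name": "search_console_clicks",  # hypothetical destination table
    "schema": {"properties": {
        "page": {"type": "string"},
        "clicks": {"type": "integer"},
    }},
    "key_names": ["page"],
    "messages": [
        {"action": "upsert", "sequence": 1,
         "data": {"page": "/home", "clicks": 42}},
    ],
}

resp = requests.post(
    BATCH_URL,
    json=payload,
    headers={"Authorization": f"Bearer {STITCH_TOKEN}"},
)
resp.raise_for_status()
```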

Singer Development

Stitch also works with integrations built as part of its open-source project, Singer. Singer includes a specification that allows integrations to be built by anyone, and the Stitch team provides support to members of the community who want to contribute code to the project via the Singer Slack group.

Any community-built integration can also be submitted by its creator to the Stitch team for review and potential inclusion in Stitch as a Community Integration. Otherwise, integrations can be run within your own infrastructure with the Stitch target and directed toward an Import API integration within your account.

If you or a member of your team is interested in building a Singer integration, for Google Search Console or otherwise, I would recommend checking out the Singer getting-started guide and bringing any development-focused questions to the Singer Slack group.

Sponsored Development

If this is especially time-sensitive or building an in-house solution isn’t feasible for your team, Stitch’s Enterprise plan can include arrangements for the development of custom integrations that are built to ensure your requirements are met.

Typical Architecture


FiveTran

This is a new-age ELT pipeline platform focused on bringing rich schematization and a large connector base to its users. It is a blend of the following traits:

  • Provides a very large connector base that covers almost all tools available
  • Is continuously evolving
  • Offers rich schematization
  • Boasts of handling very large datasets optimally
  • Is highly optimized for Snowflake
  • Comes with a multiple-destinations architecture
  • Provides for an event-stream ingestion architecture
  • An API-driven offering is available but still evolving
  • Has in-depth exposure to Google-related data stores

Skillset Requirements

As a data analyst you can deal with Fivetran easily. It is touted as a more analyst-friendly platform, and while developers can get involved using its cloud function architecture, this is not considered an open-source standard; you need to define and architect it per the needs of the FiveTran platform.

Using FiveTran’s Standard Connectors

Leverage 100+ connectors from FiveTran

Using FiveTran’s Cloud Function Architecture

This is achieved using the cloud function architecture: you deploy cloud functions (for example on AWS Lambda or Google Cloud Functions) and make that connector available for FiveTran to consume.
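As a hedged sketch of the contract (based on my reading of FiveTran's function connector documentation; verify the exact request and response shape for your cloud provider), a Google Cloud Function connector returns the rows to insert, the updated cursor state, and a hasMore flag:

```python
# Hedged sketch of a FiveTran-style Google Cloud Function connector.
# FiveTran POSTs {"state": ..., "secrets": ...}; the function replies with
# rows to insert, the updated cursor state, and whether more data remains.
import json

def handler(request):
    """HTTP Cloud Function entry point (receives a Flask request object)."""
    body = request.get_json(silent=True) or {}
    cursor = body.get("state", {}).get("cursor", 0)

    # Hypothetical extraction step: fetch rows newer than the cursor.
    rows = [{"id": cursor + 1, "amount": 9.99}]

    response = {
        "state": {"cursor": cursor + 1},                # persisted by FiveTran
        "insert": {"orders": rows},                     # table -> new rows
        "schema": {"orders": {"primary_key": ["id"]}},  # table metadata
        "hasMore": False,                               # no further pages
    }
    return json.dumps(response), 200, {"Content-Type": "application/json"}
```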

Sponsored Development

This is possible under an enterprise contract.

Typical Architecture


Summary

I have explained the data centralization approaches using the above facts; in the next part I will continue with the virtual data warehouse architecture and what kind of benefits it might entail.