IBM StreamSets
IBM Software
External reviews
External reviews are not included in the AWS star rating for the product.
Useful for data transformation and helps with column encryption
What is our primary use case?
We use StreamSets for data transformation rather than for full ETL processes. It focuses on transforming data directly from the sources and does not handle the extraction part of the process. The transformed data is then loaded into Amazon Redshift or another data warehousing solution.
What is most valuable?
The best thing about StreamSets is its plugins, which are very useful and work well with almost every data source. It's also easy to use, especially if you're comfortable with SQL. You can customize it to do what you need. Many other tools have started to use features similar to those introduced by StreamSets, like automated workflows that are easy to set up.
What needs improvement?
We often faced problems, especially with SAP ERP. Many of its columns weren't integers or primary keys, which StreamSets couldn't handle, so we had to restructure our data tables, which was painful. Pipeline failures were also common, and data drift wasn't addressed, which made things worse. Licensing was another issue we encountered.
For how long have I used the solution?
I have been working with the product for five years.
What do I think about the scalability of the solution?
The tool's flexibility and performance are good. It supports task dependency management, so a failure in one task does not affect the others. It can handle large volumes of data and supports features like change data capture for tracking changes.
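To make the idea of change data capture concrete, here is a minimal sketch in plain Python. It is not how StreamSets implements it (production CDC usually reads a database's change log); it assumes two in-memory snapshots of a table and a hypothetical key column, and simply classifies rows as inserted, updated, or deleted.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def diff_snapshots(old: List[Row], new: List[Row], key: str) -> Dict[str, List[Row]]:
    """Compare two snapshots of a table and classify the changed rows.

    `key` is the name of a column that uniquely identifies a row
    (a hypothetical primary key chosen for this illustration).
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows whose key only exists in the new snapshot.
    inserts = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Rows present in both snapshots whose contents differ.
    updates = [
        row for k, row in new_by_key.items()
        if k in old_by_key and old_by_key[k] != row
    ]
    # Rows whose key only exists in the old snapshot.
    deletes = [row for k, row in old_by_key.items() if k not in new_by_key]

    return {"insert": inserts, "update": updates, "delete": deletes}
```

Comparing full snapshots is simple but reads every row each time, which is why log-based capture is usually preferred for large tables.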
Around six months ago, many people in my company were using StreamSets. In the US team, about 42 people across different projects were using it. Similarly, in 2021, there were around 43 users. About 16-18 people in Mumbai used it in my previous company.
How are customer service and support?
The tool's support is good.
How was the initial setup?
Installing StreamSets can take time because it comes in two versions: Data Collector and Transformer. Data Collector is easier to install, but Transformer is more complicated and requires more steps, like setting up tasks and configurations.
It is best to make sure the environment is ready first, including that it works well with your other servers. The process can be both easy and difficult, but if you follow the documentation, it should be manageable.
What was our ROI?
Whether the tool is worth the money depends on the situation. If you don't want to spend a lot on competing products like Databricks or Glue, then StreamSets might be a better option. It's particularly valuable if you prefer not to invest heavily in training your team on new technologies. If your ETL developers or data engineers are comfortable with StreamSets, it can be worth the money.
What's my experience with pricing, setup cost, and licensing?
The licensing is expensive, and there are other costs involved too. I know from using the software that you have to buy new features whenever there are new updates, which I don't really like. But initially, it was very good.
What other advice do I have?
We use various tools and alerting systems to notify us of pipeline errors or failures. StreamSets supports data governance and compliance by allowing us to encrypt incoming data based on specified rules. We can easily encrypt columns by providing the column name and hash key.
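As a rough illustration of protecting columns by name with a key, the sketch below applies a keyed hash (HMAC-SHA256) to selected fields of a record. This is an assumption about the general technique, not StreamSets' own API, and the function and field names are hypothetical. Note that a keyed hash is one-way: unlike reversible encryption, the original value cannot be recovered from the output.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable


def hash_columns(record: Dict[str, Any], columns: Iterable[str], key: bytes) -> Dict[str, Any]:
    """Return a copy of `record` with the named columns replaced by HMAC-SHA256 digests.

    Columns that are missing or None are left untouched.
    """
    masked = dict(record)  # shallow copy so the caller's record is not modified
    for name in columns:
        value = masked.get(name)
        if value is not None:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            masked[name] = digest.hexdigest()
    return masked
```

Because the same key and value always produce the same digest, hashed columns can still be used for joins and deduplication downstream.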
If you're considering using StreamSets for the first time, I would advise first understanding why you want to use it and how it will benefit you. If you're dealing with change tracking or handling large amounts of data, it could be cost-effective compared to services like Amazon. It's easy to schedule and manage tasks with the tool, and you can enhance your skills as an ETL developer. You can easily migrate traditional pipelines built on platforms like Informatica or Talend to StreamSets. I rate the overall solution an eight out of ten.
Real-time data processing
StreamSets makes data pipelining seamless
A good ETL tool for real-time data streaming
Good
StreamSets review and ratings
Best Enterprise Grade Modern Data Integration Platform
1. JDBC to ADLS data transfer based on source refresh frequency.
2. Kafka to GCS.
3. Kafka to Azure Event Hub.
4. HDFS to ADLS data transfer.
5. Schema generation to generate Avro.
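The last item, generating an Avro schema, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the StreamSets stage itself: it infers a flat Avro record schema from a single sample record, maps only a few primitive Python types, and uses hypothetical function names.

```python
import json
from typing import Any, Dict


def avro_type(value: Any) -> str:
    """Map a Python value to an Avro primitive type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, bytes):
        return "bytes"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported type for this sketch: {type(value).__name__}")


def infer_avro_schema(name: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Build an Avro record schema (as a dict) from one sample record."""
    fields = [{"name": field, "type": avro_type(value)} for field, value in record.items()]
    return {"type": "record", "name": name, "fields": fields}


if __name__ == "__main__":
    sample = {"id": 1, "amount": 9.5, "paid": True, "note": "first order"}
    print(json.dumps(infer_avro_schema("Order", sample), indent=2))
```

A schema inferred from one record is only a starting point; nullable or nested fields would need union and complex types that this sketch leaves out.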
The easy-to-use design canvas, job scheduling, fragment creation and reuse, and the wide range of built-in stages make it an even more favorable tool for me to design data engineering pipelines.