Databricks: a perfect data platform for Python users
What do you like best about the product?
The UI is built with ML and Python users in mind; it is very intuitive to use.
What do you dislike about the product?
The speed of processing is slow and could be improved
What problems is the product solving and how is that benefiting you?
Easy integration with Python notebooks and big data. The pipeline became much more efficient.
Ahead of the competition in building data ecosystems, but needs to improve ease-of-use
What is our primary use case?
I worked with Databricks fairly recently, and it was part of the specific design and architectural process we were running.
We have used the solution for the overall data foundation ecosystem for processing and storage on a Delta format. We have also seen use cases where we were trying to establish advanced analytics models and data sharing where we leverage the Delta Sharing capabilities from Databricks.
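As a minimal sketch of the Delta side of that use case, assuming a Databricks notebook where a `spark` session is already available; the paths, table, and column names below are hypothetical:

```python
# Hedged sketch: landing raw data and persisting it in the Delta format from a
# notebook. `spark` is pre-configured on Databricks clusters; names are made up.
from pyspark.sql import functions as F

raw = spark.read.json("/mnt/landing/events/")          # hypothetical source path

cleaned = (raw
           .withColumn("event_date", F.to_date("event_ts"))
           .dropDuplicates(["event_id"]))

# Persist as a Delta table so downstream analytics (or Delta Sharing) can use it.
cleaned.write.format("delta").mode("overwrite").saveAsTable("analytics.events")

# Read it back like any other table.
spark.table("analytics.events").groupBy("event_date").count().show()
```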
What is most valuable?
A very valuable feature is the data processing, and the solution is specifically good at using the Spark ecosystem.
What needs improvement?
There are some aspects of Databricks, like generative AI, where they are positioning things like Dolly. They're a little bit late to the game, but I think there are some things they are working on. Their generative AI is catching up, as are areas like data governance and enterprise features. Hence, these are places where Databricks has to be faster, and even though they are fast, I'm not sure how they'll catch up and get adopted, because there are strong players in the market.
Databricks is coming up with a few good things in terms of integration, but I have to put forward one point that covers multiple aspects: ease of use for the end user operating this particular tool. For example, a tool like ADS gives you GUI-based development, which is good for the end user who does development or maintenance. Given the complexities of data integration, a GUI might not be easy, but Databricks should embrace something on the graphical development front, because it is currently notebook-driven.

Also, in terms of accessing the data, Databricks has an SQL interface, similar to earlier tools like SQL Server Management Studio. Since people are mostly comfortable with SSMS already, Databricks could build integrations with known tools for data access, and that would also help, apart from what they're doing. I would like to see improvements with respect to user enablement, which is an important part of an enterprise strategy.

I would also like to see integration with a broader ecosystem of products. If you have to do data governance in tools like Microsoft Purview, it's manual and difficult. I'm unsure whether that momentum has to come from Databricks or Microsoft, but it would be good if Databricks had open interfaces to share metadata, which could then be viewed in data governance tools like Collibra, Purview, or Informatica. The improvement has to do with user enablement and metadata integration with other tools.
For how long have I used the solution?
I've worked with Databricks for over five or six years, but it's been on and off.
What do I think about the scalability of the solution?
The solution is scalable. In this particular ecosystem, there is no one else who can catch up with Databricks for now.
How are customer service and support?
Databricks' customer support is very good. They have a lot of ways in which they interact with vendors and service partners across the globe. They have periodic touch-base sessions with vendors, where their engineers answer your questions.
How was the initial setup?
The implementation is not challenging because the solution integrates well with the platforms on which they are established, whether it's Azure, AWS, or GCP. The solution is not difficult to set up, but you'd probably need a technical user to operate it.
It's the same story with maintenance, where you'd need a technically proficient person with programming knowledge to maintain it.
What other advice do I have?
Databricks integrates many enterprise processes because data processing and AI/ML are a small part of a larger ecosystem. Databricks has been part of other platforms, and now they are trying to establish their own platform, which is a good direction.
Most of the capabilities of the underlying platform can be leveraged there. Even if Databricks lacks some capability, or you're not comfortable with a certain feature in it, the setup isn't difficult because it integrates well with the underlying platform. For example, with scheduling, let's say you are uncomfortable with its workflow management. You can utilize integrations with EDA or any other tool and probably perform scheduling there. Even if what you're trying to do is not easy, it is enabled through integration. Either they build a required feature into their tool later on, like a GUI, or you perform integrations to make the features possible.
We did evaluate licensing costs, but it had more to do with the Azure ecosystem pricing since whatever we are doing has more to do with Azure Databricks. Many optimizations are recommended, but we haven't exercised those for now. But considering that the processing is a bit more efficient, the overall price won't be much different from what it could be for any other similar component or technology. We haven't had specific discussions with Databricks' folks on pricing.
My advice to users who would like to start working with Databricks is that it is a good solution to work with for data integration and machine learning. Databricks is maturing for other use cases, so there are two points to be considered. One is that you need to evaluate how they will mature, which will be on a case-to-case basis. Second, how will it align with the overall platform story? There will be many overlapping aspects over there as Databricks expands its capabilities. In that case, it must be considered that if those capabilities overlap, how will the underlying platform vendors handle it? How would that interplay happen if many of Databricks' new capabilities align with Microsoft Fabric? That has to be very carefully considered. Otherwise, if you utilize those new capabilities, there might be a discontinuity where you cannot use Databricks because the platform does not support that.
If I specifically talk about Spark-based processing transformations, the data integration story, and advanced stability, I would rate Databricks around eight out of ten. However, with respect to new capabilities like cataloging, data governance, and security integration, I rate Databricks around five because it has to establish these features. And since Databricks integrates with platforms, we must see the interplay with the platforms' capabilities.
Overall, I rate Databricks seven out of ten.
Using the Databricks Lakehouse Platform to Manage Data in a Flexible and Cost Efficient Way
What do you like best about the product?
The Databricks Lakehouse Platform makes it easy to manage data governance and handle my data in a flexible and cost-efficient way.
What do you dislike about the product?
The hardest part about using the platform has been the learning curve. But there are training materials available and once you get comfortable, it's great!
What problems is the product solving and how is that benefiting you?
Databricks Lakehouse Platform helps me enable business intelligence and machine learning on my datasets.
Ease of use for EDA and modeling, with room to improve efficiency
What do you like best about the product?
Easy to work with and create derived EDA
What do you dislike about the product?
Version control integration, cluster management/failures, DBR migrations
What problems is the product solving and how is that benefiting you?
Makes it easy for everyone to work with and share the same data to build systems, but it comes at the cost of idle time
Databricks Lakehouse is awesome!
What do you like best about the product?
Databricks offers a full-stack solution for any data engineer (MLflow, Apache Spark, Delta Lake). One can use Databricks for all data needs.
What do you dislike about the product?
Can't wait for Lakehouse IQ to launch, which will make it easier for non-technical individuals to use the platform. Currently, it requires some knowledge of data.
What problems is the product solving and how is that benefiting you?
Delta Lake allows different types of data.
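One way to read that flexibility, as a hedged sketch: a Delta table can absorb differently shaped batches through schema evolution. The table and column names below are hypothetical, and `spark` is assumed to be the session provided in a Databricks notebook:

```python
# Hedged sketch: appending a batch with an extra column to an existing Delta
# table, letting Delta Lake evolve the schema instead of rejecting the write.
batch = spark.createDataFrame(
    [("u1", "click", "mobile")],
    ["user_id", "action", "device"],           # "device" is a new column
)

(batch.write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")           # allow the table schema to grow
      .saveAsTable("events.activity"))
```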
Data + AI Summit
What do you like best about the product?
Scaling applications with Databricks, Hugging Face with marketplace agents.
What do you dislike about the product?
Needs to be more user-friendly and engaging.
What problems is the product solving and how is that benefiting you?
A data-sharing journey through the modern data stack.
Great Data Handling & Management platform.
What do you like best about the product?
This tool supports almost all types of data.
It even supports partially arranged/aligned data.
It provides better data handling and management (like a unified view, graphical views, and comparison graphs), which helps in better decision-making.
What do you dislike about the product?
You need a specialized consultant to set up the tool because adapting it is complex.
Pricing is high compared to similar tools in the market.
What problems is the product solving and how is that benefiting you?
1) To build a unified portal for a unified view of the data: coming from legacy applications with data in different places, we used this tool to create a unified portal for the data (a rough sketch follows after this list).
2) To integrate with AI and machine learning tools: having the data unified through this tool helps us integrate with other tools (but this required some deep technical expertise).
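As a rough sketch of the unification point above, assuming a notebook with the provided `spark` session; the legacy paths, columns, and target table are placeholders:

```python
# Hedged sketch: pulling customer data from two legacy locations into one
# unified Delta table that a portal or downstream ML tools can query.
crm = spark.read.option("header", "true").csv("/mnt/legacy/crm_customers.csv")
erp = spark.read.parquet("/mnt/legacy/erp_customers/")

unified = (crm.select("customer_id", "name", "email")
              .unionByName(erp.select("customer_id", "name", "email")))

unified.write.format("delta").mode("overwrite").saveAsTable("portal.customers")
```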
Great platform for working collaboratively
What do you like best about the product?
- Ability to edit the same notebook with collaborators
- GitLab compatibility
- Multiple languages supported
- Broad functionality allows most of our digital teams to use it for their own needs
- Spark compute is fast and the number of processors on a cluster is clear
What do you dislike about the product?
- UI is constantly changing, and changes are not announced with any leadup
- UI can be buggy - WebSocket disconnects, login timeouts, copy/pasting into incorrect cells
- Pricing structure is a little opaque - DBUs don't have a clear dollar-to-time amount
- Notebook structure isn't perfect for production engineering, better for ML or ad-hoc operations
What problems is the product solving and how is that benefiting you?
- Maintains access to all of our business data on both AWS and Azure, and can switch between those platforms
- Has an interface for data scientists, engineers, and business users and prevents needing to buy additional tools
- Allows big data applications to run without having to do much Spark configuration
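As a hedged illustration of that last point: inside a Databricks notebook the `spark` session and cluster sizing are already handled, so a job can be little more than the query itself. The paths and table names below are hypothetical:

```python
# Hedged sketch: no SparkSession construction or executor tuning in the
# notebook itself; the attached cluster supplies the configured `spark`.
sales = spark.read.parquet("/mnt/datalake/sales/")     # hypothetical path

daily = (sales.groupBy("store_id", "sale_date")
              .sum("amount")
              .withColumnRenamed("sum(amount)", "total_amount"))

daily.write.format("delta").mode("overwrite").saveAsTable("reporting.daily_sales")
```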
Built to accelerate development
What do you like best about the product?
I have been using Databricks for almost 4 years, and it has been a great asset to our development as a team and to our product.
Shared folders of re-usable and tracked notebooks allow us to work on tasks only once, minimising duplication of work, which in turn accelerates the development cycle.
One of my personal favourites is Workflows, which allowed us to automate a variety of tasks, freeing up capacity for us to focus on the right problems at the right time (a rough sketch of scheduling a job this way follows below).
Another great selling point for me is that collaborators can see each other typing and highlighting live.
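For context on the Workflows point above, a hedged sketch of what scheduling a notebook as a recurring job can look like through the Jobs REST API; the payload shape is approximate, and the workspace URL, token, cluster ID, and notebook path are placeholders:

```python
# Hedged sketch: creating a job that runs a notebook nightly via the Databricks
# Jobs API. Field names are approximate; all values are placeholders.
import requests

payload = {
    "name": "nightly-refresh",
    "tasks": [{
        "task_key": "refresh",
        "notebook_task": {"notebook_path": "/Shared/refresh_tables"},
        "existing_cluster_id": "<cluster-id>",
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # 02:00 every day
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())
```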
What do you dislike about the product?
UX could be improved
While I appreciate the addition of new features, developments and experiments, the frequency of changes made it tiring and frustrating for me recently.
Too much, too frequently. The 'new notebook editor' is a great example here. The editor itself could be a very useful change, but changing all the keyboard shortcuts at the same time without letting the user know is questionable to me.
I would prefer it if changes were rolled out less frequently, with detailed patch notes (see Dota 2, for example) and configurable options in the user settings.
E.g. I would use the experimental 'new notebook editor' if I could keep the keyboard shortcuts the same.
Less frequent, more configurable updates please.
One of the biggest pain points for me is the log-in and log-out process. Why does Databricks have to log me out every couple of hours, especially while I am typing in a command cell?
Could this be improved, please?
Also, would love it if libraries on clusters could be updated without having to restart the cluster.
Having said all this, I do love some of the new features, such as the new built-in visualisation tool; however, I would love it even more if titles could be added and adjusted.
What problems is the product solving and how is that benefiting you?
Databricks is used as the core of our research environment.
It is used to provide quick and efficient analysis of whatever question or problem might arise while keeping the production environment safe and undisturbed.
Progressing in the right direction
What do you like best about the product?
Being able to quickly get the environment up and running for any kind of workload. The support for all three languages, and catering to the needs of data engineering and ML.
What do you dislike about the product?
Too many customizations are needed to achieve the right mix of parameterization for optimal performance. On the other hand, Snowflake provides lots of features out of the box without the developer worrying about these things.
What problems is the product solving and how is that benefiting you?
Managing the intermediate layers and data engineering activities like wrangling/mashing/slicing/dicing of the data well. Greater control of the data via data frames.
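As a hedged sketch of the kind of DataFrame-based wrangling described here, assuming a notebook-provided `spark` session; the staging and curated table names and columns are made up:

```python
# Hedged sketch: typical slicing/dicing of an intermediate layer with PySpark
# DataFrames before publishing a curated table. Names are hypothetical.
from pyspark.sql import functions as F

orders = spark.table("staging.orders")

curated = (orders
           .filter(F.col("status") == "COMPLETE")                  # slice
           .withColumn("order_month", F.date_trunc("month", "order_ts"))
           .groupBy("order_month", "region")                       # dice
           .agg(F.sum("amount").alias("revenue"),
                F.countDistinct("customer_id").alias("customers")))

curated.write.format("delta").mode("overwrite").saveAsTable("curated.monthly_revenue")
```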