…and this database is Just Right
I have been fortunate to have created data warehouses in Oracle, Teradata, Netezza, SQL Server and Vertica. (I did a POC with Greenplum when it was called Metapa but it was not quite ready for prime time.)
I have never been able to achieve performance on multi-terabyte databases using SMP (Symmetrical Multi Processing) servers and OLTP databases. In my opinion, the relational/OLTP databases like Oracle, SQLServer, MYSql, etc.. are just the wrong tool for adhoc queries, analytics and reporting of VLDB’s (very large databases). The results I obtained when utilizing MPP have been staggering when compared to the response times of query and analytics of OLTP databases.
There is no theoretical size limit on databases based on MPP architectures. (Massively parallel processing). The more nodes, SPUS, blades, or pizza boxes you have in your cluster, the more capacity your database has and the faster it will perform. With that said, I am pretty confident that Teradata, Netezza and Vertica can scale to the Petabyte size. There seems to be no real inhibiting factor that I know of.
So what does one choose?
I maintained a Teradata environment for about 6 years and have gone under the hood pretty deeply with Netezza and Vertica. I have programmed at the FPGA level for Netezza and had the opportunity to pick the brains of the senior engineers at Vertica for a while. I am pretty confident that at this point, the MPP technologies I mention below should handle the workload.
The question comes down to “What is your budget and business need?”
Teradata
Teradata has a wide range of logical models and applications they can sell along with the WH. Teradata supports transactional processing and ad-hoc queries very nicely. Teradata also has a professional services team which contributes a large portion of their revenue, but all this comes at a premium price.
Netezza
Netezza is a beast of a machine that does full table scans extremely well, their FPGA technology is very impressive, and the zone indexing does a nice job in minimizing I/O. Netezza’s secret sauce is in their FPGA (Field programable gate arrays). Not to long ago, Netezza opened up the FPGA to developers and now give tremendous power by creating FPGA based UDF’s (User defined functions). These UDF’s perform exceptionally well and can put the afterburners on your query and analytics if done correctly. Netezza has proven the test of time and has a pretty impressive client list and is a very viable alternative to Teradata.
Vertica
Vertica is relatively a new comer to the VLDB space and they are definitely the one to watch. I like the technology and the fact that Vertica is led by Stonebraker who is a veteran in the industry. ALso quite notable is Jerry Held (Exec from Oracle and Tandem) who sits on the board. You will need to grasp a slight paradigm shift in the way their database operates but it supports ansi standard SQL and ODBC connections so you shouldn’t have any problems. Vertica’s compression algorithms are extremely clever and when combined with a columnar data store they are breaking records. Vertica will definitely give you an A-HA! moment once you understand how the magic happens.
By the way, Syncsort and Vertica broke the world * TPC record for data loads at a fraction of a price when compared to Netezza and Teradata.
* A clarifying point on the TPC record. This test was not an instance of the TPC-H benchmark but rather a benchmark on load performance based on the TPC-H data produced by DBGEN. The benchmark was audited by a 3rd party, published, and available for others to contest.
Summary
Technically they will all crunch the data and they will all do it MUCH faster than OLTP based databases. Some questions you will need to consider:
-
- Q) Do I need or am I looking at other logical data models and applications to kick start my project?
- Q) Do I want to leverage a professional services team?
- Q) Do I need user defined functions or stored procedures?
- Q) Does purchasing proprietary hardware scare you?
- Q) Do I have a substantial budget or is this a warehouse on a shoestring?
- Q) Is my DBA adaptable?
- Q) Do I have a mixed workload?
- Q) How many users will be hitting the database concurrently?
I know both Vertica and Netezza offers proof of concepts.
TAKE ADVANTAGE OF THE POCS!
One approach you may want to consider is to start with a technology like Vertica. They offer a free trial and it installs easily on commodity hardware. This will provide the lowest cost of ownership and it just may fit your needs. In parallel, work on setting up the POC with Netezza. This may take a little longer because staging a Netezza environment within your walls has many logistics involved. Choosing the right technology is a crucial step in the development of a Very Large Data Warehouse. You should take your time on this critical phase and choose wisely.
« Syncsort SI Certification | Home | Database in the Clouds? »
Comments
This is a false statement. There is no TPC benchmark for data loads. They used the datagen tool for TPC-H, but that is the only association with TPC.
Lucas,
Have you reviewed the Audit report? It is very clear with regard to the Transaction Council’s TPC-H Benchmark that it was not an instance of the TPC-H benchmark but rather an instance of the load performance portion of the benchmark. Section 4.2 Benchmark Definition. The audit report goes into great detail about using DBGEN and ALL the players in the field have the opportunity to do exactly what they did to disprove the claim.
If have an audit report, benchmark, or similar report that you can disprove the Press Release, I would really love to see it. I have not seen any formal rebuttal from TPC, Netezza, Sybase, Teradata, or Paraccel but would welcome the post.
Thanks
Lucas,
I reread your post and went back to the audit report to review section 4.2 a few more times. I believe your point is valid enough that the blog should be clarified. I have adjusted the blog above.
Thanks for pointing it out.
Charlie
The issue is that the Vertica benchmark is not a “TPC record” as you claim. If this were so, it would be listed on the TPC website, http://tpc.org, which it is not.
It may be a ETL world record. It may be a data load record. But is certainly not a TPC record.
Vertica never once mentions “TPC record”. Read their press release, especially take notice to the foot note.
I’m sorry for being picky about this but making inaccurate claims is a pet peeve of mine.
No worries about being picky. This is an open forum and your feedback is welcome.
Thank you for your insight.
Charlie
A few comments on the “ETL World Record”:
1. Who judges this world record and what are the qualifying criteria? Does it just depend on which “independent” analyst is looking at what solution at a particular point in time?
2. Who held this “world record” previously – does Vertica even know or are they guessing? In which case, do they really know they’ve beaten it?
3. The most obvious point. Nice performance over an hour in a controlled lab environment. But what sort of comparable stats can Vertica produce for a very large production environment (100 TB+) over a sustained period (i.e. weeks/months/years) – assuming they have such a production environment. And not the marketing-speak terms of peak performance, but TB/hr in continued, 24/7, mixed workload. There are systems out there loading 2TB/hr in real time into something as mainstream as Oracle 10gR2 (peaking at 5x that in busy hours), so is this really that remarkable or just more hollow marketing bluster from the niche players?
Ash,
You ask some great questions. I am going to do a little research and see what I can find.
What I do know is that this effort involved three core technologies. DMExpress, HP Blade System C-Class, and Vertica. The DMExpress PR is here -> http://www.bcsolution.com/2009/05/syncsort-world-record/ – From what I understand, they were all extremely excited from the results.
I find Syncsort to be and extremely reputable brand. For over 30 years they have provided rock solid technology in the very large data space. I have only worked with them during the past 10 years but remember their advertising campaigns in the late 80’s.
So I believe you will find that Syncsort can stand behind their claims. They have some clients with enormous data deploys that I am sure have run through weekends.
I will see what I can find out.
Charlie
The strange thing is that you forgot about Sybase IQ.
Hi Alex,
I did not forget about Sybase IQ, When I first had the opportunity to come across them, I did not see MPP as being the underlying architecture. At that time, they still had many components of their technology that was shared and could be considered a potential bottleneck. Things may have changed since I did my eval, but at that time, Column Store – Yes, Shared Nothing/MPP – No. As a result, I took them off my VLDB and MPP watch list.
Charlie
Leave a Comment