What do you mean by high-frequency data? 100 Hz, 1 kHz, 100 kHz? For those kinds of use cases many time-series DBs break down. We have customers storing many millions of high-frequency measurements per second in arrays.
I would say Postgres on its own is not very storage efficient for large amounts of data, especially if you need any sort of index. Timescale basically mitigates that by automatically creating new tables in the background ("chunks") and keeping the individual tables small.
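To make the chunking idea concrete, here is a minimal sketch (not TimescaleDB's actual implementation; the chunk-naming scheme and interval are invented for illustration) of how rows can be routed to small per-interval tables based on their timestamp:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: each row lands in a "chunk" (a small table)
# determined by which fixed time interval its timestamp falls into.
CHUNK_INTERVAL = timedelta(days=7)
EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def chunk_for(ts: datetime) -> str:
    """Return the name of the chunk table a timestamp belongs to."""
    idx = int((ts - EPOCH) / CHUNK_INTERVAL)
    start = EPOCH + idx * CHUNK_INTERVAL
    return f"chunk_{start:%Y%m%d}"

print(chunk_for(datetime(2021, 10, 4, tzinfo=timezone.utc)))
```

Because each chunk only covers a narrow time range, indexes on recent chunks stay small enough to remain hot in memory even as total data grows.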
TimescaleDB also implements compression. From the docs:
> When compression is enabled, TimescaleDB converts data stored in many rows into an array. This means that instead of using lots of rows to store the data, it stores the same data in a single row. Because a single row takes up less disk space than many rows, it decreases the amount of disk space required, and can also speed up some queries.
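A toy illustration of that rows-into-arrays idea (this is not TimescaleDB's actual on-disk format; the schema and series names are made up): many narrow rows belonging to one series are collapsed into a single row holding arrays of timestamps and values, which is both smaller and more compressible.

```python
# One row per reading: (series, timestamp, value)
rows = [
    ("cpu", "2021-01-01T00:00:00", 0.42),
    ("cpu", "2021-01-01T00:00:10", 0.40),
    ("cpu", "2021-01-01T00:00:20", 0.47),
]

def compress(rows):
    """Group row-per-reading data into one array-valued row per series."""
    by_series = {}
    for series, ts, value in rows:
        entry = by_series.setdefault(series, {"ts": [], "value": []})
        entry["ts"].append(ts)
        entry["value"].append(value)
    # One wide row per series: (series, [timestamps...], [values...])
    return [(s, cols["ts"], cols["value"]) for s, cols in by_series.items()]

compressed = compress(rows)
print(compressed)  # one row instead of three
```

Packing each column into its own array also means similar values sit next to each other, which is what lets columnar compression schemes (delta, run-length, etc.) work well.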
That article takes various concepts from typical TSDB solutions and seemingly only looks at the bad sides. Time series data has many different forms, not every form works for every TSDB solution.
For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.
Thanks for the mention, and I completely agree :-)
Personally, I think there is a lot in this article that is misguided.
For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.
> TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.
This is a broad generalization. Some time-series databases are better at high cardinality than others. Also, what is "high-cardinality" - 100K? 1M? 10M? (We in fact are designed for _higher cardinalities_ than most other time-series databases [0])
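The "combinatorial explosion" is just the product of tag cardinalities: in a tag-indexed TSDB, every distinct tag combination becomes its own series. A back-of-the-envelope sketch (the tag names and cardinalities below are invented for illustration):

```python
import math

# Worst case: one series per distinct tag combination.
tag_cardinalities = {
    "host": 10_000,            # one value per machine
    "endpoint": 500,
    "status_code": 50,
    "customer_id": 1_000_000,  # the classic high-cardinality culprit
}

worst_case_series = math.prod(tag_cardinalities.values())
print(f"{worst_case_series:,}")  # 250,000,000,000,000 potential series
```

In practice only a fraction of the combinations actually occur, but a single unbounded tag like `customer_id` is enough to multiply the series count by orders of magnitude, which is exactly why some TSDB designs cope with it and others fall over.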
> In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data.
We just launched tracing and metrics support in the same backend - in Promscale, built on TimescaleDB [1]
I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed to educate. But I wish they had done more research - because without it, this article (IMO) ends up confusing more than educating.
How does Timescale (a single-purpose database) hold up against SingleStore (a multi-purpose database)? Of course Timescale is cheaper, but other than that, have you folks compared/contrasted against SingleStore as a TSDB?
TimescaleDB performs quite well. One of our unique insights is that it is quite possible to build a best-in-class time-series database on top of Postgres (although it’s not easy ;-)
There are some challenges with building on Postgres - but what we've been able to do is build innovative capabilities that overcome these challenges (e.g. columnar compression in a row-oriented store, multi-node scale-out).
We also have some exciting things that we are announcing this week. Stay tuned :-)
PS - Where did you find that PDF? Thought we took it down (it was hard to keep it up to date :-) )
> Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution
This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics (as in [1])? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.
In this context I have seen other references to: Sawzall (google), Lingo (google), MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?
> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?
I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there): it would definitely be able to handle 600GB and query across it in an ad-hoc manner.
We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.
It depends. Just inserting, indexing, storing, and simple querying can be done with little memory (e.g. a 1:500 memory-to-disk ratio, i.e. 2GB RAM per 1TB of disk). Typical production clusters with high query load are in the 1:150 range (i.e. 64GB RAM for 10TB of disk).
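Taking those ratios literally (they are rules of thumb from the comment above, not hard limits), the sizing arithmetic looks like this:

```python
# RAM sizing rule of thumb: ratios are expressed as RAM:disk.
def ram_gb_needed(disk_tb: float, ratio: int) -> float:
    """RAM in GB for a given disk size in TB at a 1:ratio memory-disk ratio."""
    return disk_tb * 1000 / ratio

print(ram_gb_needed(1, 500))   # light ingest-only workload: 2.0 GB per TB
print(ram_gb_needed(10, 150))  # heavy query load: ~66.7 GB for 10 TB
```

The quoted "64GB RAM for 10TB disk" production figure works out to roughly 1:156, i.e. comfortably in the stated 1:150 range.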
Interesting, so that'd be about 1 vCPU and 4GB RAM per 625GB of data. That seems very price efficient. Would something like AWS's EBS be sufficient for this? Would you need one of the higher tiers? Or would you be looking at running this on a box with locally attached storage?
Most CrateDB clusters run on cloud provider hardware (Azure, AWS, Alibaba). Using EBS (gp2 or now gp3) is also quite common. Thanks to the indexing/storage engine, general-purpose disks are typically sufficient, and faster disks have little to no advantage.
For longer-timescale time series I still recommend Druid as the go-to, mainly because if you make use of its ahead-of-time aggregations (which you can do for real-time or scaled-out batch ingestion), your ad-hoc queries can execute extremely quickly even over very large datasets.
Druid really only has one downside, which is that it's still a bit of a pain to set up. It's gotten a ton better recently, and I have been contributing changes to make it work better out of the box with common big-data tooling like Avro.
For performance it's the top dog, except for really naive queries that are dominated by scan performance. For those you are best off with Clickhouse; its vectorized query engine is extremely fast for simpler, scan-heavy workloads.
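The ahead-of-time aggregation ("rollup") idea mentioned above can be sketched like this (a toy model, not Druid's actual ingestion spec; event fields and bucket size are invented): raw events are pre-aggregated at ingest into coarse time buckets, so ad-hoc queries scan the small rollup table instead of the raw events.

```python
from collections import defaultdict

raw_events = [
    {"ts": 1000, "endpoint": "/a", "latency_ms": 12},
    {"ts": 1030, "endpoint": "/a", "latency_ms": 20},
    {"ts": 1045, "endpoint": "/b", "latency_ms": 7},
    {"ts": 1070, "endpoint": "/a", "latency_ms": 16},
]

def rollup(events, bucket_s=60):
    """Aggregate raw events into (time_bucket, endpoint) -> count / sum."""
    agg = defaultdict(lambda: {"count": 0, "latency_sum": 0})
    for e in events:
        key = (e["ts"] // bucket_s * bucket_s, e["endpoint"])
        agg[key]["count"] += 1
        agg[key]["latency_sum"] += e["latency_ms"]
    return dict(agg)

table = rollup(raw_events)
# Query time: average latency per endpoint, answered from the rollup
# rather than by scanning the raw events.
avg_a = (sum(v["latency_sum"] for (_, ep), v in table.items() if ep == "/a")
         / sum(v["count"] for (_, ep), v in table.items() if ep == "/a"))
print(avg_a)  # 16.0
```

The trade-off is that you give up per-event detail inside a bucket; in exchange, query cost scales with the number of buckets rather than the number of raw events.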
What are the best books out there to learn about Time Series databases? (there are already a million for relational and graph, but haven't seen one for time series). Bonus on how to implement one
JOINs vs no JOINs isn't an ad-hoc vs. not-ad-hoc thing but more of a schema thing. If you try to jam a star schema into it, you aren't going to have a good time. This is true for pretty much all of these more optimised stores. If you have a star schema and want to do those sorts of queries (and performance or cost aren't your #1 driving factors), then the better tool is a traditional data warehouse like BigQuery.
This probably won't be the case forever, though: there is significant progress in the Presto/Trino world to enable push-down for stores like Druid, which would allow you to source most of your fact-table data from other sources and then join into your events/time-orientated data from Druid very efficiently.
Some of the biggest changes within ES come from Lucene, like a _massive_ reduction in memory footprint, opening ES up to use cases that weren't even possible before.
> If your business model cannot survive when a critical upstream piece of your infrastructure moves to GPL, you probably have a bad business model to begin with.
To be clear, CrateDB started out as OSS and we decided to stay OSS. Elasticsearch used the Apache License and so did CrateDB, all in the spirit of OSS. Elastic, however, are now the ones who decided that their business model isn't viable anymore.
> It sounds like they are making up excuses for not wanting to fully Open Source their code
We do want to make it fully open source! Everything that was under a more restrictive license is going to be offered under the Apache License.
> This begs the question: isn't "a restrictive OSS licence" not less "fully open source" than a more permissive licence like GPL, MIT or BSD?
We're going to change CrateDB fully to the Apache License v2 ;) I would say that counts as a "more permissive" license.
> Is that really only because of some enterprises not liking GPL?
There are various reasons for the change. A big part is definitely the spirit of many of our contributors: we built CrateDB on open source software and also want to make the software available as open source. It had also been planned for quite some time to become more open.
Copyleft licensing was invented in an era when most things were written in C, and software fell into roughly four categories:
• Unix-ish black-box "primitive" tools, that were so focused on accomplishing one fundamental job that they were essentially "final" in their interfaces, with there being no point to extending them any further; where you reused them by executing them, not by integrating with them.
• A library for such Unix-ish black-box tools to use. Most tools that used any libraries at all, would use one main library to accomplish their one primary purpose, effectively making the tool a "driver program" for the library.
• Academic data-science/statistics code.
• Cathedral-style highly-integrated software, e.g. Windows.
Copyleft was mostly devised for the purpose of licensing the codebases of the first two types.
As Unix tools are self-contained, they only "infect" direct forks. The GPL originally intentionally avoided infecting the things that called (i.e. interacted with) those tools — because, back then, a downstream project that "uses" a tool wasn't vendoring in its own version, but rather relying on the system installation of that tool, through that tool's known API; and it wouldn't make sense for a license to be infectious through a standardized API.
It was intentional that libraries would infect their downstream clients with copyleft; but downstream clients, back then, were mostly just those single-purpose tools. It wouldn't make sense for e.g. libgit to be GPL-licensed, but for git(1) to be proprietary.
Of course, there was also an awareness that the Cathedral-style codebases would have their whole monolith infected if they used the GPLed library. The idea there, though, wasn't to actually cause that infection — it was to inhibit Cathedral-style codebases from using GPLed software at all.
(With the parallel awareness that such entities could always reach out to the project maintainers, and buy a separate license, just like you can purchase a license from any IP-holder. There were few-enough contributors per project, back then, that "copyright assignment" and such wasn't a concern; you could just get a proprietary license from the one dude who built the whole thing.)
And that same consideration was implicit with academic use of GPLed software. FOSS programmers, back then, considered academic (or non-profit) integration/derivation of their software to be something they'd grant a free license for if asked; and academics knew this, and so didn't bother to ask for such licenses, because they knew they'd almost assuredly get one, for free, if-and-when it ever became important to do so.
---
The GPL was well-suited to this early-90s software IP ecosystem. It doesn't fit nearly as well in the modern software IP ecosystem.
There's a whole fifth category of software — semi-Cathedral, semi-Bazaar mega-tools, like youtube-dl or Calibre; or mega-libraries, like Qt, or LLVM, or WebKit; which both are components while also consuming many components themselves. The GPL never "expected" this type of software. This kind of software just didn't exist back then; only its entirely-Cathedral final-product equivalent did.
Which is a problem, because it's impossible to build something like WebKit or LLVM in a self-contained, "you call it over a standard interface" sort of way, where it's non-infectious. These days, a lot more projects are infectious, even when "integrated" at a much higher level of abstraction, than the GPL was ever intended to require.
I'm not talking about the GPL or the LGPL specifically, but rather I'm talking about the thing that was in Stallman's head before either of them existed — the concept of copyleft, of an infectious "Free" license.
Yes, you can create different implementations of that concept, that are variously infectious. But the reason I laid out the whole state of the ecosystem as RMS would have seen it when he was still just conceptualizing copyleft, is that in that ecosystem, "infectiousness" was something that's almost trivial, toothless.
In the ecosystem of the early 90s, licensing a codebase was just a consideration of who you trusted to freely use and modify your thing ("us", hackers); vs. who you wanted to not use your thing, unless they paid you ("them", corporate.) Copyleft neatly prevented "them" from swiping and profiting off of the software created by "us", while not really inhibiting anything that "us hackers" wanted to do with that same software.
Contrast to the ecosystem of today: there's an entire category of people — individuals who start projects as hackers or academics, but then build huge software businesses around them — that didn't even exist back in the 90s.
Google is the epitome of this: at its inception, BackRub (Google Search) was exactly the type of project that copyleft was designed to avoid restricting. But it evolved, through commercialization, patenting, SaaS-ification, and scale, into exactly the type of project that copyleft wants to "shun out of" the FOSS ecosystem. (Not that Google Search integrated any FOSS libraries; just that it could have.)
Google's story is the story of most software projects today. Every developer is considering their project as a potential "open core" for a SaaS, or considering having an "enterprise version" of their tool, or considering licensing their algorithm as a plugin for some big studio to redistribute. Which is exactly why many programmers avoid integrating copyleft software into even their hobby projects. Why build on GCC when you can build on LLVM, and ensure that there'll be no legal problems?
Modern copyleft licenses are ever-more-strained legal contortions to make a design that's no longer very applicable to the modern software IP ecosystem, work for it anyway. They're licenses with epicycles.
Sure, I can in fact link AGPLed and LGPLed system libraries in my language runtime; and in a pinch — if there's no equally-good alternative — I'll take the time, work out the precise legal implications, and go ahead with it.
But if there's a BSD or MIT-licensed (or even Apache-licensed) alternative to those libraries? I'll choose that one. Because, in the modern landscape, by doing so, I'm saving myself, my future self, and my future hypothetical SaaS's future hypothetical lawyers, a lot of time and effort.
Re:
> it was to inhibit Cathedral-style codebases from using GPLed software at all.
I'm surprised at the notion that RMS ever had the intention of inhibiting GPL use in Cathedral-style codebases. And I agree with others that big frameworks existed back then also.
Sure, big frameworks existed back then, but they were all either
1. literally "frameworks" in the inversion-of-control sense — where your code is a script that runs "inside" the framework — and you don't ship the framework to your customers as part of your product, but rather walk them through getting it from its own vendor as a prerequisite step to installing your product (e.g. TeX); or
2. proprietary, not copyleft-licensed (e.g. game-console SDKs.)
If anyone has a good counter-example, I'm all ears :)
How does the user benefit from having a phone built on top of open source software if they cannot patch a well-known security vulnerability, because the manufacturer can't be bothered to run a build with the latest upstream version?
> How does the user benefit from having a phone built on top of open source software if they cannot patch a well-known security vulnerability, because the manufacturer can't be bothered to run a build with the latest upstream version?
Well, depending on the incentives and restrictions involved, an ecosystem of 3rd-party builds is a potentially viable escape hatch for the user from the manufacturer's grip.
Of course the sticking point is the degree to which the hardware requires proprietary and opaque binary blobs in order to enable important user-facing features. But then, that isn't anything really new, as open source PC operating systems have been dealing with this issue since forever, with the caveat that PC hardware is mostly modular, so having or swapping in well-supported components is an option, whereas smartphones are an integrated slab of metal, plastic, and glass, with "no user serviceable parts inside" as the status quo.
But even that caveat has precedents, in non-PC devices such as consumer networking gear that only became well supported through aggressive GPL license enforcement actions that freed some of the necessary code.
You guys are missing the point: some company already cornered the market by using open source code and not contributing back their work. They have a head start over everyone who can ever be involved.
That would be a fully GPL-compliant product, which is not what the comment was talking about.
Someone said companies use GPL software, add their business logic or drivers, and never contribute back: e.g. Android phones.
Then someone else said the insane thing that the user benefits.
I pointed out that if there is a security flaw, you CANNOT build/patch, because you do not have all the source (e.g. alternative Android OSes cannot use the camera or radio for lack of kernel drivers).
Your post above doesn't mention GPL compliance, only vendors using ancient versions of open source code. If you don't have the source, of course you can't do anything. So you ask the vendor for source and if they refuse then you contact the Linux kernel community to enforce GPL compliance. At some point the source will come out, even if the vendor has to get sued in order to do it.
Can you clarify why the GPL doesn't work for you. Clearly you release all the source code for your product. Is it that you have downstream customers that make modifications to your source code, and they don't want to have to release their changes for their products?
Yes, we have customers using CrateDB as part of their proprietary product.
Also, the SSPL is so vague that we would probably not only have to release CrateDB itself - which we already do - but also everything we use for the services we provide. We could also never make any kind of deal with OEMs, etc.
- Scalability
CrateDB is built for horizontal scale from the ground up on top of distributed technologies. We have customers using clusters with 80+ nodes in production for many years now.
Timescale just released their multi-node feature in beta, and they follow a different concept than we do. While Timescale uses a leader (access node) / follower (data node) model with a single point of failure, CrateDB is built on a shared-nothing architecture. Many features you would want to see in a distributed system are present in CrateDB and still missing in TS:
- cluster wide replication
- automatic rebalancing
- cluster wide backup
- shared nothing architecture / no single point of failure
- Full Text Search
CrateDB is built on Lucene and parts of ES, and includes search capabilities you would typically need a separate product for when using PG/TS.
- Distributed Query Engine
Yes, PG/TS are fast if you query "small" amounts of data (e.g. the last day's data). But if you have a distributed system, you might as well want to run queries on larger data sets too.
- Geospatial Queries
Powered by Lucene's BKD trees.
---
Disclaimer:
I work for Crate.io, and I also think Timescale are doing awesome stuff in many ways and giving Influx the competition they deserve. I don't see us in direct competition (at least not yet), as Timescale's focus is clearly more on smaller use cases.
> databases providing an abstraction through the Postgres wire protocol
I would not call it an abstraction if one has a full parser, analyzer, planner, and execution engine. It is just a common language ;)