What do you mean by high-frequency data? 100 Hz, 1 kHz, 100 kHz? For those kinds of use cases many time-series DBs break down. We have customers storing many millions of high-frequency measurements per second in arrays.
I would say Postgres on its own is not very storage efficient for large amounts of data, especially if you need any sort of index. Timescale basically mitigates that by automatically creating new tables in the background ("chunks") and keeping the individual tables small.
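To make the chunking idea concrete, here is a minimal sketch (not TimescaleDB's actual implementation; the chunk-naming scheme and interval are invented for illustration) of how rows can be routed to small per-interval tables based on their timestamp:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: each row lands in a "chunk" (a small table)
# determined by which fixed time interval its timestamp falls into.
CHUNK_INTERVAL = timedelta(days=7)
EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def chunk_for(ts: datetime) -> str:
    """Return the name of the chunk table a timestamp belongs to."""
    idx = int((ts - EPOCH) / CHUNK_INTERVAL)
    start = EPOCH + idx * CHUNK_INTERVAL
    return f"chunk_{start:%Y%m%d}"

print(chunk_for(datetime(2021, 10, 4, tzinfo=timezone.utc)))
```

Because each chunk only covers a narrow time range, indexes on recent chunks stay small enough to remain hot in memory even as total data grows.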
TimescaleDB also implements compression. From the docs:
> When compression is enabled, TimescaleDB converts data stored in many rows into an array. This means that instead of using lots of rows to store the data, it stores the same data in a single row. Because a single row takes up less disk space than many rows, it decreases the amount of disk space required, and can also speed up some queries.
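A toy illustration of that rows-into-arrays idea (this is not TimescaleDB's actual on-disk format; the schema and series names are made up): many narrow rows belonging to one series are collapsed into a single row holding arrays of timestamps and values, which is both smaller and more compressible.

```python
# One row per reading: (series, timestamp, value)
rows = [
    ("cpu", "2021-01-01T00:00:00", 0.42),
    ("cpu", "2021-01-01T00:00:10", 0.40),
    ("cpu", "2021-01-01T00:00:20", 0.47),
]

def compress(rows):
    """Group row-per-reading data into one array-valued row per series."""
    by_series = {}
    for series, ts, value in rows:
        entry = by_series.setdefault(series, {"ts": [], "value": []})
        entry["ts"].append(ts)
        entry["value"].append(value)
    # One wide row per series: (series, [timestamps...], [values...])
    return [(s, cols["ts"], cols["value"]) for s, cols in by_series.items()]

compressed = compress(rows)
print(compressed)  # one row instead of three
```

Packing each column into its own array also means similar values sit next to each other, which is what lets columnar compression schemes (delta, run-length, etc.) work well.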
That article takes various concepts from typical TSDB solutions and seemingly only looks at the bad sides. Time series data has many different forms, not every form works for every TSDB solution.
For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.
Thanks for the mention, and I completely agree :-)
Personally, I think there is a lot in this article that is misguided.
For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.
> TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.
This is a broad generalization. Some time-series databases are better at high cardinality than others. Also, what is "high-cardinality" - 100K? 1M? 10M? (We in fact are designed for _higher cardinalities_ than most other time-series databases [0])
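The "combinatorial explosion" is just the product of tag cardinalities: in a tag-indexed TSDB, every distinct tag combination becomes its own series. A back-of-the-envelope sketch (the tag names and cardinalities below are invented for illustration):

```python
import math

# Worst case: one series per distinct tag combination.
tag_cardinalities = {
    "host": 10_000,            # one value per machine
    "endpoint": 500,
    "status_code": 50,
    "customer_id": 1_000_000,  # the classic high-cardinality culprit
}

worst_case_series = math.prod(tag_cardinalities.values())
print(f"{worst_case_series:,}")  # 250,000,000,000,000 potential series
```

In practice only a fraction of the combinations actually occur, but a single unbounded tag like `customer_id` is enough to multiply the series count by orders of magnitude, which is exactly why some TSDB designs cope with it and others fall over.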
> In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data.
We just launched tracing and metrics support in the same backend - in Promscale, built on TimescaleDB [1]
I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed to educate. But I wish they had done more research - because without it, this article (IMO) ends up confusing more than educating.
How does Timescale (a single-purpose database) hold up against SingleStore (a multi-purpose database)? Of course Timescale is cheaper, but other than that, have you folks compared/contrasted against SingleStore as a TSDB?
TimescaleDB performs quite well. One of our unique insights is that it is quite possible to build a best-in-class time-series database on top of Postgres (although it’s not easy ;-)
There are some challenges with building on Postgres - but what we've been able to do is build innovative capabilities that overcome these challenges (e.g. columnar compression in a row-oriented store, multi-node scale-out).
We also have some exciting things that we are announcing this week. Stay tuned :-)
PS - Where did you find that PDF? Thought we took it down (it was hard to keep it up to date :-) )
> Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution
This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics (as in [1])? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.
In this context I have seen other references to: Sawzall (google), Lingo (google), MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?
> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?
I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there): it would definitely be able to handle 600GB and query across it in an ad-hoc manner.
We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.
It depends. Just inserting, indexing, storing, and simple querying can be done with little memory (e.g. a 1:500 memory-to-disk ratio, i.e. 2GB RAM per 1TB of disk). Typical production clusters with high query load are in the 1:150 range (i.e. 64GB RAM for 10TB of disk).
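Taking those ratios literally (they are rules of thumb from the comment above, not hard limits), the sizing arithmetic looks like this:

```python
# RAM sizing rule of thumb: ratios are expressed as RAM:disk.
def ram_gb_needed(disk_tb: float, ratio: int) -> float:
    """RAM in GB for a given disk size in TB at a 1:ratio memory-disk ratio."""
    return disk_tb * 1000 / ratio

print(ram_gb_needed(1, 500))   # light ingest-only workload: 2.0 GB per TB
print(ram_gb_needed(10, 150))  # heavy query load: ~66.7 GB for 10 TB
```

The quoted "64GB RAM for 10TB disk" production figure works out to roughly 1:156, i.e. comfortably in the stated 1:150 range.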
Interesting, so that'd be about 1 vCPU and 4GB RAM per 625GB of data. That seems very price efficient. Would something like AWS's EBS be sufficient for this? Would you need one of the higher tiers? Or would you be looking at running this on a box with locally attached storage?
Most CrateDB clusters run on cloud provider hardware (Azure, AWS, Alibaba). Using EBS (gp2 or now gp3) is also quite common. Thanks to the indexing/storage engine, general-purpose disks are typically sufficient, and faster disks have little to no advantage.
For longer-timescale time series I still recommend Druid as the go-to, mainly because if you make use of its ahead-of-time aggregations (which you can do for real-time or scaled-out batch ingestion), your ad-hoc queries can execute extremely quickly even over very large datasets.
Druid really only has one downside, which is that it's still a bit of a pain to set up. It's gotten a ton better recently, and I have been contributing changes to make it work better out of the box with common big-data tooling like Avro.
For performance it's the top dog, except for really naive queries that are dominated by scan performance. For those you are best off with Clickhouse; its vectorized query engine is extremely fast for simpler, scan-heavy workloads.
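The ahead-of-time aggregation ("rollup") idea mentioned above can be sketched like this (a toy model, not Druid's actual ingestion spec; event fields and bucket size are invented): raw events are pre-aggregated at ingest into coarse time buckets, so ad-hoc queries scan the small rollup table instead of the raw events.

```python
from collections import defaultdict

raw_events = [
    {"ts": 1000, "endpoint": "/a", "latency_ms": 12},
    {"ts": 1030, "endpoint": "/a", "latency_ms": 20},
    {"ts": 1045, "endpoint": "/b", "latency_ms": 7},
    {"ts": 1070, "endpoint": "/a", "latency_ms": 16},
]

def rollup(events, bucket_s=60):
    """Aggregate raw events into (time_bucket, endpoint) -> count / sum."""
    agg = defaultdict(lambda: {"count": 0, "latency_sum": 0})
    for e in events:
        key = (e["ts"] // bucket_s * bucket_s, e["endpoint"])
        agg[key]["count"] += 1
        agg[key]["latency_sum"] += e["latency_ms"]
    return dict(agg)

table = rollup(raw_events)
# Query time: average latency per endpoint, answered from the rollup
# rather than by scanning the raw events.
avg_a = (sum(v["latency_sum"] for (_, ep), v in table.items() if ep == "/a")
         / sum(v["count"] for (_, ep), v in table.items() if ep == "/a"))
print(avg_a)  # 16.0
```

The trade-off is that you give up per-event detail inside a bucket; in exchange, query cost scales with the number of buckets rather than the number of raw events.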
What are the best books out there to learn about Time Series databases? (there are already a million for relational and graph, but haven't seen one for time series). Bonus on how to implement one
JOINs vs no JOINs isn't an ad-hoc vs. not-ad-hoc thing but more of a schema thing. If you try to jam a star schema into it, you aren't going to have a good time. This is true for pretty much all of these more optimised stores. If you have a star schema and want to do those sorts of queries (and performance or cost aren't your #1 driving factors), then the better tool is a traditional data warehouse like BigQuery.
This probably won't be the case forever, though: there is significant progress in the Presto/Trino world to enable push-down for stores like Druid, which would allow you to source most of your fact-table data from other sources and then join into your events/time-orientated data from Druid very efficiently.
Some of the biggest changes within ES come from Lucene, like a _massive_ reduction in memory footprint, opening ES up to use cases that weren't even possible before.
> If your business model cannot survive when a critical upstream piece of your infrastructure moves to GPL, you probably have a bad business model to begin with.
To be clear, CrateDB started out as OSS and we decided to stay OSS. Elasticsearch used the Apache License and so did CrateDB, all in the spirit of OSS. Elastic, however, are now the ones who decided that their business model isn't viable anymore.
> It sounds like they are making up excuses for not wanting to fully Open Source their code
We do want to make it fully open source! Everything that was under a more restrictive license is going to be offered under the Apache License.
> This begs the question: isn't "a restrictive OSS licence" not less "fully open source" than a more permissive licence like GPL, MIT or BSD?
We're going to change CrateDB fully to the Apache License v2 ;) I would say that counts as a "more permissive" license.
> Is that really only because of some enterprises not liking GPL?
There are various reasons for the change. A big part is definitely the spirit of many of our contributors: we built CrateDB on open source software and also want to make the software available as open source. It had also been planned for quite some time to become more open.
Copyleft licensing was invented in an era when most things were written in C, and software fell into roughly four categories:
• Unix-ish black-box "primitive" tools, that were so focused on accomplishing one fundamental job that they were essentially "final" in their interfaces, with there being no point to extending them any further; where you reused them by executing them, not by integrating with them.
• A library for such Unix-ish black-box tools to use. Most tools that used any libraries at all, would use one main library to accomplish their one primary purpose, effectively making the tool a "driver program" for the library.
• Academic data-science/statistics code.
• Cathedral-style highly-integrated software, e.g. Windows.
Copyleft was mostly devised for the purpose of licensing the codebases of the first two types.
As Unix tools are self-contained, they only "infect" direct forks. The GPL originally intentionally avoided infecting the things that called (i.e. interacted with) those tools — because, back then, a downstream project that "uses" a tool wasn't vendoring in its own version, but rather relying on the system installation of that tool, through that tool's known API; and it wouldn't make sense for a license to be infectious through a standardized API.
It was intentional that libraries would infect their downstream clients with copyleft; but downstream clients, back then, were mostly just those single-purpose tools. It wouldn't make sense for e.g. libgit to be GPL-licensed, but for git(1) to be proprietary.
Of course, there was also an awareness that the Cathedral-style codebases would have their whole monolith infected if they used the GPLed library. The idea there, though, wasn't to actually cause that infection — it was to inhibit Cathedral-style codebases from using GPLed software at all.
(With the parallel awareness that such entities could always reach out to the project maintainers, and buy a separate license, just like you can purchase a license from any IP-holder. There were few-enough contributors per project, back then, that "copyright assignment" and such wasn't a concern; you could just get a proprietary license from the one dude who built the whole thing.)
And that same consideration was implicit with academic use of GPLed software. FOSS programmers, back then, considered academic (or non-profit) integration/derivation of their software to be something they'd grant a free license for if asked; and academics knew this, and so didn't bother to ask for such licenses, because they knew they'd almost assuredly get one, for free, if-and-when it ever became important to do so.
---
The GPL was well-suited to this early-90s software IP ecosystem. It doesn't fit nearly as well in the modern software IP ecosystem.
There's a whole fifth category of software — semi-Cathedral, semi-Bazaar mega-tools, like youtube-dl or Calibre; or mega-libraries, like Qt, or LLVM, or WebKit; which both are components while also consuming many components themselves. The GPL never "expected" this type of software. This kind of software just didn't exist back then; only its entirely-Cathedral final-product equivalent did.
Which is a problem, because it's impossible to build something like WebKit or LLVM in a self-contained, "you call it over a standard interface" sort of way, where it's non-infectious. These days, a lot more projects are infectious, even when "integrated" at a much higher level of abstraction, than the GPL was ever intended to require.
I'm not talking about the GPL or the LGPL specifically, but rather I'm talking about the thing that was in Stallman's head before either of them existed — the concept of copyleft, of an infectious "Free" license.
Yes, you can create different implementations of that concept, that are variously infectious. But the reason I laid out the whole state of the ecosystem as RMS would have seen it when he was still just conceptualizing copyleft, is that in that ecosystem, "infectiousness" was something that's almost trivial, toothless.
In the ecosystem of the early 90s, licensing a codebase was just a consideration of who you trusted to freely use and modify your thing ("us", hackers); vs. who you wanted to not use your thing, unless they paid you ("them", corporate.) Copyleft neatly prevented "them" from swiping and profiting off of the software created by "us", while not really inhibiting anything that "us hackers" wanted to do with that same software.
Contrast to the ecosystem of today: there's an entire category of people — individuals who start projects as hackers or academics, but then build huge software businesses around them — that didn't even exist back in the 90s.
Google is the epitome of this: at its inception, BackRub (Google Search) was exactly the type of project that copyleft was designed to avoid restricting. But it evolved, through commercialization, patenting, SaaS-ification, and scale, into exactly the type of project that copyleft wants to "shun out of" the FOSS ecosystem. (Not that Google Search integrated any FOSS libraries; just that it could have.)
Google's story is the story of most software projects today. Every developer is considering their project as a potential "open core" for a SaaS, or considering having an "enterprise version" of their tool, or considering licensing their algorithm as a plugin for some big studio to redistribute. Which is exactly why many programmers avoid integrating copyleft software into even their hobby projects. Why build on GCC when you can build on LLVM, and ensure that there'll be no legal problems?
Modern copyleft licenses are ever-more-strained legal contortions to make a design that's no longer very applicable to the modern software IP ecosystem, work for it anyway. They're licenses with epicycles.
Sure, I can in fact link AGPLed and LGPLed system libraries in my language runtime; and in a pinch — if there's no equally-good alternative — I'll take the time, work out the precise legal implications, and go ahead with it.
But if there's a BSD or MIT-licensed (or even Apache-licensed) alternative to those libraries? I'll choose that one. Because, in the modern landscape, by doing so, I'm saving myself, my future self, and my future hypothetical SaaS's future hypothetical lawyers, a lot of time and effort.
Re:
> it was to inhibit Cathedral-style codebases from using GPLed software at all.
I'm surprised at the notion that RMS ever had the intention of inhibiting GPL use in Cathedral-style codebases. And I agree with others that big frameworks existed back then also.
Sure, big frameworks existed back then, but they were all either
1. literally "frameworks" in the inversion-of-control sense — where your code is a script that runs "inside" the framework — and you don't ship the framework to your customers as part of your product, but rather walk them through getting it from its own vendor as a prerequisite step to installing your product (e.g. TeX); or
2. proprietary, not copyleft-licensed (e.g. game-console SDKs.)
If anyone has a good counter-example, I'm all ears :)
How does the user benefit from having a phone built on top of open source software if they cannot patch a well-known security vulnerability, because the manufacturer can't be bothered to run a build with the latest upstream version?
> How does the user benefit from having a phone built on top of open source software if they cannot patch a well-known security vulnerability, because the manufacturer can't be bothered to run a build with the latest upstream version?
Well, depending on the incentives and restrictions involved, an ecosystem of 3rd-party builds is a potentially viable escape hatch for the user from the manufacturer's grip.
Of course the sticking point is the degree to which the hardware requires proprietary and opaque binary blobs in order to enable important user-facing features. But then, that isn't anything really new, as open source PC operating systems have been dealing with this issue since forever, with the caveat that PC hardware is mostly modular, so having or swapping in well-supported components is an option, whereas smartphones are an integrated slab of metal, plastic, and glass, with "no user serviceable parts inside" as the status quo.
But even that caveat has precedents, in non-PC devices such as consumer networking gear that only became well supported through aggressive GPL license enforcement actions that freed some of the necessary code.
You guys are missing the point: some company already cornered the market by using open source code and not contributing back their work. They have a head start over everyone who can ever be involved.
That would be a fully GPL-compliant product, which is not what the comment was talking about.
Someone said companies use GPL software, add their business logic or drivers, and never contribute back: e.g. Android phones.
Then someone else said the insane thing that the user benefits.
I pointed out that if there is a security flaw, you CANNOT build/patch, because you do not have all the source (e.g. alternative Android OSes cannot use the camera or radio for lack of kernel drivers).
Your post above doesn't mention GPL compliance, only vendors using ancient versions of open source code. If you don't have the source, of course you can't do anything. So you ask the vendor for source and if they refuse then you contact the Linux kernel community to enforce GPL compliance. At some point the source will come out, even if the vendor has to get sued in order to do it.
Can you clarify why the GPL doesn't work for you. Clearly you release all the source code for your product. Is it that you have downstream customers that make modifications to your source code, and they don't want to have to release their changes for their products?
Yes, we have customers using CrateDB as part of their proprietary product.
Also, the SSPL is so vague that we would probably not only have to release CrateDB itself - which we already do - but also everything we use for the services we provide. We could also never make any kind of deal with OEMs, etc.
- Scalability
CrateDB is built for horizontal scale from the ground up on top of distributed technologies. We have customers using clusters with 80+ nodes in production for many years now.
Timescale just released their multi-node feature in beta, and they follow a different concept than we do. While Timescale uses a leader (access node) / follower (data node) model with a single point of failure, CrateDB is built on a shared-nothing architecture. Many features you would want to see in a distributed system are present in CrateDB and still missing in TS:
- cluster wide replication
- automatic rebalancing
- cluster wide backup
- shared nothing architecture / no single point of failure
- Full Text Search
CrateDB is built on Lucene and parts of ES, and includes search capabilities you would typically need a separate product for when using PG/TS.
- Distributed Query Engine
Yes, PG/TS are fast if you query "small" amounts of data (e.g. the last day's data). But if you have a distributed system, you might as well want to run queries on larger data sets too.
- Geospatial Queries
Powered by Lucene's BKD trees.
---
Disclaimer:
I work for Crate.io, and I also think Timescale are doing awesome stuff in many ways and giving Influx the competition they deserve. I don't see us in direct competition (at least not yet), as Timescale's focus is clearly more on smaller use cases.
> databases providing an abstraction through the Postgres wire protocol
I would not call it an abstraction if one has a full parser, analyzer, planner, and execution engine. It is just a common language ;)