Hacker News: proddata's comments

CrateDB DevRel here :)

> databases providing an abstraction through the Postgres wire protocol

I would not call it an abstraction if one has a full parser, analyzer, planner, and execution engine. It is just a common language ;)


What do you mean by high-frequency data? 100Hz, 1kHz, 100kHz? For those kinds of use cases, many time-series DBs break apart. We have customers storing multiple millions of high-frequency measurements per second in arrays.

I would say Postgres itself is not very storage efficient for large amounts of data, especially if you need any sort of indexes. Timescale basically mitigates that by automatically creating new tables in the background ("chunks") and keeping individual tables small.
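The chunking idea can be sketched in a few lines of Python; this is a toy illustration (the 7-day interval and the routing logic are made up for the example, not Timescale's actual implementation, which manages chunks as real Postgres tables):

```python
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=7)  # illustrative chunk width
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def chunk_key(ts: datetime) -> datetime:
    """Map a timestamp to the start of its time bucket ("chunk")."""
    return EPOCH + ((ts - EPOCH) // CHUNK_INTERVAL) * CHUNK_INTERVAL

# Route incoming rows into small per-chunk "tables"
chunks: dict[datetime, list[tuple]] = {}
for ts, value in [
    (datetime(2021, 10, 1, tzinfo=timezone.utc), 21.5),
    (datetime(2021, 10, 2, tzinfo=timezone.utc), 22.1),
    (datetime(2021, 10, 9, tzinfo=timezone.utc), 19.8),
]:
    chunks.setdefault(chunk_key(ts), []).append((ts, value))

print(len(chunks))  # the three rows landed in 2 week-sized chunks
```

Each chunk stays small, so its indexes stay small and hot in memory, which is where the efficiency win comes from.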


TimescaleDB also implements compression. From the docs:

When compression is enabled, TimescaleDB converts data stored in many rows into an array. This means that instead of using lots of rows to store the data, it stores the same data in a single row. Because a single row takes up less disk space than many rows, it decreases the amount of disk space required, and can also speed up some queries.

(Timescale employee)
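The row-to-array conversion the docs describe can be illustrated with a toy Python sketch (the schema and values are made up; the real on-disk format differs):

```python
from collections import defaultdict

# Narrow rows: (device, ts, value), a made-up schema
rows = [
    ("sensor-1", 1, 20.1),
    ("sensor-1", 2, 20.3),
    ("sensor-1", 3, 20.2),
    ("sensor-2", 1, 18.9),
    ("sensor-2", 2, 19.0),
]

# One wide row per device: each column becomes an array
compressed = defaultdict(lambda: {"ts": [], "value": []})
for device, ts, value in rows:
    compressed[device]["ts"].append(ts)
    compressed[device]["value"].append(value)

# 5 narrow rows became 2 wide rows; per-row overhead is paid once per
# array, and homogeneous arrays also compress much better on disk.
print(len(rows), "->", len(compressed))  # 5 -> 2
```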


Generally in the 100Hz or 200Hz range for the time being. What do you mean by break apart?


Not being able to keep up with the incoming data. But 100-200Hz I'd consider fine for most databases.


That article takes various concepts from typical TSDB solutions and seemingly only looks at their downsides. Time-series data comes in many different forms, and not every form works with every TSDB solution.

For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.


(TimescaleDB co-founder)

Thanks for the mention, and I completely agree :-)

Personally, I find a lot in this article misguided.

For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.

  TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.
This is a broad generalization. Some time-series databases are better at high cardinality than others. Also, what is "high cardinality" - 100K? 1M? 10M? (We are in fact designed for _higher cardinalities_ than most other time-series databases [0])
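The "combinatorial explosion" under discussion is just the product of the tag cardinalities; a toy calculation (the tag names and counts are made up):

```python
from math import prod

# Hypothetical tag value counts; distinct series = product of cardinalities
tags = {"host": 1_000, "region": 20, "endpoint": 500, "status": 5}
series = prod(tags.values())
print(f"{series:,} distinct series")  # 50,000,000

# One extra tag with 100 values multiplies the series count by 100
assert prod([*tags.values(), 100]) == series * 100
```

Whether 50M series is a problem depends entirely on how the store indexes series, which is why "high cardinality" needs a number attached to it.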

  In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data. 
We just launched tracing and metrics support in the same backend - in Promscale, built on TimescaleDB [1]

I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed to educate. But I wish they had done more research - because without it, this article (IMO) ends up confusing more than educating.

For anyone curious on our definition of "time-series data" and "time-series databases": https://blog.timescale.com/blog/what-the-heck-is-time-series...

[0] https://blog.timescale.com/blog/what-is-high-cardinality-how...

[1] https://blog.timescale.com/blog/what-are-traces-and-how-sql-...


How does Timescale (a single-purpose database) hold up against SingleStore (a multi-purpose database)? Of course, Timescale is cheaper, but other than that, have you folks compared / contrasted against SingleStore as a TSDB?

PS https://www.timescale.com/papers/timescaledb.pdf is 404


TimescaleDB performs quite well. One of our unique insights is that it is quite possible to build a best-in-class time-series database on top of Postgres (although it’s not easy ;-)

Here is one benchmark: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

There are some challenges with building on Postgres - but what we’ve been able to do is build innovative capabilities that overcome these challenges (Eg columnar compression in a row-oriented store, multi-node scale out).

We also have some exciting things that we are announcing this week. Stay tuned :-)

PS - Where did you find that PDF? Thought we took it down (it was hard to keep it up to date :-) )


Thanks.

Re: paper: I stumbled upon it when going through other timescaledb threads on news.yc, specifically here, https://news.ycombinator.com/item?id=13943939 (5 yrs ago)


I had a serious case of deja vu reminding me of your article on compression in timescaledb :-D


Thanks for reading that article :-)


> Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution

This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics (as in [1])? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.

In this context I have seen other references to: Sawzall (google), Lingo (google), MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?

[1]: https://danluu.com/metrics-analytics/


> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?

I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there): it definitely would be able to handle 600GB and query across it in an ad-hoc manner.

We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.


What kind of hardware requirements would be needed to store and query this much data?


Depends. Just inserting, indexing, storing, and simple querying can be done with little memory (i.e. a 1:500 memory-disk ratio, 0.5GB RAM per 1TB disk). Typical production clusters with high query load are in the 1:150 range (i.e. 64GB RAM for 10TB disk).

Otherwise, typical general-purpose hardware (standard SSDs, 1:4 vCPU:memory ratios, ...)


Interesting, so that'd be about 1 vCPU and 4GB RAM per 625GB of data. That seems very price efficient. Would something like AWS's EBS be sufficient for this? Would you need one of the higher tiers? Or would you be looking at running this on a box with locally attached storage?


Most CrateDB clusters run on cloud provider hardware (Azure, AWS, Alibaba). Using EBS (gp2 or now gp3) is also quite common. Due to the indexing / storage engine, gp disks are typically sufficient, and faster disks have little to no advantage.


Wouldn't 0.5GB RAM per 1TB disk be more like 1:2000 memory-disk-ratio? Which is even better!


Sorry, I mixed up the numbers: it's 2GB memory (0.5GB heap) per 1TB. So 1:500 is correct.
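With the corrected figure, the sizing arithmetic works out like this (illustrative only, not a sizing guarantee):

```python
GB_PER_TB = 1_000  # decimal TB, matching the figures in this thread

def ram_for_disk(disk_gb: float, disk_per_ram: float) -> float:
    """RAM (GB) implied by a 1:N memory-to-disk ratio."""
    return disk_gb / disk_per_ram

print(ram_for_disk(1 * GB_PER_TB, 500))   # 2.0 GB RAM per 1 TB (light load)
print(ram_for_disk(10 * GB_PER_TB, 150))  # ~67 GB RAM for 10 TB (query-heavy)
```

The second figure lines up with the "64GB RAM for 10TB disk" production example quoted earlier.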


For longer-scale time series I still recommend Druid as the go-to, mainly because if you make use of its ahead-of-time aggregations (which you can do for real-time or scale-out batch ingestion), your ad-hoc queries can execute extremely quickly even over very large datasets.
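Ahead-of-time aggregation (Druid calls it rollup) amounts to pre-grouping events at ingest time; a toy Python sketch with a made-up schema:

```python
from collections import defaultdict

# Raw events: (epoch_sec, page, latency_ms); schema is made up
events = [
    (1000, "/home", 120),
    (1003, "/home", 80),
    (1010, "/about", 200),
    (1062, "/home", 95),
]

GRANULARITY = 60  # roll up to minute buckets at ingest time

# Pre-aggregate to one row per (bucket, page): [count, latency_sum]
rollup = defaultdict(lambda: [0, 0])
for ts, page, latency in events:
    bucket = ts // GRANULARITY * GRANULARITY
    agg = rollup[(bucket, page)]
    agg[0] += 1
    agg[1] += latency

# Ad-hoc queries now scan 3 rollup rows instead of 4 raw events; at real
# scale the reduction is typically orders of magnitude.
print(len(rollup))
```

The trade-off is that you can only answer questions expressible over the pre-aggregated dimensions and metrics, which is exactly why it suits long-timescale analytics.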

Druid really only has one downside: it's still a bit of a pain to set up. It's gotten a ton better recently, and I have been contributing changes to make it work better out of the box with common big-data tooling like Avro.

For performance it's the top dog, except for really naive queries that are dominated by scan performance. For those you are best off with Clickhouse; its vectorized query engine is extremely fast for simpler, scan-heavy workloads.


We used Clickhouse on about 80TB in a RAID10 setup. It was extremely fast.


What are the best books out there to learn about time-series databases? (There are already a million for relational and graph databases, but I haven't seen one for time series.) Bonus points for how to implement one.


CMU did a time-series database lecture series a couple of years ago:

https://www.youtube.com/playlist?list=PLSE8ODhjZXjY0GMWN4X8F...

Things have changed a little bit since then, but not much.


Excellent, thanks


If you want something like Honeycomb that scales better, then maybe look at Druid.


Last time I checked, Druid was not very good at ad-hoc tasks because it lacked joins and SQL support was sketchy. How is it now?


Limited JOIN support. SQL is now very good.

JOINs vs no JOINs isn't an ad-hoc vs not-ad-hoc thing but more of a schema thing. If you try to jam a star schema into it, you aren't going to have a good time. This is true for pretty much all of these more optimised stores. If you have a star schema and want to do those sorts of queries (and performance or cost aren't your #1 driving factors), then the better tool is a traditional data warehouse like BigQuery.

This probably won't be the case forever, though; there is significant progress in the Presto/Trino world to enable push-down for stores like Druid, which would allow you to source most of your fact table data from other sources and then join into your events/time-oriented data from Druid very efficiently.


Take a look at IronDB too. High-scale distributed implementation.


Sorry, but this is not true at all.

Some of the biggest changes within ES come from Lucene, like a _massive_ reduction in memory footprint, opening ES up to use cases that weren't even possible before.


If you are looking for an OSS ES replacement, CrateDB might also be worth a look :)

Basically a best-of-both-worlds combination of ES and PostgreSQL, perfect for time-series and log analytics.


The thing is that all the arguments they now bring up for the move were true in 2018 as well ...


> So, they don't run Linux, don't use glibc? That can't be all that common? (I mean sure, there's the bsds.. But still..).

We do run Linux :)

But there is a difference between building on and building with.


> If your business model cannot survive when a critical upstream piece of your infrastructure moves to GPL, you probably have a bad business model to begin with.

To be clear, CrateDB started out as OSS, and we decided to stay OSS. Elasticsearch used the Apache License and so did CrateDB, all in the spirit of OSS. Elastic, however, are now the ones who decided that their business model isn't viable anymore.

> It sounds like they are making up excuses for not wanting to fully Open Source their code

We do want to make it fully open source! Everything that was under a more restrictive license is going to be offered under the Apache License.


Thanks for correcting me!

> We do want to make it fully open source!

This raises the question: isn't "a restrictive OSS licence" less "fully open source" than a more permissive licence like GPL, MIT or BSD?

If you are fully committed to OSS, why not go full-oss, instead of retaining control through a restrictive OSS licence?

Is that really only because of some enterprises not liking GPL?


> This raises the question: isn't "a restrictive OSS licence" less "fully open source" than a more permissive licence like GPL, MIT or BSD?

We're going to change CrateDB fully to the Apache License v2 ;) I would say that counts as a "more permissive" license.

> Is that really only because of some enterprises not liking GPL?

There are various reasons for the change. A big part is definitely also the spirit of many of our contributors. We built CrateDB on open source software and also want to make the software available as open source. It had also been planned for quite some time to become more open.


> why not go full-oss, instead of retaining control through a restrictive OSS licence?

The main point of copyleft is to pass down freedoms to use/modify/distribute all the way to the end user.

Instead we got locked-down privacy-breaching smartphones, IoT devices, SaaS, where only the manufacturer benefits from OSS.


Copyleft licensing was invented in an era when most things were written in C, and software fell into roughly four categories:

• Unix-ish black-box "primitive" tools, that were so focused on accomplishing one fundamental job that they were essentially "final" in their interfaces, with there being no point to extending them any further; where you reused them by executing them, not by integrating with them.

• A library for such Unix-ish black-box tools to use. Most tools that used any libraries at all, would use one main library to accomplish their one primary purpose, effectively making the tool a "driver program" for the library.

• Academic data-science/statistics code.

• Cathedral-style highly-integrated software, e.g. Windows.

Copyleft was mostly devised for the purpose of licensing the codebases of the first two types.

As Unix tools are self-contained, they only "infect" direct forks. The GPL originally intentionally avoided infecting the things that called (i.e. interacted with) those tools — because, back then, a downstream project that "uses" a tool wasn't vendoring in its own version, but rather relying on the system installation of that tool, through that tool's known API; and it wouldn't make sense for a license to be infectious through a standardized API.

It was intentional that libraries would infect their downstream clients with copyleft; but downstream clients, back then, were mostly just those single-purpose tools. It wouldn't make sense for e.g. libgit to be GPL-licensed, but for git(1) to be proprietary.

Of course, there was also an awareness that the Cathedral-style codebases would have their whole monolith infected if they used the GPLed library. The idea there, though, wasn't to actually cause that infection — it was to inhibit Cathedral-style codebases from using GPLed software at all.

(With the parallel awareness that such entities could always reach out to the project maintainers, and buy a separate license, just like you can purchase a license from any IP-holder. There were few-enough contributors per project, back then, that "copyright assignment" and such wasn't a concern; you could just get a proprietary license from the one dude who built the whole thing.)

And that same consideration was implicit with academic use of GPLed software. FOSS programmers, back then, considered academic (or non-profit) integration/derivation of their software to be something they'd grant a free license for if asked; and academics knew this, and so didn't bother to ask for such licenses, because they knew they'd almost assuredly get one, for free, if-and-when it ever became important to do so.

---

The GPL was well-suited to this early-90s software IP ecosystem. It doesn't fit nearly as well in the modern software IP ecosystem.

There's a whole fifth category of software — semi-Cathedral, semi-Bazaar mega-tools, like youtube-dl or Calibre; or mega-libraries, like Qt, or LLVM, or WebKit; which both are components while also consuming many components themselves. The GPL never "expected" this type of software. This kind of software just didn't exist back then; only its entirely-Cathedral final-product equivalent did.

Which is a problem, because it's impossible to build something like WebKit or LLVM in a self-contained, "you call it over a standard interface" sort of way, where it's non-infectious. These days, a lot more projects are infectious, even when "integrated" at a much higher level of abstraction, than the GPL was ever intended to require.


The distinction between GPL and LGPL seems completely lost on you.

LGPL existed since 1991. Also large libraries and frameworks.


I'm not talking about the GPL or the LGPL specifically, but rather I'm talking about the thing that was in Stallman's head before either of them existed — the concept of copyleft, of an infectious "Free" license.

Yes, you can create different implementations of that concept, that are variously infectious. But the reason I laid out the whole state of the ecosystem as RMS would have seen it when he was still just conceptualizing copyleft, is that in that ecosystem, "infectiousness" was something that's almost trivial, toothless.

In the ecosystem of the early 90s, licensing a codebase was just a consideration of who you trusted to freely use and modify your thing ("us", hackers); vs. who you wanted to not use your thing, unless they paid you ("them", corporate.) Copyleft neatly prevented "them" from swiping and profiting off of the software created by "us", while not really inhibiting anything that "us hackers" wanted to do with that same software.

Contrast to the ecosystem of today: there's an entire category of people — individuals who start projects as hackers or academics, but then build huge software businesses around them — that didn't even exist back in the 90s.

Google is the epitome of this: at its inception, BackRub (Google Search) was exactly the type of project that copyleft was designed to avoid restricting. But it evolved, through commercialization, patenting, SaaS-ification, and scale, into exactly the type of project that copyleft wants to "shun out of" the FOSS ecosystem. (Not that Google Search integrated any FOSS libraries; just that it could have.)

Google's story is the story of most software projects today. Every developer is considering their project as a potential "open core" for a SaaS, or considering having an "enterprise version" of their tool, or considering licensing their algorithm as a plugin for some big studio to redistribute. Which is exactly why many programmers avoid integrating copyleft software into even their hobby projects. Why build on GCC when you can build on LLVM, and ensure that there'll be no legal problem?

Modern copyleft licenses are ever-more-strained legal contortions to make a design that's no longer very applicable to the modern software IP ecosystem, work for it anyway. They're licenses with epicycles.

Sure, I can in fact link AGPLed and LGPLed system libraries in my language runtime; and in a pinch — if there's no equally-good alternative — I'll take the time, work out the precise legal implications, and go ahead with it.

But if there's a BSD or MIT-licensed (or even Apache-licensed) alternative to those libraries? I'll choose that one. Because, in the modern landscape, by doing so, I'm saving myself, my future self, and my future hypothetical SaaS's future hypothetical lawyers, a lot of time and effort.


Re: > it was to inhibit Cathedral-style codebases from using GPLed software at all.

I'm surprised at the notion that RMS ever had the intention of inhibiting GPL use in Cathedral-style codebases. And I agree with others that big frameworks existed back then also.


Sure, big frameworks existed back then, but they were all either

1. literally "frameworks" in the inversion-of-control sense — where your code is a script that runs "inside" the framework — and you don't ship the framework to your customers as part of your product, but rather walk them through getting it from its own vendor as a prerequisite step to installing your product (e.g. TeX); or

2. proprietary, not copyleft-licensed (e.g. game-console SDKs.)

If anyone has a good counter-example, I'm all ears :)


Especially since the term cathedral was popularized by esr to call out the development style of the GNU project.


>Instead we got locked-down privacy-breaching smartphones, IoT devices, SaaS, where only the manufacturer benefits from OSS.

You forget the user.


i will bite.

How does the user benefit from a phone built on top of open source software, when they cannot patch a well-known security vulnerability because the manufacturer can't be bothered to run a build with the latest upstream version?


> How does the user benefit from a phone built on top of open source software, when they cannot patch a well-known security vulnerability because the manufacturer can't be bothered to run a build with the latest upstream version?

Well, depending on the incentives and restrictions involved, an ecosystem of 3rd-party builds is a potentially viable escape hatch for the user from the manufacturer's grip.

Of course the sticking point is the degree to which the hardware requires proprietary and opaque binary blobs in order to enable important user-facing features. But then, that isn't anything really new, as open source PC operating systems have been dealing with this issue since forever, with the caveat that PC hardware is mostly modular, so having or swapping in well-supported components is an option, whereas smartphones are an integrated slab of metal, plastic, and glass, with "no user serviceable parts inside" as the status quo.

But even that caveat has precedents, in non-PC devices such as consumer networking gear that only became well supported through aggressive GPL license enforcement actions that freed some of the necessary code.


You guys are missing the point that some company already cornered the market by using open source code and not contributing back their work. They have a head start over everyone who could ever be involved.


The user does not care; they throw it away. But the user cares about having a good operating system and good apps, and those are built on OSS.


They can backport the patch for the security fix themselves and rebuild the old version, or band together with other users to do the same.


That would be a fully GPL-compliant product, which is not what the comment was talking about.

Someone said companies use GPL software, add their business logic or drivers, and never contribute back: e.g. Android phones.

Then someone else said the insane thing that the user benefits.

I pointed out that if there is a security flaw, you CANNOT build/patch because you do not have all the source (e.g. alternative Android OSes cannot use the camera or radio for lack of kernel drivers).


Your post above doesn't mention GPL compliance, only vendors using ancient versions of open source code. If you don't have the source, of course you can't do anything. So you ask the vendor for source and if they refuse then you contact the Linux kernel community to enforce GPL compliance. At some point the source will come out, even if the vendor has to get sued in order to do it.

https://sfconservancy.org/copyleft-compliance/


No, I did not forget the user. The user is the victim.


Can you clarify why the GPL doesn't work for you? Clearly you release all the source code for your product. Is it that you have downstream customers who make modifications to your source code and don't want to have to release their changes for their products?


Yes, we have customers using CrateDB as part of their proprietary product.

Also, the SSPL is so vague that we probably would not only have to release CrateDB itself - which we already do - but also everything we use for the services we provide. And we could never make any kind of deals with OEMs, etc.


Fair point - I will review this with our marketing and get that fixed


Many reasons, actually ...

- Scalability: CrateDB is built for horizontal scale from the ground up, on top of distributed technologies. We have customers running clusters with 80+ nodes in production for many years now.

Timescale just released their multi-node feature in beta, and they follow a different concept than we do. While Timescale uses a leader (access node) / follower (data node) model with a single point of failure, CrateDB is built on a shared-nothing architecture. Many features you would want to see in a distributed system are present in CrateDB and still missing in TS:

  - cluster-wide replication
  - automatic rebalancing
  - cluster-wide backup
  - shared-nothing architecture / no single point of failure

- Full-Text Search: CrateDB is built on Lucene and parts of ES, and includes search capabilities you would typically need a separate product for when using PG/TS.

- Distributed Query Engine: Yes, PG/TS are fast if you query "small" amounts of data (e.g. the last day's data). But if you have a distributed system, you might as well want to run queries on larger data sets.

- Geospatial Queries: powered by Lucene's BKD trees

---

Disclaimer: I work for Crate.io and I also think Timescale are doing awesome stuff in many ways and give Influx the competition they deserve. I don't see us in direct competition (at least not yet), as the focus of Timescale is clearly more on smaller use cases.

