On DataFrame datatype in Ruby (2016)

MrPowers · on May 30, 2021

All programming languages should have a PyArrow-backed DataFrame data structure. After working with DataFrames in Python or Spark, you'll never want to go back to manipulating nested arrays.

Ruby, Go, Scala, and Java should all build these libs (Scala has Spark for cluster computing, but it needs a good Pandas-like single-node lib) to attract data engineers. I'd imagine it's relatively easy to add DataFrame support now because PyArrow does the heavy lifting.

teruakohatu · on May 30, 2021

I am not sure how useful this five-year-old article is. I stopped following the SciRuby project a long time ago.

The DaRu (Ruby DataFrame) sub-project's last commit was around a year ago, but the daru-view sub-project (visualization and plotting for web apps) seems to be maintained and updated frequently:

https://github.com/SciRuby/daru

https://github.com/SciRuby/daru-view

While I like Ruby's syntax and community, it is really hard to compete in the high-level scientific programming market against Python, R, and Julia (and Mathematica and MATLAB).

zverok · on May 30, 2021

(author here) Me neither! Don't know how or why it suddenly appeared on the HN main :)

Just to provide a bit of personal context: this article prompted the creator of DaRu (Sameer Deshmukh) to contact me and propose that we work on DaRu together, and so I did (see @zverok here: https://github.com/SciRuby/daru/graphs/contributors). I was also, for some time, SciRuby/DaRu's mentor for Google Summer of Code (and, IIRC, it was my initial idea that daru-view grew from).

Also, since that article, an independent dataframe library, https://github.com/ankane/rover, was created by Andrew Kane, handling some of the API and implementation in a cleaner way.

That being said, I am not sure that DaRu or Rover (or the "dataframe" idea in general) has enough visibility in the Ruby community. It is mostly thought of as "some special scientific thing", while I believe in 2021 it should be seen as one of the necessary everyday high-level datatypes.

That's what I'd focus this article on if I were writing it today.

mamcx · on May 30, 2021

> I believe in 2021 it should be seen as one of the necessary everyday high-level datatypes.

I'd add that the datatype alone is not enough, because it is "this close" to being the foundation for relational programming (P.S.: I'm building a relational language where you could say everything is like a data frame/relation: https://tablam.org).

> It is mostly thought of as "some special scientific thing"

This is part of the problem, truly. It is so focused on "science", when I think it is better to frame it as data manipulation like you do with SQL tables/views, which makes it much more general than what it is used for...

teruakohatu · on May 30, 2021

Thank you for the update.

> I believe in 2021 it should be seen as one of the necessary everyday high-level datatypes.

I agree. It is one of those data structures that developers tend to reimplement badly using arrays/lists and dictionaries, usually without even realizing they are doing so.
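
To make that concrete, here is a minimal sketch of what such an accidental reimplementation tends to look like. It is written in Python rather than Ruby purely for illustration; the `orders` data and the `total_by_city` helper are made up for this example and do not come from the article or from any library.

```python
from collections import defaultdict

# Made-up sample rows, stored the ad hoc way: a list of dictionaries.
orders = [
    {"city": "Kyiv", "amount": 10},
    {"city": "Lviv", "amount": 5},
    {"city": "Kyiv", "amount": 7},
]


def total_by_city(rows):
    """Hand-rolled 'group by city, sum amount' over a list of dicts."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)


print(total_by_city(orders))
```

With a dataframe library the loop disappears; in pandas, for instance, roughly the same aggregation is `pd.DataFrame(orders).groupby("city")["amount"].sum()`.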

rolae · on May 30, 2021

Andrew Kane created a dataframe gem fairly recently: https://github.com/ankane/rover

oxinabox · on May 30, 2021

Cool post but kinda dated. It's not wrong as such, but if it were written today I would expect a lot more context and depth around various aspects.

E.g. Apache Arrow.

E.g. Julia's constellation of different dataframe (Tables.jl) libraries that are mutually compatible.

E.g. efforts at standardizing dataframes in Python.

E.g. tidyverse etc. in R.

zverok · on May 30, 2021

Yeah, it is dated and naive in many ways :) Really don't know how HN works and why this one made it to the main page while some of the other stuff I write rarely does ¯\_(ツ)_/¯

(In _some_ defense of the post, at the time of writing it was targeted _only_ at the Ruby community. But still, if I'd included a "how others do it" section, it would probably carry more weight.)