Hacker News | cbetti's comments

The high numbers here are eye-opening, but the article doesn't shed any light on the process.

I'm left with the feeling that opening a restaurant is hard, but there is nothing to chew on in terms of improving the situation as a citizen or interested party.


The full study that the article links to has a more in-depth look at the processes in several cities: https://ij.org/wp-content/uploads/2021/12/Barriers-to-Busine...


Many of the counted steps are required statewide and are not in the ambit of the municipality: state licensing processes for various trades, barbers and so on.

This is the state legislature's doing.

Others are standard nationwide, and not going away:

Setting up a corporation or LLC, for example, or filing for a "doing business as" (d/b/a) name.

I am unsympathetic to counting these as steps.

I cannot get worked up about building permits. Here is why:

This is a national regime: most states operate under the International Building Code and similar electrical and other codes, and the requirements there for commercial structures are often based on factual risks and deaths from improper construction.

That means every building needs to be brought up to code when renovated, in various categories: electrical, plumbing, heating/ventilation, smoke and carbon monoxide detectors, fire code compliance, structural integrity, and more recently the energy code (typically insulation and heating/cooling related). These are essential for safety and health, and for economic well-being in the long run. Yes, these take capital. That can be 5 to 10 permits and inspections, and those requirements are not going away, nationwide.

There can be other municipal department participation for curb cuts, street access, sidewalk access and so on. Deal with it.

Zoning is municipal-level, and that requires City Council and Planning Board participation; it is not in the control of the administrators operating the regulations. This is a political level of regulation, and modifying it requires political effort, typically not in the ambit of administrators.

Other "steps" in which all fees and taxes to a municipality need to be up to date are simply good practice.

- No action if you are overdue on your real estate taxes, or have outstanding orders for compliance with health or building codes.

That is mere enforcement of existing municipal regulations. Get up to date on all of your obligations.


Not having played with SIMD much myself, does leveraging these instructions for an intensive operation like a sort push other workloads out of the CPU more aggressively than operating on 32 or 64 bits at a time would?

In other words, do you have to be more careful when integrating these wide operators to preserve some resources for other operations?


At least on Intel, AVX has its own functional units and register file, so I would say it's not a major concern. It's possible that running some AVX instructions could be almost or completely free if you weren't using execution port 5 anyway, to the extent that instruction-level parallelism can be considered "free".

If you're really taking a microscope to performance, the main hazard would be intermittently using AVX for only a few instructions, because that might lead to the CPU stopping for a few microseconds to turn the power on and off on the functional units. If you're using them heavily the overall thermal situation might cause a core or package-wide clock rate degradation, but if you have a use case for sustained AVX-512 usage this is likely to be a good tradeoff.


Pushing them out of the CPU, I don't know, but some SIMD instruction sets on some CPUs have side effects that can negatively affect the performance of other operations. For example, the use of AVX2 / AVX-512 can cause some Intel CPUs to lower their base frequency, thus reducing the performance of simultaneous operations that are not using SIMD.


Is that still true? I was under the impression that post-Haswell CPUs don’t have that particular issue.


Not for recent hardware - Ice Lake eliminated the throttling on 256b ops (they already didn't exist for certain 256b ops and all smaller ops), and reduced the throttling to almost nothing for 512b ops. Rocket Lake eliminated the throttling for 512b ops.

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...

They do use a lot of power (and as a result, generate a lot of heat), so they can still cause thermal throttling, or throttling due to power limits - but there's no more "AVX offset".


They do emit a lot of heat, which might actually throttle the CPU overall.

But to my knowledge they use different registers, and when properly pipelined they don't hog the CPU cache like unoptimized algorithms that constantly round trip to RAM.


The CPU has a certain upper bound on power, TDP. That limit is independent of whether it is reached by executing scalar code on lots of cores, or SIMD on a few. One little-known fact is that out of order CPUs burn lots of energy just on scheduling instructions, ~10x more than the operation itself. SIMD amortizes that per-instruction overhead over multiple data elements, and actually increases the amount of useful work done per joule. We're talking 5-10x increase in energy efficiency.


Right, though your use of the terms "different registers" and "properly pipelined" is quite nuanced for a non-specialist. It leaves a lot to the imagination or prior experience!

If it is a compute-heavy code that was spilling to a stack and now can fit in the SIMD registers, there could be a speed up with less memory traffic. This is analogous to a program's working set either fitting in RAM or spilling to swap while it runs, but at a different level of the memory hierarchy. In the extreme case, your working set could be a fixed set of variables in expensive loops for some sequential algorithm, and so the compiler register placement ability and number of registers available could be the difference between effectively O(1) or O(n) memory accesses with respect to the number of compute steps.

Of course, you can find such transitions between registers/L1 cache, L1/L2 cache, L2/L3 cache (if present), local/remote RAM (for modern NUMA machines), etc. This happens as you transition from a pure CPU-bound case to something with bigger data structures, where the program focus has to move to different parts of the data to make progress. Naively holding other things equal, an acceleration of a CPU-bound code will of course mean you churn through these different data faster, which means more cache spills as you pull in new data and have to displace something else. Going back to the earlier poster's question, this cache spilling can have a detrimental effect on other processes sharing the same cache or NUMA node, just like one extremely bloated program can cause a virtual memory swapping frenzy and hurt every other program on the machine.

One form of "pipelining" optimization is to recognize when your loop is repeatedly fetching a lot of the same input data in multiple iterations and changing that to use variables/buffers to retain state between iterations and avoid redundant access. This often happens in convolution algorithms on arrays of multi-dimensional data, e.g. image processing and neural nets and many cellular-automata style or grid-based simulations. Your algorithm slides a "sampling window" along an array to compute an output value as a linear combination of all the neighboring values within the window using some filtering kernel (a fixed pattern of multiplicative scalars/weights). Naive code keeps loading this whole window once for each iteration of the loop to address one output position. It is effectively stateless except for the loops to visit every output. Pipelined code would manage a window buffer and fetch and replace only the fringe of the buffer while holding and reusing inputs from the previous iterations. While this makes the loops more stateful, it can dramatically reduce the number of memory reads and shift a memory-bound code back into a CPU-bound regime.
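A toy sketch of that window-buffer idea (pure Python over a 1D array, with a plain moving sum standing in for the filter kernel): the naive version re-reads the whole window for every output, while the "pipelined" version keeps the accumulated state and only touches the fringe elements entering and leaving the window.

```python
def moving_sum_naive(data, w):
    # Stateless: re-reads all w inputs for every output position.
    return [sum(data[i:i + w]) for i in range(len(data) - w + 1)]

def moving_sum_pipelined(data, w):
    # Stateful: carries the running sum between iterations; each
    # step reads one new element and drops one old one.
    acc = sum(data[:w])
    out = [acc]
    for i in range(w, len(data)):
        acc += data[i] - data[i - w]
        out.append(acc)
    return out

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert moving_sum_naive(data, 3) == moving_sum_pipelined(data, 3)
```

The pipelined loop does O(1) reads per output instead of O(w), which is exactly the shift from memory-bound back toward CPU-bound described above.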

Other major optimizations for these N-dimensional convolutions are done by the programmer/designer: some convolution kernels are "decomposable" and the expensive multi-dimensional operation can be strength-reduced to a sequence of 1-dimensional convolutions with appropriately chosen filter kernels to produce the same mathematical result. This optimization has nothing to do with SIMD, but it reduces the work needed to embody the operation: fewer input values to read (e.g. a 1D line of neighbors instead of a 2D box) and fewer arithmetic operations for all the multiply-and-add steps that produce the output. Imagine a 2D filter that operates on an 8x8 window. The 2D convolution has to sample and combine 64 input neighbors per output, while the decomposed filter does 8 neighbors in each axis and so one quarter as many steps total after one pass for the X axis and one pass for the Y axis.
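A minimal sketch of that decomposition, using a 3x3 box filter (a separable kernel of all ones, so the weights drop out) in pure Python: the 2D version reads 9 neighbors per output, while the decomposed version does a horizontal pass of 3 reads followed by a vertical pass of 3 reads, producing identical results.

```python
def box2d(img, h, w):
    # Full 2D convolution: samples the whole 3x3 window per output.
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            out[y][x] = sum(img[y + dy][x + dx]
                            for dy in range(3) for dx in range(3))
    return out

def box_separated(img, h, w):
    # Pass 1: 1D horizontal sums; pass 2: 1D vertical sums of those.
    horiz = [[sum(row[x:x + 3]) for x in range(w - 2)] for row in img]
    return [[horiz[y][x] + horiz[y + 1][x] + horiz[y + 2][x]
             for x in range(w - 2)] for y in range(h - 2)]

img = [[(y * 7 + x * 3) % 10 for x in range(6)] for y in range(5)]
assert box2d(img, 5, 6) == box_separated(img, 5, 6)
```

The intermediate `horiz` array is the extra allocation the next paragraph mentions: the decomposed form trades memory writes for fewer reads and multiply-adds.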

Naively, decomposition is done as separate 1D passes over the data, and so it performs more memory writes and allocates an additional intermediate array between the original input and final output. This is often a big win in spite of the extra memory demands. It's a lot more coding, but this could also be "pipelined" to some degree in N dimensions to avoid pushing the entire intermediate results arrays through main memory. Approaches differ, but you can make different tradeoffs for how many intermediate results to store or buffer versus redundant calculation in a less stateful manner.

Generally, as you make the above kinds of optimizations your code becomes more "block-based", loading larger chunks of data with fewer changes to new "random" memory locations. This is very much like how databases and filesystems optimized their access to disk storage in prior decades to avoid random seeks for individual bytes or blocks and to instead use clever structures to support faster loading of sets of blocks or pages. When this is successful, your program does most of its memory access sequentially and achieves higher throughput that is bound by total memory bus bandwidth rather than random access latency limitations.

The final form of "pipelining" matters once you really hit your stride with these optimized, block-based programs. All this access again can cause cache spilling as you sustain high bandwidth memory access. But if you can provide the right hints to the CPU, or if the CPU is clever enough to detect the pattern, it can shift into a different cache management regime. Knowing that you will only access each block of data once and then move on (because you are holding the reusable state in registers), the CPU can use a different prefetching and spilling policy to both optimize for your next memory access and to avoid churning the whole cache with these values you will only read once and then ignore. This reduces the impact of the code on other concurrent tasks which can still enjoy more reasonable caching behavior in spite of the busy neighbor.


Generally yes to the first question, no to the second.

If you want your code to have low perturbance on other concurrent work done by the system, implementing it in an inefficient way doesn't usually help with that goal, since then your code will just be executing all the time because it takes a long time to finish. And you still won't have good control of the execution resources compared to using normal tools like OS scheduling policies to address the goal.


Sincere question: What benefit is there to owning equity in such a company in employee sized quantities?


Employee sized quantities should be much bigger at a Mittelstand company vs. VC-funded. You don't need thousands of people when you're going after 7-8 figure outcomes, which means each employee should be able to get better profit-sharing. You also are not diluted big-time by investors that can bully the cap table.


Presumably dividends if the owner doesn’t just capture profits via inflated salary.


Maybe dividends, maybe cash-out if there is a change in ownership


Dividends?


To focus on fewer problems.


This isn't how scaling works though. Across all applications the hot data growth outpaces the cold.

So if you're designing capacity for exponential growth, the future point at which you stop experiencing exponential growth and only have to worry about roughly linear growth is a much easier problem to solve.
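To make the arithmetic concrete, here's a toy sketch (all numbers hypothetical, not from any real deployment) of why the exponential phase dominates the capacity-planning problem:

```python
def months_until_full(capacity, usage, grow):
    # Count growth steps until provisioned capacity is exhausted.
    months = 0
    while usage < capacity:
        usage = grow(usage)
        months += 1
    return months

# Hypothetical: 100 TB of hot data today, 1000 TB provisioned.
exponential = months_until_full(1000, 100, lambda u: u * 1.1)  # +10%/month
linear = months_until_full(1000, 100, lambda u: u + 10)        # +10 TB/month

assert exponential < linear  # same headroom lasts far longer under linear growth
```

Under these illustrative rates, 10x headroom buys roughly two years of exponential growth but seven and a half years of linear growth, which is the "much easier problem" the comment describes.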


Are you more interested in fixing a process or fixing the problem?

Sending in-product feedback certainly could work because it's more likely to be seen by product management as it continues to roll in.

Support-driven product change requests are well intentioned but generally break down as a process internally. The working knowledge base and incentives are not properly aligned.


> Sending in-product feedback certainly could work

If it could work, we would have seen that happening. We have not, so we must assume it can't. Thus, asking to "send feedback in-product" is just a way to waste everyone's time. You avoid the negative stigma that is associated with knowing a problem exists and ignoring it, without having to undertake any concrete action. Corporate spin at its finest.


I think you're spot on. Self driving software seems to me like it will handle the vast majority of driving situations much better than humans could within the 5 year horizon OP is asking about.


> Self driving software seems to me like it will handle the vast majority of driving situations much better than humans could within the 5 year horizon OP is asking about.

I remember this being said as far back as a decade ago, and still, Teslas from 2022 are emergency braking on the highway.


At scale safety is just a numbers game. A car model could be 10x as safe as human drivers and still occasionally drive directly into oncoming traffic.

Tesla isn’t anywhere close to that, but self driving cars have no reason to get worse. Systems good enough for fully self driving taxi services are already on the road and the software is steadily getting better.


I know that Tesla's FSD is already safer than human drivers. But society is not ready for children dying because of an erroneous data point in the training model that we can't even debug, let alone fix.


Is this a non-sequitur or are you trying to somehow explain the difference between the two drives noted in the comment you are replying to?


Did you read that account's tweets? It's clearly a biased account with an agenda.


Patterns and automation supporting modularization haven't received the attention that patterns and automation around services have over the past 10 years.

In practice, modularization raises uncomfortable questions about ownership which means many critical modules become somewhat abandoned and easily turn into Frankensteins. Can you really change the spec of that module without impacting the unknown use cases it supports? Tooling is not in a position to help you answer that question without high discipline across the team, and we all know what happens if we raise the question on Slack: crickets.

Because services offer clear ownership boundaries and effective tooling across SDLC, even though the overheads of maintenance are higher versus modules, the questions are easier and teams can move forward with their work with fewer stakeholders involved.


If a dev team cannot discipline itself in maintaining a module, is it wise to entrust it with the responsibility of a whole service?

I'd rather work on better mentoring the individual members.


Modules have higher requirements for self-discipline than services, precisely because the boundaries are so much easier to cross.

And also because it is harder to guard a module from changes made by other teams. Both technically and politically, a service is more likely to be owned by a single team that understands it; a module is more likely to be modified by many people from multiple teams who are just guessing what it does.

