Hacker News | reinhardt's comments

Also curious why every comment mentions the number of rows as the only factor that matters. A 100M-row table with 3 integer columns is quite different from one with 50+ columns, 5 of which are text fields up to a few MB long.


Getting a cyclic import error is not a bug, it's a feature alerting you that your code structure is like spaghetti and you should refactor it to break the cycles.


That's not a problem, let alone the biggest one. You should just use relative imports explicitly.


It is a problem because the stdlib does not use relative imports for other stdlib modules, and neither do most third-party packages, so you get broken regardless of what you do in your own code.
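For concreteness, a minimal sketch of the breakage (the layout and names are hypothetical, not from this thread): a top-level module named after a stdlib module gets picked up by everyone else's absolute imports.

    # Hypothetical layout:
    #
    #   project/
    #     json.py   -> local module that shadows the stdlib's json
    #     main.py   -> this file
    #
    # The script's directory comes first on sys.path, so every absolute
    # "import json" (yours, the stdlib's, or a third-party package's)
    # now resolves to project/json.py instead of the standard library.
    import json

    json.loads('{"a": 1}')  # AttributeError: module 'json' has no attribute 'loads'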


I haven't used Airflow for years but it used to be quite clunky, not sure how much it's improved since. I'd look into Prefect and/or Dagster first, both are more modern alternatives built with Airflow's shortcomings in mind.


I'd guess career-progression points, or even keep-getting-a-paycheck points at worst.


This low-stakes exchange won't affect much; it's about putting Mr. Incomp in his place. Remember, it's a PM we're talking about, not "the Boss."


> Its a massive amount of state aggregated from billions of events that needs to be served at extremely low latency, but couldn't it be partitioned somehow???

The bidder/pacer state is not necessarily massive, and it certainly does not consist of all the gazillions of past events. Depending on the strategy/bidding model, it can range from a few MB to several GB, which fits in memory on a single beefy node.

> Google Fi/Spanner and BigTable have certainly been developed to support these issues.

I doubt any external store can be used under such tight latency constraints (2-10ms) and such high throughput (millions of RPS). Perhaps Aerospike, but even that is a stretch to put in the hot path. At this scale you're pretty much limited to keeping the state in memory and updating it asynchronously every couple of minutes/hours.
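Roughly this pattern, sketched in Python (the names and the refresh interval are illustrative, not anyone's actual system):

    import threading
    import time

    class PacerState:
        # The bid hot path only touches self._state (an in-memory dict);
        # a daemon thread periodically swaps in a fresh snapshot.
        def __init__(self, loader, refresh_interval_s=300):
            self._loader = loader                  # slow call that hits the external store
            self._interval = refresh_interval_s
            self._state = loader()                 # initial snapshot
            self._lock = threading.Lock()
            threading.Thread(target=self._refresh_loop, daemon=True).start()

        def _refresh_loop(self):
            while True:
                time.sleep(self._interval)
                new_state = self._loader()         # slow, but off the hot path
                with self._lock:
                    self._state = new_state        # atomic swap; readers never wait on I/O

        def get(self, key):
            # Called per bid request: pure in-memory lookup, no network round trip.
            with self._lock:
                return self._state.get(key)

    # Dummy loader standing in for the real fetch from the state store:
    state = PacerState(lambda: {"campaign_123": 0.42}, refresh_interval_s=300)
    print(state.get("campaign_123"))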

Source: I also work in ad tech.


Why PostgreSQL only? The mara-DB dependency [1] claims to support more.

[1] https://github.com/mara/mara-db


(author here)

Currently there is a hard dependency on Postgres for Mara's bookkeeping tables. I'm working on dockerizing the example project to make the setup easier.

For ETL, MySQL, Postgres & SQL Server are supported (and it's easy to add more).


I'm a bit confused about this. What if the target is HDFS? Why this dependency on SQL databases for ETL?


> Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs.

All of these are supported, but the scheduler is pretty much the only hard requirement.

Source: I've been running Airflow for the last two years without a worker cluster, without celery/rabbitmq installed, and sometimes without even an external database (i.e. a plain SQLite file).
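Concretely, the minimal setup looks roughly like this (config key names follow the Airflow 1.x-era layout I'm assuming here; newer releases have moved some of them, so treat it as a sketch rather than copy-paste):

    import os

    # No Celery/RabbitMQ: run tasks inside the scheduler process.
    os.environ["AIRFLOW__CORE__EXECUTOR"] = "SequentialExecutor"

    # No external database: point the metadata DB at a plain SQLite file.
    os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = "sqlite:////tmp/airflow.db"

    # After initializing the metadata DB, `airflow scheduler` (plus the
    # webserver if you want the UI) is the only process you need to run.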


Yet another reason to trim off old jobs after some point; the primary one being that nobody cares to wade through 3+ page resumes.


I think resumes should be treated less as a report card and more as a brochure. Hiring managers have little time, so keeping it focused on relevant highlights and selling the candidate for that particular job is the entire point.

A 15-page menu isn't better than a 1-page menu... A spa advertising every stone in its parking lot doesn't make you think nice things about its mud baths...


In which case you can just compare the dicts without performing the multiplication (which happens to be the costliest part for arbitrary-precision integers).
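A sketch of what "compare the dicts" means, assuming the context was the prime-product trick for anagram checking (the function name is mine):

    from collections import Counter

    def is_anagram(a: str, b: str) -> bool:
        # Counter builds the same letter -> count dict you'd otherwise feed into
        # the prime product; comparing the dicts directly skips the big-integer
        # multiplication entirely.
        return Counter(a) == Counter(b)

    print(is_anagram("listen", "silent"))  # True
    print(is_anagram("listen", "siren"))   # False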


Exactly.

