I've been in a similar slump for a while now (lectures + paper skims >> books + coding), so this is advice I'm telling myself right now. Put a stack of good books in a place where you see them several times a day. There's a good chance their presence will taunt you into reading them. Maybe charge your phone on the stack. Don't feel guilty about skipping around between books. Do feel guilty about neglecting them. I'm going to null route HN and YouTube for the remainder of November. Thanks for the question.
I've seen a very broad spectrum of research code. In general research code translates O(1e1-1e2) lines of mathematics into O(1e3-1e4) lines of code. I find mathematics easier to understand than code, so that's going to color my opinion.
My favorite research code tends to look like the mathematics it implements. And that's really hard to do well. You need to pick abstractions that are both efficient to compute and easy to modify as the underlying model changes. My favorite research code also does the reader a lot of favors (e.g. documents the shape of the data as it flows through the code, uses notation consistent with the writeup or with standard conventions in the field).
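A contrived sketch of what I mean (plain numpy, invented names): the shapes ride along in comments and the symbols match what the writeup would call them.

    import numpy as np

    def ridge_fit(X, y, lam=1e-2):
        """Ridge regression: beta_hat = (X'X + lam*I)^{-1} X'y.

        X : (n, p) design matrix
        y : (n,)   targets
        Returns beta_hat : (p,)
        """
        n, p = X.shape
        G = X.T @ X + lam * np.eye(p)  # (p, p) regularized Gram matrix
        b = X.T @ y                    # (p,)
        return np.linalg.solve(G, b)   # (p,)  solve, don't invert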
Industry research code... I'm happy to see basic things. Version control (not a bunch of Jupyter notebooks). Code re-use (not copy+paste the same thing 20x). Separation of config and code (don't litter dozens of constants throughout thousands of lines of code). Functions < 1000 lines apiece. Meaningful variable names. Comments that link the theory to the code when the code has to be complicated.
Overall it's probably most helpful to find a researcher in your field whose code you like to read, and copy the best aspects of that style. And ask readers of your code for feedback. I really enjoy reading Karpathy's code (not my field), but that may be an exception because a lot of what I've read is intended to teach a more or less codified approach, rather than act as a testbed for iteration in a more fluid design space.
> Actually it's highly usual... CUSIPs... there's nothing to stop you from setting up your own, alternative... numbering system
I don't think there's anything natural about the mandatory use of copyrighted CUSIP identifiers in regulatory reporting. When the SEC publishes its quarterly list of 13F securities, it includes a disclaimer that it does so "with permission" from the copyright holder. My city doesn't pay royalties or seek approvals when it records and processes car license plate numbers for parking enforcement. The copyright holder seems actively involved in rulemaking that has the potential to diminish the role CUSIPs play in mandatory regulatory reporting.
If you're adding more LLM integration, a cool feature might be sending the results of allow_many="left" off to an LLM completions API that supports structured outputs. E.g. imagine N_left=1e5 and N_right=1e5, but they are different datasets. You could use jellyjoin to identify the top ~5 candidates in right for each left, reducing candidate matches from 1e10 to 5e5. Then you ship the 5e5 off to an LLM for final scoring/matching.
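Rough sketch of the idea -- the jellyjoin output schema and the LLM client below are stand-ins I made up, not the actual APIs:

    import json
    import pandas as pd

    # Toy stand-in for jellyjoin output with allow_many="left": one row per
    # (left record, candidate right record, similarity score).
    cand = pd.DataFrame({
        "left_id":    [0, 0, 0, 1, 1],
        "left_name":  ["Acme Corp"] * 3 + ["Globex LLC"] * 2,
        "right_name": ["ACME CORPORATION", "Acme Holdings", "ACM Inc", "Globex", "GlobeX Ltd"],
        "score":      [0.95, 0.81, 0.40, 0.92, 0.88],
    })

    def call_llm_structured(prompt: str) -> dict:
        """Stand-in for a completions call with structured (JSON) output."""
        return {"matches": []}  # real version: hit your provider with a JSON schema

    def adjudicate(cand: pd.DataFrame, top_k: int = 5) -> list[dict]:
        """Keep the top-k candidates per left record, ship the batch to an LLM."""
        top = (cand.sort_values("score", ascending=False)
                   .groupby("left_id").head(top_k))
        payload = [{"left": g["left_name"].iloc[0], "candidates": g["right_name"].tolist()}
                   for _, g in top.groupby("left_id")]
        prompt = ("For each left name pick the best candidate (or null); reply as JSON "
                  "{'matches': [{'left': ..., 'best': ..., 'confidence': ...}]}\n"
                  + json.dumps(payload))
        return call_llm_structured(prompt)["matches"]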
I wonder how much ΔT you need at the crust to meaningfully change Earth's magnetic field by altering convection patterns in the outer core. I don't know enough physics to attempt an answer.
The outer core is 2,890 km (~1,800 miles) below the Earth's crust, with the mantle in the way. The crust itself is only about 30 km thick. [https://phys.org/news/2017-02-journey-center-earth.html] The crust is basically a thin layer of slag on top of a giant ball of molten everything.
Even at million+ year timescales, I can’t see any way the temperature of the upper crust could matter to the core at all - even if the crust was at absolute zero.
Dirt insulates relatively well, and the amount of thermal mass present is mindboggling.
> would be a rounding error above absolute zero anyway
Kind of joking: unless there are nonlinear effects near 300K? Fig 4 [1] seems to suggest that the thermal diffusivity of the mantle grows very fast as temperature declines past 300K... but the data stop at 200K.
Reason for initial comment: we could probably set up a spherical heat equation to guess how crust cooling would change heat conduction at the outer core. But I have absolutely no idea how to reason about changes in heat conduction affecting the convection dynamics that generate the field. I was silently hoping for one of the domain experts lurking this forum to see it and share wisdom. (But overall it was a silly question, I know).
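For scale only (and very much not the convection question): a pure-conduction back-of-envelope with order-of-magnitude numbers, ignoring mantle convection entirely.

    # Characteristic conduction timescale across the mantle, t ~ L^2 / alpha.
    L     = 2.9e6   # m, rough distance from the surface to the core-mantle boundary
    alpha = 1e-6    # m^2/s, order-of-magnitude thermal diffusivity of rock

    t_years = (L**2 / alpha) / 3.15e7
    print(f"{t_years:.1e} years")   # ~2.7e11 -- far longer than the age of the Earth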
Calculating or simulating how Earth's magnetic field behaves or is generated is quite a complex task, so I'm doubtful we can usefully estimate it to such precision. It would be interesting though.
We know that if the convection in the outer core stops, the Earth's magnetic field stops, and removing enough heat from the core will stop the convection.
I've seen a confident estimate in the form of a calculation. They know what kind of compounds (term?) are in the outer core and they know the minimum temperature those compounds need to be at to be free-flowing enough to sustain the field. And I'm pretty sure we know the current temperature of the outer core.
My memory is that the calculation found that if humanity switched to geothermal for all its energy needs, the core would cool enough for the magnetic field to stop in only about 1,000 years, but I am not sure.
(We should definitely deploy geothermal in the Yellowstone caldera though long enough to cool it down enough so that it will not erupt again.)
That is definitely not true hahaha. The outer core is several thousand km down, and the crust is only 30km thick. And we have the entire mantle below us.
Humanity could max out geothermal for a million years and never make a dent.
Whoa, this is a bit scary. As mentioned earlier, geothermal should basically be tapped only after other energy sources, to cover the shortfall.
Talk to people in your department about where students who enter industry end up working after graduation. Your university may have a kind of "jobs fair" in Autumn where companies come to recruit. Look into those companies and find out what skills they seem to like.
For what it's worth: I ended up going the quant/finance route (as a "regular guy" with no meaningful accomplishments). If I could start over I would try to do something involving data analysis and biology. I think RNA sequencing is on an exp(-a*t) cost curve, and it feels like this is a domain where data analysis could produce something of greater value than slightly more efficient asset prices.
Cool, EDGAR is an amazing public service. I think they use Akamai as their CDN so the downloads are remarkably fast.
A few years ago I wrote an SGML parser for the full SEC PDS specification (super tedious). But I have trouble leveraging my own efforts for independent research because I don't have a reliable securities master to link against. I can't take a historical CUSIP from 13F filings and associate it to a historical ticker/return. Or my returns are wrong because of data errors so I can't fit a factor model to run an event study using Form 4 data.
I think what's missing is a serious open source effort to integrate/cleanse the various cheapo data vendors into something reasonably approximating the quality you get out of a CRSP/Compustat.
Securities master to link against - interesting. Here's a pipeline off the top of my head (rough pandas sketch after step 3):
1. Get CUSIP, nameOfIssuer, titleOfClass using the Institutional Holdings database
2. Use the company metadata crosswalk to link CUSIP + titleOfClass to nameOfIssuer to get cik
https://github.com/john-friedman/datamule-data/blob/master/d...
(recompiled daily using GH actions)
3. Get e.g. us-gaap:EarningsPerShareBasic from the XBRL database. Link using cik. Types of stock might be a member - so e.g. Class A, Class B? Not sure there.
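Something like this in pandas (file and column names are illustrative, not the actual datamule schema):

    import pandas as pd

    holdings  = pd.read_csv("institutional_holdings.csv")  # cusip, nameOfIssuer, titleOfClass
    crosswalk = pd.read_csv("company_crosswalk.csv")        # cusip, cik, ticker, ...
    xbrl      = pd.read_csv("xbrl_facts.csv")               # cik, tag, value, period_end

    # Step 2: CUSIP -> cik (titleOfClass would help disambiguate Class A/B share classes)
    linked = holdings.merge(crosswalk[["cusip", "cik", "ticker"]], on="cusip", how="left")

    # Step 3: pull a fundamental fact from the XBRL table and join on cik
    eps = xbrl[xbrl["tag"] == "us-gaap:EarningsPerShareBasic"]
    panel = linked.merge(eps, on="cik", how="left")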
For form 4, not sure what you mean by event study. Would love to know!
Event study: A way to measure how returns respond to events. Popularized by Fama in "The Adjustment of Stock Prices to New Information" but ubiquitous in securities litigation, academic financial economics, and equity L/S research. The canonical recipe is MacKinlay's "Event Studies in Economics and Finance". Industry people tend to just use residual returns from Axioma / Barra / an in-house risk model.
So let's say your hypothesis is "stock go up on insider buy". Event studies help you test that hypothesis and quantify how much up / when.
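A bare-bones version of that recipe, to make it concrete (market model with naive standard errors; real studies worry about overlapping events, clustering, and a proper risk model):

    import numpy as np
    import pandas as pd

    def insider_buy_car(stock_ret: pd.Series, mkt_ret: pd.Series, event_dates,
                        est_win=250, evt_win=(-1, 5)):
        """Average cumulative abnormal return (CAR) around events, market-model style.

        stock_ret, mkt_ret : aligned daily returns indexed by date
        event_dates        : e.g. Form 4 transaction dates
        """
        cars = []
        for d in event_dates:
            i = stock_ret.index.get_indexer([d])[0]
            if i < est_win or i + evt_win[1] >= len(stock_ret):
                continue  # date not found or not enough history around the event
            est = slice(i - est_win, i + evt_win[0])          # estimation window
            beta, alpha = np.polyfit(mkt_ret.iloc[est].values,
                                     stock_ret.iloc[est].values, 1)
            evt = slice(i + evt_win[0], i + evt_win[1] + 1)   # event window, days -1..+5
            ar = stock_ret.iloc[evt] - (alpha + beta * mkt_ret.iloc[evt])  # abnormal returns
            cars.append(ar.sum())
        cars = np.array(cars)
        return cars.mean(), cars.std(ddof=1) / np.sqrt(len(cars))  # mean CAR and naive s.e.

A mean CAR a couple of standard errors above zero is the "stock go up on insider buy" result; how it builds or decays across the event window tells you the "when".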
Cool metadata table, I'm curious about the ticker source (Form4, 10K, some SEC metadata publications?).
My comment about CUSIP linking was trying to illustrate a more general issue: it's difficult to use SEC data extractions to answer empirical questions if you don't have a good securities master to link against (reference data + market data).
Broadly speaking a securities master will have 2 kinds of data: reference data (identifiers and dates when they're valid) and market data (price / volume / corporate actions... all the stuff you need to accurately compute total returns). CRSP/Compustat (~$40k/year?) is the gold standard for daily frequency US equities. With a decent securities master you can do many interesting things. Realistic backtests for the kinds of "use an LLM to code a strategy" projects you see all over the place these days. Or (my interest) a "papers with code" style repository that helps people learn the field.
What you worry about with bad data is getting a high t-stat on a plausible-sounding result that later fails to replicate when you use clean data (or worse, try to trade it). Let's say your securities master drops companies 2 weeks before they're delisted... just holding the market is going to have serious (spurious) alpha. Ditto if your fundamental data reflects restatements.
On the reference data front, the Compustat security table has (from_date, thru_date, cusip, ticker, cik, name, gics sector/industry, gvkey, iid) etc. all lined up and ready to go. I don't think it's possible to generate this kind of time-series from cheap data vendors. I think it could be possible to do it using some of the techniques you described, and maybe others. E.g. get (company-name, cik, ticker) time-series from Form4 or 10K. Then get (security-name, cusip) time-series from the 13F security lists the SEC publishes quarterly (PDFs). Then merge on date/fuzzy-name. Then validate. To get GICS you'd need to do something like extract industry/sector names from a broad index ETF's quarterly holdings reports, whose format will change a lot over the years. Lots of tedious but valuable work, and a lot of surface area to leverage LLMs. I dunno, at this point it may be feasible to use LLMs to extract all this info (annually) from 10Ks.
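The fuzzy-name merge step could look something like this (column names, data and the cutoff are invented; real matching needs heavy name normalization and date alignment):

    import difflib
    import pandas as pd

    form4  = pd.DataFrame({"company_name": ["APPLE INC", "MICROSOFT CORP"],
                           "cik": [320193, 789019], "ticker": ["AAPL", "MSFT"]})
    sec13f = pd.DataFrame({"security_name": ["APPLE INC COM", "MICROSOFT CORP COM"],
                           "cusip": ["037833100", "594918104"]})

    def best_match(name, choices, cutoff=0.7):
        hit = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
        return hit[0] if hit else None

    sec13f["company_name"] = sec13f["security_name"].apply(
        lambda n: best_match(n, form4["company_name"].tolist()))
    linked = sec13f.merge(form4, on="company_name", how="left")  # cusip -> cik/ticker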
On the market data front, the vendors I've seen have random errors. They tend to be worst for dividends/corporate actions. But I've seen BRK.A trade $300 trillion on a random Wednesday. I haven't noticed the errors being correlated across vendors, so I think this one might be easy to solve. Cheap fundamental data tends to have similar defects to cheap market data.
Sorry for the long rant, I've thought about this problem for a while but never seriously worked on it. One reason I haven't undertaken the effort: validation is difficult so it's hard to tell if you're actually making progress. You can do things like make sure S&P500 member returns aggregate to SPY returns to see if you're waaay off. But detailed validation is difficult without a source of ground truth.
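The coarse SPY-style check might look like this (index weights, frame layout, and the tolerance are placeholders):

    import pandas as pd

    def members_vs_etf(member_ret: pd.DataFrame, weights: pd.DataFrame,
                       etf_ret: pd.Series, tol=0.0015):
        """member_ret, weights: (date x ticker) frames; etf_ret: ETF daily returns."""
        implied = (member_ret * weights).sum(axis=1)  # weighted sum of member returns
        gap = (implied - etf_ret).abs()
        return gap[gap > tol]  # days where your data disagrees badly with the ETF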
re: metadata table - it's constructed from the SEC's submissions.zip, which they update daily. What my script does is download the zip, decompress just the bytes where the information (ticker, sic code, etc) is stored, then convert into a csv.
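A naive sketch of the idea (this one decompresses each member fully instead of only the needed bytes, and the field names are illustrative -- check the actual JSON):

    import csv, json, zipfile

    with zipfile.ZipFile("submissions.zip") as zf, \
         open("metadata.csv", "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(["cik", "name", "tickers", "sic"])
        for info in zf.infolist():
            fn = info.filename
            if not (fn.startswith("CIK") and fn.endswith(".json")) or "-submissions-" in fn:
                continue  # skip the paginated filing-history shards
            d = json.loads(zf.read(info))
            w.writerow([d.get("cik"), d.get("name"),
                        ";".join(d.get("tickers") or []), d.get("sic")])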
And yep! Agree with most of this. Currently, I'd say my data is at the stage where it's useful for startups / PhD research and some hedge funds / quant stuff (at least that's who is using it so far!)
I've seen the trillion dollar trades, and they're hilarious! You see it every so often in Form 3,4,5 disclosures.
re: LLMs, this is something I'm planning to move into in a month or two. I'm mostly planning to use older NLP methods, which are cheaper and faster, while using LLMs for specific stuff like structured output. E.g. WRDS BoardEx data can be constructed from 8-K Item 5.02 filings.
I think the biggest difficulty wrt data is just that the raw data ingest is annoying AF. My approach has been to make each step easy -> use it to build the next step.
> If options & futures are more liquid than the underlying, someone will be tempted to nudge the underlying.
This is a weird statement. Why would liquidity matter here? As a point of reference, there are generally two types of options: (1) options that depend directly on the underlying, like a Tesla stock option, or (2) options that depend indirectly on the underlying, like options on S&P 500 index futures. Liquidity in category 2 is normally tiny, and category 1 normally has far less liquidity than the underlying.
Why is the adjective "more" important here? Even if less, the opportunity to profit is still good, assuming that one chooses the path of market manipulation.
What matters is the volume rather than the liquidity per se, but the two are generally well correlated. The point is that moving a market costs money: making a trade moves the market against that trade, so even if someone is deliberately trying to move a market, they'll pay more than they could ever hope to recoup. The exception is when there's a derivative market with more volume than the underlying. In that case profitable manipulation becomes possible: you can spend to move the underlying, losing money, but make more on the derivatives where you'd bought the other side.
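Cartoon numbers just to show the asymmetry (nothing here is calibrated to any real market):

    # All numbers invented.
    move          = 0.01  # you engineer a +1% move in the underlying
    cost_to_move  = 2e6   # $ lost to impact/unwind while pushing the underlying up 1%
    deep_notional = 5e8   # $ delta you can quietly build when the derivative market is deep
    thin_notional = 1e8   # $ delta available when the derivative market is thin

    print(deep_notional * move - cost_to_move)  # +3e6: the push pays for itself
    print(thin_notional * move - cost_to_move)  # -1e6: it doesn't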
> What is the difference between volume and liquidity for the purpose of this discussion?
You're the one asking "Why would liquidity matter here?", you tell me. Like I said, they're correlated well enough that it doesn't really matter as far as I can see.
I have a suspicion this has been happening with a particular MAG7 stock these last few months, but I can't fully convince myself such a large stock can be manipulated like that.
Of course large stocks can be manipulated like this. But there is also normal oscillation, repeatable human psychological behavior (e.g. momentum, mean reversion), and just crowd murmuration. Be careful using your eyes to glean patterns from graphs; you may find that if you try to monetize them, they vanish before you.
Except Tesla is not just a car company but a taxi company and an energy company and an AI company and a robotics company… 6-7 more decades and the orders will be pouring in :)
> but I can't fully convince myself such a large stock can be manipulated like that.
I have the same initial reluctance to believe it that you do, but less so when I remind myself that we live in a world where the Social Security Administration sent out a mass email praising the passing of the "big beautiful bill".
I think our built-up understanding of how the US government functions at a baseline has not caught up to recent events. Especially in regards to how much regulatory bodies are doing their traditional jobs vs being forced to sit on their hands, or in some cases just not even existing anymore.
As someone living in a country with very weakened democracy, getting emails from the government was the point I realized things were really messed up. If this is now happening in the USA, well, good luck.
We got an email from a government agency that was weakened by the passage of a new law, celebrating the passage of said law, so yeah, that's not looking great.