You can find GPT-2's training dataset list - at a high level - in the GPT-2 repository on GitHub: https://github.com/openai/gpt-2/blob/master/model_card.md#da... However, OpenAI goes dark after that regarding the 'data soup' that was fed into their LLMs. In general, starting around 2019 and definitely by 2020, you'll notice that research labs became much less forthcoming about the data that went into their models. As far as I'm aware, BookCorpus is one of the more commonly used 'large books datasets' that have been utilized in recent years to train large language models (LLMs) like generative pretrained transformers: https://12ft.io/proxy?q=https%3A%2F%2Ftowardsdatascience.com...
I hadn't heard of this website, SiliconAngle.com, before this week, but they interviewed someone from the company that I work at (not for this Neeva article; for a different article), so they're actually a real news reporting organization. I was reading that article with my colleague's interview this week when I saw the Snowflake + Neeva article title in the sidebar on SiliconAngle.com.
(I don't have a subscription to The Information, so unfortunately I cannot read that article's whole text. If someone with a subscription to The Information could summarize that article and share their summary with the community here I'd appreciate it!)
The signal to noise ratio in this comment is pretty low haha. The only relevant information here is
a.) the headline - "Snowflake in Talks to Buy Search Startup Neeva in AI Push"
b.) that you're not sure how reputable the site is.
Instead, you have included several details about how you were reading some other article, leading you to think they are reputable. Then reading (the same? another?) article about a colleague of yours (okay?) where you found a (relevant?) article on that site. You link it, but don't summarize or even paste the headline and go on to discuss _yet another_ organization cited _in_ the article?
I'm sorry - not trying to be rude. It's just very jarring to me when people write in this manner where they are explaining everything _but_ the important parts.
If we’re talking about signal to noise, comments that do nothing but critique prose don’t help matters. But if it’s that important to you to be a bit rude to someone you think doesn’t write well, be my guest.
Mostly my intent was to clear up confusion for other users. I'm certainly not making any judgements; there’s a variety of reasons one would write this way, and the link referenced is indeed useful (you just wouldn’t know it until you clicked on it).
At the bottom of the chart/table you can see (a) the source of the data points as well as (b) the last updated date. For the page today the text reads as follows, so the data was last updated on Friday, the 24th of March:
In this post I've read A LOT of incorrect population decline estimates. Here is a paper from The Lancet - not retracted, that I know of - that charts the populations of the present day through 2100 for almost all countries in the world:
"Fertility, mortality, migration, and population scenarios for 195 countries and territories from 2017 to 2100: a forecasting analysis for the Global Burden of Disease Study"
Click on the large 'View PDF' button to read the paper; there is no paywall.
The wildly incorrect statement that China's population will halve by 2035 is very much off base:
> The reference projections for the five largest countries in 2100 were India (1·09 billion [0·72–1·71]), Nigeria (791 million [594–1056]), China (732 million [456–1499]), the USA (336 million [248–456]), and Pakistan (248 million [151–427]).
China will become the largest economy starting around 2035 but the US is once again forecasted to become the largest economy starting around 2098, according to that study.
With the semiconductor goings-on (both by the United States as well as the EU) over the last six or so months, I think this changes the calculus on this, though.
---
I am wholly against child labor in all its forms and posters who are arguing that bagging groceries part-time after school as a middle-class kid is somehow equivalent to working in a meatpacking plant (a slaughterhouse) or on a construction site are either woefully naive or are arguing in bad faith.
ProPublica has some consistently high-quality reporting on meatpacking as an industry from all kinds of perspectives - the (mis)management of the plants during the pandemic, the hiring of children to work there, the horrid safety conditions, etc.
I'm not sure if ProPublica has a free, public dataset on meatpacking plants but they generally have datasets that you can purchase that pertain to particular areas of reporting: Medicare and Medicaid overbilling / fraud; repeated pollution violators who keep paying fines instead of stopping polluting; etc. If anyone knows of a dataset on meatpacking plants, health and/or safety violations, etc. please do share the link to the dataset.
It's kind of ironic for some people in this thread to dismiss meatpacking jobs as "low value" when they have been crucial to keeping the supply chain functional. Severe chicken shortages around 2020-2021 were mainly caused by workers in that industry falling ill, because the work conditions there are ideal incubators for respiratory illnesses (cold temperature, indoor environment, lots of people inside the same space, etc.). These jobs exist so anyone can walk into a grocery store and have an obscene selection of animal protein available for purchase without ever having to touch an animal. They are crucial to the modern consumer society, and the way these workers have been mistreated has been awful enough without adding god damn child labour to the mix. We need far more regulatory pressure on the industry, not less.
The argument that more pay & benefits for these workers will make meat more expensive is also nothing but a scare tactic. There is already so much automation that the increased cost will hardly be noticeable when it is distributed among purchasing units at the consumer end.
Note that 'Yahoo! News' - f/k/a the Verizon Media Group - was bought by Apollo Global Management a few years ago, so perhaps Apollo is short SVB and that's why 'Yahoo! News' is amplifying this narrative about the CAO having worked at Lehman until 2007; recall that Lehman imploding in 2008 is largely seen as one of the catalysts for the 2008 Global Financial Crisis.
I think they mean Transformers in the Vaswani et al 'Attention is all you need' paper, not Generative Pretrained Transformers, specifically? Paper link below:
For some papers on attention mechanisms from before the 2017 'Attention is all you need' paper, check out that paper's references. Chris Manning's 2015 paper covers attention mechanisms. And so do a few other researchers from that mid-2010s time period:
[21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[22] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
Does anyone have a good alternative to Privacy.com where your virtual credit card transaction data isn't sold to Wall Street? If you're unfamiliar with what a "virtual [credit] card" is, here's the page from Privacy.com's website: https://privacy.com/virtual-card I use the Privacy app on my mobile phone to create virtual cards (primarily for work subscriptions). Pro-tip: since each Privacy card can have its own name, put a tag such as `[WORK_RECURRING]` into the card name and then you can search your email inbox for `[WORK_RECURRING]`, quickly and easily finding all of the transactions / charges that you may want to submit to your workplace for reimbursement.
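To make the tagging tip concrete, here is a rough Python sketch of the same idea applied to an exported inbox. The message structure (dicts with `subject` and `body` keys) and the sample receipts are hypothetical, made up for illustration; this is not anything Privacy or an email provider actually exposes.

```python
TAG = "[WORK_RECURRING]"  # the tag placed in the card name


def find_tagged(messages, tag=TAG):
    """Return the messages whose subject or body contains the literal tag."""
    return [m for m in messages if tag in m["subject"] or tag in m["body"]]


# Hypothetical receipts, as they might look after exporting an inbox.
messages = [
    {"subject": "Receipt", "body": "Card 'Cloud Hosting [WORK_RECURRING]' was charged $20.00"},
    {"subject": "Receipt", "body": "Card 'Streaming' was charged $9.99"},
]

reimbursable = find_tagged(messages)
for message in reimbursable:
    print(message["body"])
```

A plain substring match is used on purpose: the square brackets would need escaping in a regular expression, and the tag is a fixed literal anyway.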
Privacy is owned / created by Lithic, but if you look at Lithic's investors you'll see that the plurality of the company's investors are in the private equity or VC space: Bessemer Ventures, Tusk Partner Ventures, Index Ventures, etc. You can see the Privacy.com / Privacy mobile app's funders here:
https://www.crunchbase.com/organization/lithic-pay
Thus, I have no doubt that my transactions on the cleverly named Privacy app are being gifted or sold to Wall Street so that hedge funds can squeeze out a few additional drops of 'signal' from consumer purchase pattern data that would otherwise remain dark. (I'd imagine that many folks use the Privacy app to buy things that they'd rather not have show up on their regular credit card bills: 'adult websites', marijuana or tobacco products, etc.)
So, two questions:
(1) Does anyone have a privacy-respecting alternative to Privacy.com's virtual credit cards?
(2) Does anyone know of a recent blog post where these virtual credit card services are compared / contrasted by
- the services that they offer,
- the cost: free, paid, etc.,
- the terms of service: how your data is re-sold / who your data is transmitted to
It doesn't block Wall Street from knowing what you're buying, but at least it's likely got one (or more) fewer middlemen looking at all your transactions.
Once I saved a Capital One card in Google Chrome, Payment Methods in Chrome Settings offered a Virtual Card On/Off radio button. If switched on, Chrome directly generates a virtual card, removing the need for the Eno extension.
I can confirm this works pretty well. The Capital One mobile app will also give you a single virtual card number (without the merchant lock or any other extra features) if you don’t want the browser extension or just need it once.
Citi also offers unlimited virtual card numbers for credit cards, and it is a bit easier than Eno since you can manage directly from the website without needing to install anything.
Capital One lets you manage them on their website as well. They just really seem to be pushing their extension and Chrome integration, neither of which I care much about, but the functionality is there (albeit well-hidden).
I would bet that all of your electronic transactions end up in some pool of data, no matter what you try. I believe only cash at a swap meet while wearing dark sunglasses and a hat is really private.
I was writing a long comment up top about this, but instead I'll reply here. I'm a person with a background in public policy, especially around poverty alleviation efforts. I think that the removal of non-compete clauses has to do with employees who you wouldn't normally think of as being subject to non-compete clauses... namely, service industry workers. Apparently 1 in 6 are subject to non-compete clauses: https://thecounter.org/biden-targeting-non-compete-agreement... This industry has been 'suffering from' high vacancy rates since the start of the pandemic because the industry has low wages, minimal-to-non-existent benefits, generally no paid sick days, no advance notice of your work schedule (thus it becomes difficult or impossible to attend college or work a second or third job), etc.
I can see how non-compete clauses are especially damaging to a group of workers who are generally less able to afford legal remedies. (I'll show some hard data points on non-compete clauses by pay and by education level obtained in a few paragraphs.)
If I had to guess I would say that non-compete clauses are being removed now because there's a "worker shortage": 1M dead from covid, some percentage (50%?) of whom were in the workforce, plus restricted immigration - legal and otherwise - for the last 3+ years (and before that a decline in immigration due to the policies of the former guy). Why the quotation marks around worker shortage? Basically, the service industry businesses want workers but hardly anyone wants to work in the service industry because the pay's bad, there are often no benefits, in many states you don't know your schedule until the day of (which makes planning for childcare, attending college, etc. damn near impossible), etc.
So if I had to guess this is the federal government's way of attempting to address the "labor shortage" in the service industry across the United States as well as allow people in white collar jobs to switch into new roles. I would bet that most folks who fall under the 'knowledge worker' class of employment know that their company's non-compete clause is pretty much non-enforceable, but ask your average restaurant worker who is under such a clause and I bet that they believe that the non-compete clause _is_ enforceable.
From this report from 2015, it looks like ~18% of all US workers are under a non-compete clause in their current role, with ~15% of workers without a college degree being under a non-compete clause and roughly the same percentage of workers with an annual wage of <$40,000 being subject to a non-compete clause (Rf. page 7 of 36): https://home.treasury.gov/system/files/226/Non_Compete_Contr... In that same document on page 16 you'll note that California, Oklahoma and North Dakota have the 'least enforcement' of such clauses. I suspect that the oil and gas industry in OK and ND enjoys not paying for training of employees, so if your employee can be trained at a competitor and then jump to your place of employment, fully trained / ready to work, that seems to be what those states are looking for. (Yep, in large swaths of ND and OK over 20% of a county's employees are employed in the petroleum extraction industries: https://www.ers.usda.gov/data-products/chart-gallery/gallery... )
TL;DR: Fifteen to twenty percent of all Americans are currently working under non-compete clauses (with 1 in 6 food service industry workers being subject to non-compete clauses). Thousands of jobs are going unfilled in the service industry as well as in white collar, 'knowledge worker' domains. By removing the ability of employers to create and enforce non-compete clauses this should, in theory, 'free up' around 20% of the workforce to change jobs. In theory, most of these workers would be changing jobs for factors such as a more flexible work schedule (advance notice in the case of service industry workers; WFH for white-collar workers), benefits and sick days, and increased wages. My (admittedly cynical?) take on this is that by freeing up 20% of the workforce to switch jobs the federal (and state) governments are hoping that they can get away without any increased spending toward social services and instead can just tell people 'Well, go look for and get a better [paying] job! What's stopping you? Certainly not a non-compete!' Also, by allowing a 'great migration' into new roles the federal government can get a rough tally as to how many immigrants they'll need to let in via the skilled (H1B, NAFTA, etc.) and unskilled (EB3) visa programs; it's my opinion from looking at state- and federal-level labor statistics over the past 3+ years that the data is rather 'noisy', and removing non-compete agreements should make it easier to get a "closer to reality" tally of how many workers the US will 'need' to import to create and maintain full employment in various skilled and unskilled industries.
One thing that I'm not seeing mentioned in any capacity in the comments on this HN page is the role of institutional investors. Prior to 2008 'single family homes' did not exist as an asset class that the asset management companies - Blackstone, etc. - were able to invest their clients'/investors' money into. Long story short, the Savings & Loan implosion of the 1980s in the United States soured investors on SFHs. But when 2008 rolled around asset prices were depressed; the banks were flush with cash thanks to a bailout bonanza, and thus began the bulk purchasing of residential real estate by asset management firms.
In 2016, 2017, 2018 we were averaging around 50,000 to 60,000 homes being bought each quarter by institutional investors. Then, when the pandemic fun bucks got released (and I'm not talking about the $1200 'relief' checks that went out) we were looking at 80,000 to 90,000 homes bought by institutional investors each quarter in Q2 2021 and Q3 2021, respectively.
Add in Airbnb / the short-term rental markets decimating housing stock in many cities and countries, plus intentionally restrictive zoning ordinances in the US and in Canada, and it's no wonder that North America's housing market looks the way it does in 2022.
Also from that Redfin article:
> Investors Have the Highest Market Share in Atlanta, Phoenix
>
>In Atlanta, nearly one-third (32%) of homes that sold in the third quarter were purchased by investors—the highest share of the 40 U.S. metropolitan areas Redfin analyzed. Next came Phoenix (31.7%), Charlotte, NC (31.5%), Jacksonville, FL (28.3%) and Miami (28.1%).
>
> Atlanta also saw the largest year-over-year gain, with investor market share rising to 32% in the third quarter from 12.9% a year earlier (+19.1 ppts). The second-biggest jump was in Charlotte (+18.2 ppts), followed by Phoenix (+17.7 ppts), Jacksonville (+15.2 ppts) and Las Vegas (+14.6 ppts).
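For anyone puzzling over the 'ppts' figures in that quote: a percentage-point change is a simple difference between two shares, which is not the same thing as the relative (percent) change. A quick Python sketch using the Atlanta numbers from the quote; the function names are just mine for illustration:

```python
def ppt_change(before_pct, after_pct):
    """Percentage-point change: the plain difference between two shares."""
    return round(after_pct - before_pct, 1)


def relative_change(before_pct, after_pct):
    """Relative change: how much the share grew compared to where it started."""
    return round((after_pct - before_pct) / before_pct * 100, 1)


# Atlanta investor market share, from the quoted Redfin figures.
atlanta_before, atlanta_after = 12.9, 32.0

print(ppt_change(atlanta_before, atlanta_after))       # 19.1 percentage points
print(relative_change(atlanta_before, atlanta_after))  # 148.1 percent
```

So the "+19.1 ppts" in the quote means investor share in Atlanta roughly two-and-a-half-folded year over year.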
I'm curious (and hopeful) to see how the Canadian situation plays out. I hope that increased enforcement of anti-money laundering laws will also result in a return of many thousands of single-family housing units to the market. If the Canadian law works well it would be encouraging to see it applied in the United States, with additional funding at the federal level for the construction of dense apartment housing in cities near transportation hubs as well as dense housing in suburbs/exurbs with the creation of mini walkable downtowns in those denser pockets.
At my alma mater I remember the large-scale Google book scanning devices and what a herculean effort it was to digitize the books of the largest university library system - University of Michigan - although only 7M texts from the entire collection of ~16 million texts (https://en.wikipedia.org/wiki/University_of_Michigan_Library) were digitized. I too was curious about the state of the Google Books project: https://www.edsurge.com/news/2017-08-10-what-happened-to-goo...
This is an interesting piece of ephemera from 2005, when Google started digitizing books at UMich: https://apps.lib.umich.edu/files/services/mdp/faq.pdf
As far as I recall, the Books project allowed the early n-grams functionality to be built out: https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-...
The Google Books Ngram Viewer tool is actually still in existence; you can play around with it here: https://books.google.com/ngrams/graph?corpus=0&content=Vorsp...
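If you haven't run into n-grams before, the basic idea is just counting runs of n consecutive words. Here's a toy Python sketch with a made-up sentence; it only illustrates the concept and is in no way how Google actually built its n-gram corpus (real pipelines need proper tokenization, OCR cleanup, per-year counts, etc.).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy text; whitespace splitting stands in for real tokenization.
text = "the cat sat on the mat and the cat slept"
tokens = text.lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Swap the 2 for a 3 to get trigrams, and so on.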