I realized that we had Hacker News data for every day, so I computed the top PDFs for my own usage and wanted to share them with you guys.
My side project, Polar (https://getpolarized.io/), is used for managing PDFs and other reading, so you might want to check that out too. It's basically a tool for researchers, students, or anyone passionate about long-term education who reads a great deal of technical and research material.
PDFs are supported, but we also cache web page content for offline reading. We also support Anki sync, so you can create flashcards and review them in Anki so that you never forget what you've read.
EDIT: Awesome! This landed on the home page in less than 15 minutes and is now #2. Super excited you guys found this helpful. Great to contribute back to such an awesome community!
The correct link seems to be http://cliffc.org/blog/wp-content/uploads/2018/05/2018_AWarO...
The same site also has a blog post that goes into this without being just a presentation-aid PDF: http://cliffc.org/blog/2017/07/30/introverts-emotional-proce...
With Twitter Music shut down, I was looking back on their acquihire of WeAreHunted, a music ranking service that was, at its core, a crawler indexing torrents/tumblr/soundcloud to find what was up and coming. As I pondered this, I wondered how they would normalise the data.
My main question: how much difficulty do you encounter indexing a social site? I imagine Tumblr, Facebook, and other sites have a constant stream of new content appearing at arbitrary intervals (posts, comments, etc.), and I don't imagine there are RSS feeds to diff here. So how does it work?
Basically, you have link rank build out your crawl frontier, and then an incremental ranking algorithm re-ranks the frontier.
The problem is that latency is a major factor.
Many of the top social news posts start from some random user that witnesses something very interesting and then that shoots up rapidly.
Our goal basically has to be to index anything that's not spam and has a potential for being massive.
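A minimal sketch of that loop in Python — the class and the scoring here are illustrative only, not our actual frontier code:

    import heapq
    import time

    class CrawlFrontier:
        """Priority queue of URLs ordered by a link-rank-style score.

        Re-ranking is incremental: new signals (inlinks, social velocity)
        are folded in as they arrive, and stale heap entries are lazily
        discarded on pop instead of re-scoring the whole frontier.
        """

        def __init__(self):
            self._heap = []    # entries: (negated score, enqueue time, url)
            self._scores = {}  # current best-known score per url

        def add(self, url, score):
            # Keep only the highest score seen for this URL.
            if score <= self._scores.get(url, float("-inf")):
                return
            self._scores[url] = score
            heapq.heappush(self._heap, (-score, time.time(), url))

        def update(self, url, signal):
            # Incremental re-rank: bump the score and push a fresh entry;
            # the stale entry is skipped when it surfaces in pop().
            self.add(url, self._scores.get(url, 0.0) + signal)

        def pop(self):
            # Next URL to crawl: highest current score, skipping stale entries.
            while self._heap:
                neg_score, _, url = heapq.heappop(self._heap)
                if self._scores.get(url) == -neg_score:
                    del self._scores[url]
                    return url
            return None

    frontier = CrawlFrontier()
    frontier.add("https://example.com/post/1", 0.2)
    frontier.update("https://example.com/post/1", 5.0)  # post starts going viral
    print(frontier.pop())                               # gets crawled first

The update path is what addresses the latency problem: a post from a random user starts near the bottom of the frontier, and incoming signals promote it without a full re-rank.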
Additionally, a lot of the old school Google architecture applies. A lot of our infra is devoted to solving problems that would be insanely expensive to build out in the cloud.
We keep re-running the math, but purchasing our infra on Amazon Web Services would run roughly $150-250k per month; we're doing it for about $12-15k per month.
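Back-of-the-envelope, that's roughly a 10-20x difference; a quick check on the stated ranges:

    # Stated monthly costs (USD) from the comment above.
    aws = (150_000, 250_000)  # estimated AWS bill
    own = (12_000, 15_000)    # actual cost on owned infra

    ratio_low = aws[0] / own[1]   # 10.0x at the conservative end
    ratio_high = aws[1] / own[0]  # ~20.8x at the aggressive end
    annual_savings = (sum(aws) - sum(own)) / 2 * 12  # midpoints, per year

    print(f"{ratio_low:.1f}x-{ratio_high:.1f}x cheaper, "
          f"~${annual_savings:,.0f}/yr saved at the midpoints")
    # -> 10.0x-20.8x cheaper, ~$2,238,000/yr saved at the midpoints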
It's definitely fun to have access to this much content though.
Additionally, our customers are brilliant and we get to work with the CTOs of some very cool companies which is always fun!
You've now given me (and others) a reason to download your app. I've downloaded the PDFs and am tracking my reading progress with your app.
I would consider making PDF reading lists and sharing them (with a great pitch to use your app) as a marketing effort in multiple verticals.
Edit: Nvm, just saw the git icon. Awesome!
I too built an application for HN: https://hnprofile.com/
If you're interested, I'd be happy to discuss it, as what I focused on was ranking content (not indexing exactly). There might be some interesting synergy. The system requires a fraction of the data of a regular search engine and is often more effective.
One of them was a broken link (at least from here):
#15 Cognitive Distortions of People Who Get Stuff Done (2012) [pdf]
I did something like this for a side project; you can check out the code: https://github.com/afallon02/pocket2kindle
Right now Polar can import a whole directory full of PDFs, but that doesn't really handle tagging.
I might end up building a file format so that we can do imports.
So you could take the Pocket RSS list, then convert it to the Polar import format, then just import that directly.
We would probably try to bundle up standard importers though.
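As a sketch, the Pocket-to-Polar path could look like this — note that the JSON import schema and the converter below are purely hypothetical; Polar doesn't define an import format yet:

    import json
    import xml.etree.ElementTree as ET

    def pocket_rss_to_polar(rss_path, out_path):
        """Convert a Pocket RSS export into a hypothetical Polar import file.

        The schema (version/documents/url/title/tags) is invented for
        illustration and is not something Polar currently reads.
        """
        tree = ET.parse(rss_path)
        docs = []
        for item in tree.getroot().iter("item"):
            docs.append({
                "url": item.findtext("link"),
                "title": item.findtext("title"),
                "tags": [c.text for c in item.findall("category")],
            })
        with open(out_path, "w") as f:
            json.dump({"version": 1, "documents": docs}, f, indent=2)

    # pocket_rss_to_polar("pocket.xml", "polar-import.json")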
    import re
    import requests
    from bs4 import BeautifulSoup

    def download_file(download_url, name):
        # create response object
        r = requests.get(download_url, stream=True)
        # download started
        with open("repo" + name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                if chunk:
                    f.write(chunk)

    html = requests.get("https://getpolarized.io/2019/01/08/top-pdfs-of-2018-hackernews.html")
    soup = BeautifulSoup(html.content)
    sAll = soup.findAll("a")

    for href in sAll:
        if href.has_attr('href'):
            link = href['href']
            if link.find(".pdf") > 0:
                print(link)
                last_index = link.rindex("/")
                name = link[last_index + 1:]
                print(name)
                try:
                    download_file(link, name)
                except:
                    print("error downloading " + link)
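Here's a cleaned-up version of the same script, with an explicit BeautifulSoup parser and print_function imported for Python 2 compatibility: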
    from __future__ import print_function

    import re
    import requests
    from bs4 import BeautifulSoup

    def download_file(download_url, name):
        r = requests.get(download_url, stream=True)
        with open("repo" + name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                if chunk:
                    f.write(chunk)

    html = requests.get("https://getpolarized.io/2019/01/08/top-pdfs-of-2018-hackernews.html")
    soup = BeautifulSoup(html.content, features='html.parser')
    sAll = soup.findAll("a")

    for href in sAll:
        if href.has_attr('href'):
            link = href['href']
            if link.find(".pdf") > 0:
                print(link)
                last_index = link.rindex("/")
                name = link[last_index + 1:]
                print(name)
                try:
                    download_file(link, name)
                except:
                    print("error downloading " + link)
It's also interesting that it differs from a search performed in HN's own search system:
"[pdf]" filtered to Past Year -> 4,422 hits.
I posted this article and it reached a score of 352:
22. Software-Defined Radio for Engineers [pdf] score: 292
Software-Defined Radio for Engineers [pdf] (analog.com) 352 points by app4soft 6 months ago | 50 comments
That's the power of "Above the fold" in action.