EC2's most dangerous feature (daemonology.net)
256 points by dwaxe on Oct 9, 2016 | hide | past | favorite | 101 comments


The blog post buries the lede a little bit because it's talking about lots of pain points with the EC2 API and IAM. The important point to take away is that any process with network access running on your instance can contact the EC2 metadata service at http://169.254.169.254 and get the instance-specific IAM credentials.

Think about things like services that accept user submitted URLs, crawl them, and display results...
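To make the exposure concrete, here's a minimal sketch (Python, standard library only) of what any local process -- or a URL fetcher tricked via SSRF -- can read from the metadata service; no authentication is involved:

  # List the role name attached to the instance profile, then fetch the
  # temporary credentials for that role (AccessKeyId, SecretAccessKey, Token).
  import json
  import urllib.request

  BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

  role = urllib.request.urlopen(BASE, timeout=2).read().decode().strip()
  creds = json.loads(urllib.request.urlopen(BASE + role, timeout=2).read())

  print(creds["AccessKeyId"], creds["Expiration"])

Anything that can be coaxed into issuing those GETs on your behalf ends up holding the same credentials your instance role has.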


This is actually a vulnerability I've seen countless times. If a site accepts a URL which it reads and returns to the user, submit a URL pointing at the 169.254.169.254 metadata service. About 1 out of 5 times I've tried it, I'm able to get a response.


I think that's another example of bad filtering, the same thing that happens if you accept a URL and don't disallow `file:///etc/passwd`.


Sometimes you can do this with things like smtp servers as well.


Fun fact: you can ignore EC2 instance roles and use the Amazon Cognito service for processes to obtain role-based short-term credentials.

I've described it previously as "Kerberos for the AWS Cloud" (which will make any self-respecting crypto nerd squirm) but hopefully it conveys the general idea. Yes it was designed for mobile & browser use, and yes the API isn't pretty, but it's there.
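For anyone curious what that looks like in practice, here is a rough sketch with boto3; the identity pool ID is a placeholder, the pool has to allow unauthenticated identities (or be passed provider logins), and the credential field names are as I remember them from the Cognito GetCredentialsForIdentity response:

  import boto3

  cognito = boto3.client("cognito-identity", region_name="us-east-1")

  # Hypothetical identity pool; its associated IAM role scopes what the
  # returned short-term credentials are allowed to do.
  identity = cognito.get_id(
      IdentityPoolId="us-east-1:00000000-0000-0000-0000-000000000000")
  creds = cognito.get_credentials_for_identity(
      IdentityId=identity["IdentityId"])["Credentials"]

  # Use the temporary credentials for subsequent AWS calls.
  session = boto3.Session(
      aws_access_key_id=creds["AccessKeyId"],
      aws_secret_access_key=creds["SecretKey"],
      aws_session_token=creds["SessionToken"],
  )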


do you know if any similar vulnerability exists with Azure or Google Cloud?

edit: not sure why the downvote, I use Google Cloud, so honest question :(


There is a local metadata API for Azure but it is fairly simple right now.

Some limited instance info about your host ID, fault domain, and update domain: curl http://169.254.169.254/metadata/v1/InstanceInfo

Poll this regularly to find out when your VM is about to go down for maintenance: http://169.254.169.254/metadata/v1/maintenance

Those are the only two endpoints I know of. I encouraged them to require a special header for access to this API before releasing to the public but it looks like it was not included. A special header would help prevent your app from being able to access this from a user specified URL.

Google requires a header to help prevent this: https://cloud.google.com/compute/docs/storing-retrieving-met...
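For reference, a quick sketch of what that looks like against GCE's metadata server -- the request is rejected unless the Metadata-Flavor header is present, which is exactly what blocks the naive "fetch this user-supplied URL" case:

  import urllib.request

  req = urllib.request.Request(
      "http://169.254.169.254/computeMetadata/v1/instance/id",
      headers={"Metadata-Flavor": "Google"},  # required, or the server refuses
  )
  print(urllib.request.urlopen(req, timeout=2).read().decode())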


thank you so much for posting these extra details. My product is hosted on Google Cloud and allows users to arbitrarily craft http requests (including headers!)

I try to limit the possibility of abuse (restricting protocol to http, https, //, data, or ftp) but didn't know about this metadata issue.

I updated my product to account for this issue too.

it looks like google's metadata doesn't leak any important secrets (unless I had custom metadata, which i do not) but better safe than sorry!


You are getting downvoted because it's not a vulnerability.

It's a well-documented and very useful feature in AWS, and it also exists in gcloud, although gcloud's documentation is not as good^1.

It only becomes a vulnerability if you don't read the documentation AND don't follow best practices^2.

I don't know if azure provides this feature.

^1 I use gcloud now and have used aws in the past.

^2 Don't run unknown code without proper sandboxing


It's borderline a privilege escalation vulnerability. The default behavior is that an unprivileged user on the system with network access can get access to an instance's credentials that can be used to perform administrative functions.

Like Colin pointed out in the blog post, this completely subverts the permissions model in modern operating systems.


No, it is not a vulnerability. Misunderstanding of this functionality can result in a developer introducing a vulnerability, but this is a well-described, well-defined feature.

Calling this a vulnerability is akin to calling the existence of `rm` a vulnerability because it can delete files.


I didn't mean to suggest it being what it isn't.

I allow usage patterns similar to what is being described, so it is a vulnerability in something, be it my fault or not.


If your service allows arbitrary URL queries that a user can trigger, then you should make sure that you only allow queries to publicly routable IP ranges anyway.

169.254.0.0/16 is the link-local range, which you should be filtering, along with publicly routable IP ranges whose owners might be very upset if you access them (like .mil reserved ranges). Go as far as to only allow DNS names instead of arbitrary IPs, keeping in mind DNS names may resolve to non-publicly-routable ranges or ranges you may not wish to access. These are all standard dangers of making queries on a user's behalf.

Good list of ipv4 ranges you should not allow: https://github.com/robertdavidgraham/masscan/blob/master/dat...
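A minimal sketch of that kind of filter in Python, using the stdlib ipaddress module (which already knows about link-local, RFC 1918, loopback, and reserved ranges); it still needs to be combined with the resolve-once/pinning approach discussed elsewhere in the thread to avoid rebinding races:

  import ipaddress
  import socket
  from urllib.parse import urlparse

  def is_url_safe_to_fetch(url: str) -> bool:
      host = urlparse(url).hostname
      if host is None:
          return False
      try:
          # Check every address the name resolves to, not just the first one.
          addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
      except socket.gaierror:
          return False
      for addr in addrs:
          ip = ipaddress.ip_address(addr)
          if not ip.is_global:  # rejects private, link-local, loopback, reserved
              return False
      return True

  # is_url_safe_to_fetch("http://169.254.169.254/latest/meta-data/")  -> False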


....and that you can imitate the metadata service to make life easier :) A plug for my friend's side project: https://github.com/otm/limes . It's a local metadata service. Very handy for making AWS libs work without having to configure them much. And it has great support for MFA.


I believe Pocket faced exactly this issue once upon a time.



What I've done for a previous company was to, as one of the very first things done within every EC2 instance, add an iptables owner match rule to only allow packets destined to 169.254.169.254 if they come from uid 0. Any information from that webservice that non-root users might need (for instance, the EC2 instance ID) is fetched on boot by a script running as root and left somewhere in the filesystem.
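A rough sketch of such a boot-time fetcher, to be run once as root after the iptables owner-match rule is in place; the cache directory and the list of items are just examples:

  import os
  import urllib.request

  ITEMS = ["instance-id", "placement/availability-zone", "public-ipv4"]
  CACHE_DIR = "/run/ec2-metadata"  # hypothetical location

  os.makedirs(CACHE_DIR, exist_ok=True)
  for item in ITEMS:
      data = urllib.request.urlopen(
          "http://169.254.169.254/latest/meta-data/" + item, timeout=2
      ).read()
      path = os.path.join(CACHE_DIR, item.replace("/", "-"))
      with open(path, "wb") as f:
          f.write(data)
      os.chmod(path, 0o644)  # world-readable: this is the non-secret metadata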


This won't help with IAM roles, since the credentials provided in the metadata expire. Of course, a small tweak to the iptables entry would help there as well.

Mind posting your entry for us iptables-impaired folks?


I don't work there anymore, so I don't have access to the exact rule I used, but IIRC it was something like

  iptables -t filter -I OUTPUT -d 169.254.169.254 -m owner \! --uid-owner 0 -j REJECT --reject-with icmp-admin-prohibited


But the data is not static.


Most of it is - your instance ID, network position, etc, will never change after boot (or 'start' if you stop it) - so caching it is just fine. There's very little that will change on a running instance except of course, the IAM credentials referred to in this article (as they expire within 90 minutes IIRC).


The irony is that the attempt to make the instance creds secure by rotation actually prevents protecting them in this fashion. A local file, readable only by root, w/ embedded keys is actually far more secure than the current implementation.


It's easy to implement that yourself. Generate a pair of IAM keys and drop them on the filesystem, or bake them into the AMI or application or whatever.

What they have reduces the duration of a vulnerability: if you know someone had access to your machine at some point in time, you can figure out from there how long their keys would have lasted and scope down the timeframe to start digging within CloudTrail.
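If you wanted to script the "bake your own keys" route, a rough sketch might look like this (the IAM user name and file location are hypothetical; the obvious downside is that nothing rotates these keys for you):

  import os
  import boto3

  iam = boto3.client("iam")
  # Create long-lived keys for a hypothetical low-privilege IAM user.
  key = iam.create_access_key(UserName="my-app-user")["AccessKey"]

  cred_path = "/root/.aws/credentials"
  os.makedirs(os.path.dirname(cred_path), exist_ok=True)
  with open(cred_path, "w") as f:
      f.write(
          "[default]\n"
          f"aws_access_key_id = {key['AccessKeyId']}\n"
          f"aws_secret_access_key = {key['SecretAccessKey']}\n"
      )
  os.chmod(cred_path, 0o600)  # readable only by root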


That is the obvious answer, do you have any scripts you could share?


Hopefully the operators using EC2 instance profiles understand and weigh the risks of using that feature. It's good to be cautious, but the feature is only dangerous if you don't take the time to understand it. Running a server on the Internet at all is "dangerous" in the same sense. And for this particular risk, it turns out there's a simple fix.

He _is_ right in his first criticism that the IAM access controls available for much of the AWS API are entirely inadequate. In the case of EC2 in particular, it's all or nothing--either your credentials can call the TerminateInstances API or they can't. I'm sure Amazon is working on improving things, but for now it's pretty terrible. But in practice it just means you have to take care in different ways than you would if his tag-based authz solution were implemented.

That said, while it's certainly frustrating to an implementor, it's not "dangerous" that limitations exist in these APIs. We're talking about decade-old APIs from the earliest days of AWS, and while things have been added, the core APIs are still the same. That's an amazing success story. But like any piece of software, there are issues that experienced users learn how to work around.

You can bet that the EC2 API code is hard and scary to deal with for its maintainers. Adding a huge new permissions scheme is likely nearly impossible without a total rewrite... I don't envy them their task.


It's impossible to limit access to any part of the instance metadata in any way w/o firewalling (which has its own issues) or even to expire access to any part of it. Since instance profiles have keys (even though automatically rotated), any process on the system, owned by any user, can access anything exposed via the instance role. This makes embedding IAM keys into your instance and protecting them with root-only permissions or ACLs MUCH MUCH safer... but AWS specifically states that instance profiles are preferred. In fact, for our Userify AWS instances (ssh key management), we are required to use instance roles and not allowed to offer the option. (This is why we do not offer S3 bucket storage on our AWS instances but we do on Pro and Enterprise self-hosted.)

The biggest issue with the IAM instance profiles is that they trade security for convenience.. and it's not a good trade.


For the most part EC2 instances should be single-purpose. Use tiny instances that do one job. Your IAM role describes the permissions that should be granted to that one job. It's absolutely true that you cannot isolate permissions at the process level, but by using single-job-type instances, you can easily isolate permissions on a per-job (in this model, per-instance) basis.


What? Why should EC2 instances be single-purpose? Amazon offers a wide variety of massive instance sizes with 160+gb of RAM and 30+ cores. It's extremely common to run software like mesos, kubernetes, docker, etc on these. Dedicating an instance per app is extremely cost-ineffective.


EC2 instances should be single-purpose (Or, if you want to mux containers onto the instance and retain per-container/job IAM role isolation, use ECS) if you're developing for AWS as a platform. I'm a huge fan of k8s, and have respect for Mesos, but these are largely alternatives to the model provided by EC2/ECS/IAM.

In a perfect world, any service would cleanly interoperate with any other service. Unfortunately we don't live in a perfect world. If you want to take advantage of `advanced` features in a given platform, you have to understand the drawbacks and limitations of those features, and what it means when they aren't available on another platform.

To me, the greatest tragedy in the way EC2 operates is that it looks/tastes/smells like a `server`, but it's far more akin to a process.


Well.. an EC2 instance running Linux is not a process or even a container, even if functionally it's easier to treat it like one.

It is a full virtual server with its own Linux kernel and operating system: it has to be updated, secured, and maintained just like any other Linux server. Most Linux distributions on an EC2 instance have dozens of processes already running out of the box.

I understand your point -- that ideally a single instance can be treated as a single functional point from the point of view of the application, and I agree, but not from a point of view of security. As you know, in any larger environment, there are likely many additional support applications running on that server: things like app server monitoring, file integrity, logging, management, security checks, remote data access or local databases, etc. Those must not all be treated with the same levels of security and access. (i.e., why would rsyslog or systemd need access to all objects in our S3 bucket or be able to delete instances or any of the other rights that might legitimately be granted to an instance via an IAM instance role?)

To treat security for all of these processes as if they're all part of the same app tosses out decades of operating system development and security principles and places your single-function app, as well as your entire environment, at grave risk. I.e., there's a reason why a typical Linux distribution has about 50 accounts right out of the box and everything doesn't just run as root.

If you are developing or deploying microservices or containers and don't want to be burdened by the security requirements, then there are alternatives at AWS like ECS and Lambda that you should seriously consider.


Amusing examples, systemd/rsyslog, as both at least briefly execute as root, with rsyslog being relied upon to willingly drop its own privs (Not to mention being slowly replaced by systemd-journald, which runs as root), and systemd always running as root (ya know, since it's init, and all).

It really sounds like we have vastly different ideas about what kinds of processes belong in an EC2 instance, as well as the ideal life-cycle of an EC2 instance. I tend to adopt a strategy of relatively short-lived EC2 instances that get killed and replaced frequently. Persistence that depends on a single instance surviving is avoided at all costs, in favor of persistence distributed across a number of instances (or punted out to Dynamo/S3/RDS).

You're absolutely right that there is a reason why the typical Linux distro has 50 accounts out of the box -- it was built with traditional multi-user system security models in mind. I sure as hell appreciate it on workstations and traditional stateful hosts. That said, eschewing the traditional security model in favor of an alternative model does not make your environment inherently more or less safe -- there are going to be pros and cons to both approaches (in terms of both security and functionality).


I agree that it's important to do your research, but Amazon does us no favors here. I didn't know about this potential leakage until I needed to use the metadata system in AWS, and then I realized the potential for abuse. Honestly, this should probably be something you explicitly enable, and it should be off by default.

The fact is that Amazon provides a commodity service, and most people don't expect a standard host to have an internal HTTP service that exposes potentially sensitive information to non-root users.

I actually disagree with the OP where he says they should use Xen Store for metadata. If I were Amazon, there is no way I would want to commit to using an option that is specific to one hypervisor technology. What if Amazon wants to switch to KVM?


If Amazon wants to switch to KVM, they're going to have so many other things they need to change that adjusting how instance metadata is exported will be the least of their problems.


I appreciate your comment!

> either your credentials can call the TerminateInstances API or they can't

Note that you can restrict the inputs to this API using IAM Policy in semantically meaningful ways. Three controls I'm familiar with that are useful for restricting inputs are the resource type for instances and the conditions for instance profile and resource tags [1]. The latter two are most flexible.

An instance profile restriction allows you to express a concept like, "This user may only terminate instances that are part of this specific instance profile"; in that way, the instance profile characterizes a collection of instances that can be affected by the policy. The resource tag condition can be used in a similar way. [2] is an example of a policy restricting terminations based on instance profile. The key fragment of it is:

  "ArnEquals": {"ec2:InstanceProfile":
    "arn:aws:iam::123456789012:instance-profile/example"}
A role with this policy condition can only affect instances that are part of the specified instance profile.

This allows you to create roles or users that have access to instances that are part of a certain instance profile only. If you wanted a group of instances to be able to manage (e.g. terminate) themselves, then the role on those instances could be access-restricted to the instance profile of those same instances. By assigning different fleets of instances different instance profiles, you can control which users or roles can access each fleet by restricting access to the fleet's Instance Profile. Similar restrictions are possible with resource tags on instances.
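To make that concrete, here is a hedged sketch of a full policy statement built around the condition fragment above and attached with boto3 (the account ID, role name, policy name, and action list are all placeholders):

  import json
  import boto3

  policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Action": ["ec2:TerminateInstances", "ec2:StopInstances"],
          "Resource": "*",
          "Condition": {
              "ArnEquals": {
                  "ec2:InstanceProfile":
                      "arn:aws:iam::123456789012:instance-profile/example"
              }
          },
      }],
  }

  # Attach the policy inline to a hypothetical role.
  boto3.client("iam").put_role_policy(
      RoleName="example-role",
      PolicyName="terminate-own-fleet-only",
      PolicyDocument=json.dumps(policy),
  )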

That said, though, I agree that there's room to improve the access control story. Managing instances through their full lifecycle sometimes involves accessing other resources like EBS volumes too, and it's not easy to construct a policy container that sandboxes access to just the right resources and actions while allowing the creation of new resources. Colin called out some of the gaps in his post. If you do not need to allow the creation of new resources then the problem is a bit easier. For example, you can avoid the need to create new EBS volumes directly by specifying EBS root volumes as part of instance creation using BlockDeviceMapping.

[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-suppo... [2] https://gist.github.com/jcrites/d6826fc57b17c3c0ac50cae1fc9b...


tl;dr:

1) IAM instance roles have no security mechanisms to protect them from being read by any process on the instance, thus completely removing them from the protection of Linux/UNIX/Windows permission systems. (The real reason for this is that instance metadata was a convenient semi-public information store for things like instance ID, but it was extended to also provide secret material, which was, at best, an idiotic move.) As the author points out, Xen already provided a great filesystem alternative that could be mounted as another drive (or network drive) to be managed with the regular OS filesystem permission system (reading an instance ID is just a matter of reading a "file")... for some reason, AWS didn't leverage this and instead just added the secret material to its local instance metadata webserver.

2) the API calls are not fine grained enough and/or there are big holes in their coverage -- so, for instance, if you want to use some other AWS services, you can end up exposing much more than you intended.


This is interesting! Can this be abused with AWS-hosted services that reach out to fetch URLs? For example, image hosts that allow specifying a URL to retrieve, or OAuth callbacks, etc? Are there any tricks to be played if someone were to register a random domain and point it to 169.254.169.254 (or worse, flux between 169.254.169.254 and a public IP, in case there is blacklisting application code that first checks to resolve the hostname but then passes the whole URL into a library that resolves again)?


That's a fairly common vulnerability. A good approach for services that need to fetch arbitrary URLs is roughly:

1. Resolve hostname and remember the response

2. Verify that the response does not contain any addresses in a private IP space, or any other IP that is only accessible to you

3. Use the IP from step 1 when establishing a connection

With other solutions, you might end up being vulnerable to DNS rebinding attacks.

Bonus points for doing all your URL fetching in some sort of sandbox that enforces these access rules.
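A minimal sketch of steps 1-3 for plain HTTP (HTTPS needs extra care around SNI and certificate validation against the original hostname, which is omitted here):

  import http.client
  import ipaddress
  import socket
  from urllib.parse import urlparse

  def fetch_pinned(url: str, timeout: float = 5.0) -> bytes:
      parts = urlparse(url)
      if parts.scheme != "http" or parts.hostname is None:
          raise ValueError("only plain http handled in this sketch")

      # Step 1: resolve once and remember the answer.
      addr = socket.getaddrinfo(parts.hostname, parts.port or 80)[0][4][0]
      # Step 2: refuse private, link-local, loopback, and reserved addresses.
      if not ipaddress.ip_address(addr).is_global:
          raise ValueError("refusing to fetch non-public address " + addr)

      # Step 3: connect to the pinned IP, sending the original Host header.
      conn = http.client.HTTPConnection(addr, parts.port or 80, timeout=timeout)
      conn.request("GET", parts.path or "/", headers={"Host": parts.hostname})
      return conn.getresponse().read()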


This is accurate.

Remember that even ELBs in AWS have IPs that change all the time, and this itself is actually a source of vulnerabilities from apps that don't respect DNS TTLs (as has been seen in the forums repeatedly -- apps get connected to the previous IP instead of the new one). It's probably safer to retrieve and verify the IP for each request, and just cache whether the IP is 'safe'. (And just doing IP subnet calculations is non-trivial in most less-common languages.)

Also, request throttling and HTTP verb checking should be maintained, to prevent being turned into a proxy for other attacks.

Actually, any decision to accept an arbitrary URL should be carefully examined in light of how hard it is to do safely.


What if we just check whether target host is 169.254.169.254 before allowing an HTTP call?


As long as you manually get the IP for every domain. I.e. if they ask for "blah.com" you have to get the IP, check it, then turn it into "curl -H 'Host: blah.com' http://IP". (Otherwise, there may be a race condition that allows the DNS server to resolve to a different IP address the 2nd time. See https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use )


The EC2 documentation specifically warns against allowing yourself to be tricked into loading your metadata for someone else.

So yes, there are probably lots of services which get this wrong; but they did at least get a warning about that particular failure mode.


Yes, one project I've worked on required a crawling bot, and you could crawl the metadata service... (until fixed). You don't even need a domain in most cases; either the IP or the instance-data DNS name works in a bunch of places, I bet.

We redid all our policies to be extremely restrictive in response, if the instance did anything based on user input. Anything more admin-like happened on a different machine.


For real-world, complicated applications, that's virtually always a game-over vulnerability no matter what cloud you're deployed on. Be extraordinarily careful with backend code that generates HTTP requests based on user inputs.

Pentesters have been using this trick to pivot from unexpected backend web proxies to (e.g.) management consoles, LOM servers, JBoss interfaces, &c for 15 years or so already.


This is an interesting attack that I must confess I hadn't thought of, but surely any service that accepts an arbitrary URL has a list of IP ranges to avoid. However, to harden a role in the event of instance role credentials leaking, you could use an IAM Condition [0].

There is actually an example of this in the IAM documentation [1], although the source VPC parameter doesn't work for all services, and I can't see a list of services that support this parameter. This would ensure that the requests actually came from instances within your VPC.

[0] http://docs.aws.amazon.com/IAM/latest/UserGuide/reference_po...

[1] http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucke...


The point is not requests that originate elsewhere. The point is that this system is not protected in any way from any other process on your system.


Er, the problem is not "people can hit an un-routable IP from outside your instance". The problem is that "if your instance allows an attacker to make a HTTP request, you might expose personal information". For example, web crawlers or other fetchers.


> almost as trivial for EC2 instances to expose XenStore as a filesystem to which standard UNIX permissions could be applied, providing IAM Role credentials with the full range of access control functionality which UNIX affords to files stored on disk.

Doesn't this become more complicated when you think about EC2 offering Windows instances? Even with straight UNIX file writing, what writes this? Where does it write this? Which user has read permissions?


In UNIX, the same way that EBS volumes are mounted... think of the /proc or /sys virtual filesystems.

In Windows, I'm guessing that this would be exposed as a network drive.


My point is that these types of solutions create a lot more overhead, inconsistency, and variation compared to an HTTP request; granted, with less security.


I'm sure that's why they went with HTTP -- it is universal and will work the same way everywhere.

I still think they should just disable it by default, so you have to "opt-in" to the potential security risk and plan accordingly.


Yeah, having the metadata available over an http interface is actually brilliant. Simple HTTP calls are easy to do from any network-capable OS or language.


So is reading a file on the filesystem.. easier, actually in most languages, since HTTP requests usually require loading an extra library.


A library which uses the same (built-in) underlying IO mechanisms as the (built-in) filesystem.


I've used firewall rules in the past to scope the metadata store to admin users.


That's better than nothing, but runs into problems if you have different users who need to be able to access different subsets of the metadata store.


An admin user could fetch these subsets of the metadata and leave a copy of them in the local filesystem.


That doesn't work for IAM Roles, because (a) AWS library code expects to get the keys out of the metadata store, and (b) IAM Role credentials are periodically rotated, so the keys you downloaded in advance would expire.


You're right it doesn't. IAM roles are intended to be granted to the entire server, not a subset of the server's users. Any compromise of the server would be considered a compromise of its role. Yeah, this is a bit crazy depending on what you're running on it. I was running multi-tenant IIS hosts and the apps had no business with the metadata or ec2 IAM roles in my case.

If you want roles to work for other users via the metadata store, you can intercept requests with a proxy and then grab temporary credentials via STS AssumeRole. This is how kube2iam works. Depending on your use case you'd have to write the proxy, automate the mappings and firewall rules, etc etc etc. PITA but probably doable.
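For the curious, a very rough sketch of that proxy idea -- intercept the credentials path and answer with sts:AssumeRole output instead of the instance's own role. The role ARN, port, and all the routing/iptables plumbing that actually redirects 169.254.169.254 traffic to this process are hypothetical:

  import json
  from http.server import BaseHTTPRequestHandler, HTTPServer

  import boto3

  CRED_PATH = "/latest/meta-data/iam/security-credentials/"
  sts = boto3.client("sts")

  class MetadataProxy(BaseHTTPRequestHandler):
      def do_GET(self):
          if not self.path.startswith(CRED_PATH):
              self.send_error(404)
              return
          # Hand out per-tenant credentials instead of the instance's own role.
          creds = sts.assume_role(
              RoleArn="arn:aws:iam::123456789012:role/tenant-role",  # hypothetical
              RoleSessionName="metadata-proxy",
          )["Credentials"]
          body = json.dumps({
              "Code": "Success",
              "AccessKeyId": creds["AccessKeyId"],
              "SecretAccessKey": creds["SecretAccessKey"],
              "Token": creds["SessionToken"],
              "Expiration": creds["Expiration"].strftime("%Y-%m-%dT%H:%M:%SZ"),
          }).encode()
          self.send_response(200)
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      HTTPServer(("127.0.0.1", 8181), MetadataProxy).serve_forever()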

On a different note, I agree with just about everything you had to say about the PITA that is IAM. Properly scoping permissions is much harder than it needs to be. Not all resources support tags, and even then almost nothing outside of ec2 supports tag conditions in IAM. This leads to many naming schemas and wildcard resource conditions :( AWS should have created the concept of resource groups. This would have greatly simplified giving users permissions to subsets of an accounts resources. I chalk this up to AWS's VERY poor collaboration between service teams. Nothing that came out seemed coordinated. This appears to be getting better (shrug).


It's true that instance profile roles today supply credentials to the entire server. One benefit of virtualization is that it's reasonable to run small, single-purpose VMs. However if you do wish to restrict role credentials to certain processes, there are ways of doing it, such as using EC2 Container Service with task-level IAM Roles [1]:

> Credential Isolation: A container can only retrieve credentials for the IAM role that is defined in the task definition to which it belongs; a container never has access to credentials that are intended for another container that belongs to another task.

If you do firewall the instance metadata service and want to get credentials into individual processes, then you could do that using one of the credential providers in the AWS SDK. I haven't worked with every language SDK, but service clients in the SDK for Java take an AWSCredentialsProvider as input, and you can pick from a number of standard implementations [2] or define a custom one.

> An admin user could fetch these subsets of the metadata and leave a copy of them in the local filesystem.

So if you wanted to take this approach, an admin agent could periodically copy the role credentials as property files into the home directories of users that need them, and then applications could load them by configuring the SDK with ProfileCredentialsProvider (which can refresh credentials periodically). The admin agent could perhaps be a shell script run by cron that `curl`s from the instance metadata service and writes the output to designated files.

[1] http://docs.aws.amazon.com/AmazonECS/latest/developerguide/t... [2] See ProfileCredentialsProvider and DefaultAWSCredentialsProviderChain
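A sketch of what that cron'd admin agent might look like (the user list and paths are hypothetical; it would need to run more often than the credential lifetime, and chown the files to their owners):

  import json
  import os
  import urllib.request

  BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
  USERS = {"appuser": "/home/appuser"}  # hypothetical users -> home dirs

  role = urllib.request.urlopen(BASE, timeout=2).read().decode().strip()
  creds = json.loads(urllib.request.urlopen(BASE + role, timeout=2).read())

  profile = (
      "[default]\n"
      f"aws_access_key_id = {creds['AccessKeyId']}\n"
      f"aws_secret_access_key = {creds['SecretAccessKey']}\n"
      f"aws_session_token = {creds['Token']}\n"
  )

  for user, home in USERS.items():
      cred_dir = os.path.join(home, ".aws")
      os.makedirs(cred_dir, exist_ok=True)
      path = os.path.join(cred_dir, "credentials")
      with open(path, "w") as f:
          f.write(profile)
      os.chmod(path, 0o600)
      # In practice you'd also chown the file to the target user.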


> One benefit of virtualization is that it's reasonable to run small, single-purpose VMs.

Single-purpose doesn't mean single-user. Lots of services divide their code into "privileged" and "unprivileged" components in order to reduce the impact of a vulnerability in the code which does not require privileges. As far as I'm aware, there's no way to have an sshd process which is divided between two EC2 Containers...


EC2 instances are designed to enforce isolation between instances, not processes. Presumably there would only be one primary service running on each.

Using AWS, you are pushed towards an architecture based on containers and services. AWS is the OS, not any individual machine.


Until AWS fixes this (which, as the article points out, may never happen), a chmod'ed 600 file (only readable by root) is actually much safer, even when STS auto-rotation is taken into account.


If users can issue arbitrary commands on an instance then that instance should have zero IAM roles and should delegate actions to services running on separate instances.

The instances hosting our users go a step further and null-route metadata service requests via iptables.


It isn't just about users, it's also about malicious software you may accidentally install, if for example a library you use is compromised, as has happened before with Ruby gems.


It's much the same problem with Google Cloud. Even worse there, I'd say.


Could you please elaborate? I'm not doubting you, just very interested in learning more.


I just double checked, and the most similar thing we expose is the tokens for each service account in the instance metadata. As pointed out in the article, any uid on the box can read that. But, you can create instances with a zero-permission service account (the equivalent of nobody?) and just avoid it.

This does mean that everywhere else you'd have to have explicit service accounts and such, but that seems like a reasonable "workaround" until or unless we make metadata access more granular (I like the block device idea! Would you want entirely different paths for JSON versus "plain" though?)


Google Cloud does seem better here. The exception is GKE — Kubernetes nodes are associated with service accounts which have permissions that, if abused by a malicious Docker container, could be disastrous for your entire cluster.

Considering the amount of unpatched Docker containers out there, that's a bit scary. It also effectively prevents GKE from being usable in any scenario where you want to schedule containers on behalf of third-party actors (think PaaS). (GKE also doesn't let you disable privileged Docker containers, but that's another story.)

On AWS you can run a metadata proxy to prevent pods from getting the credentials, but I don't know of a clean way to accomplish the same thing on GKE.


If you're sharing the same instance for multiple users, trying to achieve security among the users is almost impossible anyway. That's why physical separation/virtualization is one of the first things to focus on when talking about security.


Isolation is definitely important, but not all parts of the system running a single function need the same levels of access, and in fact it may be possible to target those components separately. Take a look at the wikipedia articles for 'defense in depth' or 'privilege separation' to see how important it is inside a system to treat each component isolated to itself as much as possible. (This is also why you don't want to rely on only a perimeter firewall for access control.)


IAM instance roles are still an improvement over how it was typically done in the past: hard-coding the same key in a configuration file and deploying it everywhere.

It's a balance between security and convenience.


I wonder how many services on Amazon allow user-configurable webhooks that can be pointed to http://169.254.169.254 ...


Same happens with metadata access in Openstack.

The access is controlled by source IP (and namespace). I wonder if it's possible to spoof the IP and access Metadata of other servers/users.


It has been the case for 10 years and everyone knows that; I don't see the problem. If you're not happy with it just use API keys.


IAM instance roles were only added in 2012: https://aws.amazon.com/blogs/aws/iam-roles-for-ec2-instances...


OK, dwaxe, I have to ask: Are you a robot? Because I uploaded this blog post, tweeted it, and then came straight over here to submit it and you still got here first.

Not that I mind, but getting your HN submission in within 30 seconds of my blog post going up is very impressive if you're not a robot.


Yes I am. This is my personal account, but I use it to automatically post to Hacker News. I was playing around with BigQuery one day and found the Hacker News dataset [1]. From my experience with the Reddit submissions dataset [2], I knew that I could compose this query,

  SELECT AVG(score) AS avg_score, COUNT(*) AS num,
         REGEXP_EXTRACT(url, r'//([^/]*)/') AS domain
  FROM [fh-bigquery:hackernews.full_201510]
  WHERE score IS NOT NULL AND url <> ''
  GROUP BY domain
  HAVING num > 10
  ORDER BY avg_score DESC

which returns a list of domains with more than ten submissions sorted by average score. This turns out to be a list of some of the most successful tech blogs on the internet, as well as various YCombinator related materials. Out of the domains with over 100 submissions, daemonology.net has the 9th highest average score per submission. I manually visited all the domains with more than about 30 submissions, found the appropriate xml feeds, and saved them. I added a few websites like eff.org whose messages I think everyone should read anyways.

Then I jumped into python and started trying to figure out how to post to Hacker News. It was a little more complicated than I anticipated [3], but an open source HN app for Android helped me figure it out.

I set up a cron job on my $5 Digital Ocean droplet that runs the script every few minutes (pseudocode):

  if you can reach http://news.ycombinator.com:
      check all feeds for new entries
      post a new entry to HN
      sleep for an hour before posting another

[1] https://bigquery.cloud.google.com/dataset/bigquery-public-da...

[2] The only difference on Reddit is the subreddit system.

[3] After you send a POST request to send to the login screen, Hacker News gives you a url with a unique "fnid" parameter, and you send another POST request to another url with the appropriate "fnid".


We appreciate both the cleverness here and your detailed explanation. But could you please not do this anymore? It isn't malicious, but it's unhealthy for HN's ecosystem. For example, when an author submits his or her own work, that can add a lot of value to the community—but your bot pre-empts that, as it indeed did in this case.

There are many more reasons why this isn't a good thing for HN. For example, it's better for submissions from popular sites to be distributed across a wide range of accounts. That gives more users a chance to feel like they're making important contributions, and gives the community (and authors) a clearer sense of the audience.

There are lots of ways to write software to interact with HN, and lots of users with the ability to do it, so we really depend on the good will of the community only to do that when it serves the whole.


Yes, there's definitely a script at work.

Compare these timestamps (hover over date): https://hn.algolia.com/?query=dwaxe%20eff&sort=byPopularity&... To the RSS feed: https://www.eff.org/rss/updates.xml

Humans don't post everything from an RSS feed within 10 minutes!


Wouldn't it be funny if you added an RSS entry that linked to an otherwise unpublished page with the url path /i-am-a-bot-that-posts-to-hackernews and a matching title to go with it?



Haha that's brilliant! Nice touch with the recursive link.


> Yes, there's definitely a script at work.

Presumably to farm karma for some other purpose.

I once observed that everything that shows up here eventually shows up in a particular cluster of subreddits and vice versa. Usually with a lag of 24-48 hours.

I half-jokingly floated the idea that one could write a karma arbitrageur bot which cross posts between HN and those subreddits.

I was told that I would be banned. Not just the bot: me.


The submissions [0] appear to be scripted. Most of those appear to be from a select few domains.

Your domain seems to have been added recently [1]. Congratulations!

So whether you choose to blog about tarsnap or anything else, chances are that it'd be posted to HN before you're able to.

[0] https://news.ycombinator.com/submitted?id=dwaxe

[1] https://news.ycombinator.com/from?site=daemonology.net


Suggestion: post before tweeting.

And to your question, dwaxe is not a bot (there are comments associated with the account too), and this has happened before (apparently a lightning fast submitter):

https://news.ycombinator.com/item?id=12218797

Of course he/she could still be running a script.


> Suggestion: post before tweeting.

I'll probably do that next time. I didn't think it would matter! (And really it doesn't -- it's not as if I need the karma points.)


> And really it doesn't -- it's not as if I need the karma points.

It's the principle of the matter! Does HN allow bot submissions?


Or better: assuming the URI is known before posting (i.e., it's not randomly generated on submission and doesn't encapsulate an absurdly precise submission time), just post it to HN a fraction of a second before it's live. Nothing any script can do to beat you then.


HN doesn't check submissions by fetching them so that would work.


> OK, dwaxe, I have to ask: Are you a robot?

Cyborg. Story submission is clearly robotic, but there are a lot of charmingly humanesque entries under dwaxe's comments:

https://news.ycombinator.com/threads?id=dwaxe


Assuming dwaxe replies claiming to not be a robot, how would we go about verifying? :P


We could administer it a test of some sort, and evaluate the ensuing conversation to see whether it's convincingly human. I think it'd be fitting to name this test in honor of some forefather of computer science.


I heard robots float on water


What also floats on water?


Ducks!


Did I mention that dwaxe turned me into a newt?

(I got better.)


We could compare the time it took to submit (seconds) to the time it took to comment (3 hours and counting...)


Could be using something like IFTTT with a simple "RSS to post immediately on forum" rule.


Most likely dwaxe is a human using scripts to boost his karma. Almost like a sportsman using performance-enhancing drugs.

Side question: I wonder when an AI will pass the Hacker News Turing test, so a bot could trick us into thinking it's human through HN comments.


Probably right time right place.


The number of people in this thread not merely nodding their heads and mmhmm-ing (or the internet equivalent) is a concern.



