Collecting statistics was brought up on the mailing list earlier, primarily to track successful flashes/upgrades per board/device. I've had some thoughts of my own about collecting statistics, for other purposes that do not conflict with tracking successful flashes. Communities tend to generate a lot of popularity questions, and seeing countless threads asking which device is the most popular (in both the LEDE and OpenWrt forums over time) led me to this. I'll start with two polls covering the basic questions, and elaborate on my further thoughts afterwards.
- LEDE should collect statistics/telemetry
- LEDE should not collect statistics/telemetry at all
0 voters
- LEDE should collect statistics/telemetry by default
- LEDE should not collect any statistics/telemetry by default
0 voters
What data to collect and why
This is a list of data points I have considered and what makes them relevant. It is not meant to be exhaustive.
- Board/device name and release/revision information - Tracks successful software/hardware combinations and shows which devices are most used; could also track use of community builds if they identify themselves
- Whether the image is dirty (LEDE source has been modified, look for rXXXX**+Y**) - Closely related to the previous point: does the device run vanilla LEDE?
- List of installed packages and potentially versions - Show popular packages, discover widespread use of outdated packages, discover little-used packages
- From package list, create flags/points based on packages of particular interest - Show which wireless drivers are popular (e.g. kmod-ath9k, kmod-mt76...), LuCI or no LuCI etc.
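A rough sketch of how the dirty check and the package flags could be derived from a report. It is written in Python for illustration only; the exact revision format (beyond the rXXXX+Y hint above), the flag names, and the sample revision strings are my assumptions.

```python
import re

# Assumed revision shape: r<commits>[+<local commits>][-<hash>].
# Only the "+Y" part comes from the list above; the rest is a guess.
REVISION_RE = re.compile(r"r(\d+)(?:\+(\d+))?(?:-([0-9a-f]+))?")

# Hypothetical mapping from packages of interest to report flags.
FLAG_PACKAGES = {
    "kmod-ath9k": "wifi_ath9k",
    "kmod-mt76": "wifi_mt76",
    "luci": "luci",
}


def is_dirty(revision):
    """Return True for a +Y revision, False for a clean one, None if unparseable."""
    match = REVISION_RE.fullmatch(revision)
    if match is None:
        return None
    return match.group(2) is not None


def package_flags(packages):
    """Return the sorted flags set by the packages present in the list."""
    return sorted({FLAG_PACKAGES[p] for p in packages if p in FLAG_PACKAGES})


if __name__ == "__main__":
    # Made-up revision strings and package list, for illustration.
    print(is_dirty("r1234+5-0123abcd"))
    print(is_dirty("r1234-0123abcd"))
    print(package_flags(["busybox", "kmod-ath9k", "luci"]))
```

Returning None for unparseable revisions keeps "unknown" separate from "clean", so garbage input does not inflate the vanilla count.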
How to collect
My idea is to use a script on LEDE devices that collects the data of interest and submits it to a server using HTTP POST. uclient-fetch is included and supports sending POST data, as well as HTTPS if one of the libustream variants is installed. A script on a web server receives this data and saves it in a database. Periodically, another script runs through the database and generates pretty web pages with aggregated statistics for users to browse, search and filter (my thought was that generating statistics on the fly per request could be fairly slow).
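To make the server side a bit more concrete, here is a minimal sketch of the "save in a database, aggregate periodically" part. Python and SQLite are used only for illustration; the table layout, column names and function names are hypothetical, and the HTTP handling itself is left out.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) reports table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports ("
        " device_id TEXT PRIMARY KEY,"
        " board TEXT NOT NULL,"
        " release_name TEXT NOT NULL,"
        " last_seen INTEGER NOT NULL)"
    )
    return conn


def save_report(conn, device_id, board, release_name, now):
    """Store one report; a repeat submission replaces the device's old row."""
    conn.execute(
        "INSERT OR REPLACE INTO reports"
        " (device_id, board, release_name, last_seen) VALUES (?, ?, ?, ?)",
        (device_id, board, release_name, now),
    )
    conn.commit()


def boards_by_popularity(conn):
    """Aggregate step: (board, device count) pairs, most common first."""
    rows = conn.execute(
        "SELECT board, COUNT(*) AS n FROM reports"
        " GROUP BY board ORDER BY n DESC, board ASC"
    )
    return rows.fetchall()
```

The periodic report script would call something like `boards_by_popularity()` and write the result out as static HTML, so page views never touch the database.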
Conceptual issues
There will be concerns about any kind of data collection or dialing home, so we need to consider what data we really want (if any), and inform users appropriately. We also need to decide whether this package should be included by default or be optional. Some filtering of data could also be useful. For example, the server could show only packages that exist in official feeds, so a custom package in someone's image would be ignored when collecting statistics.
Another issue is keeping the statistics relevant (in my mind the goal is up-to-date statistics, not historical statistics). My plan is to have the device generate a random device ID on first boot, and include this ID in each submission (let's say the default is to submit an updated report every 48 hours) so the server knows whether it's a new device or one that has submitted before. If a device ID has not submitted any reports for, say, 30 days, the data about it is deleted and no longer counted. As tying each submission to a unique ID will likely cause further concern from some users, I suggest some data points could be optional; whether they are opt-in or opt-out remains to be decided. The UCI configuration could look like this (the device has generated an ID, and the user does not want to submit the package list):
config statistics
option 'device_id' '5dcb530e6cb90b63c75ab8792a0176792bb1bae433312f6d7891b108653f7db0'
option 'submit_release' 1
option 'submit_revision' 1
option 'submit_board' 1
option 'submit_packages' 0
option 'submit_dirty' 1
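The way I picture the ID, the per-field switches and the 30-day expiry fitting together, sketched in Python for readability (the real client would be a shell script on the device). The function names and the in-memory dict are hypothetical; the option names follow the configuration above.

```python
import secrets

REPORT_TTL = 30 * 24 * 3600  # 30 days in seconds, as suggested above
FIELDS = ("release", "revision", "board", "packages", "dirty")


def new_device_id():
    """Random 64-character hex ID, same shape as the example device_id."""
    return secrets.token_hex(32)


def build_report(config, facts):
    """Client side: include only the fields whose submit_<field> option is 1."""
    report = {"device_id": config["device_id"]}
    for field in FIELDS:
        if str(config.get("submit_" + field, "0")) == "1" and field in facts:
            report[field] = facts[field]
    return report


def record_submission(last_seen, device_id, now):
    """Server side: note the submission time; return True for a new device."""
    is_new = device_id not in last_seen
    last_seen[device_id] = now
    return is_new


def prune_stale(last_seen, now, ttl=REPORT_TTL):
    """Server side: forget devices silent for longer than ttl; return count."""
    stale = [dev for dev, seen in last_seen.items() if now - seen > ttl]
    for dev in stale:
        del last_seen[dev]
    return len(stale)
```

In this sketch a missing submit_ option counts as "do not submit", which amounts to opt-in; flipping the default in `build_report()` would make it opt-out.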
Reliability is another issue - there is little that stops spammers or trolls from filling this database with garbage data. Rate limiting submissions based on IP addresses and accepting only expected input (release names that are real, revision numbers that look like actual revision numbers etc.) are measures I can think of.
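Both measures could be sketched roughly like this in Python. The allow-list of release names, the accepted revision pattern and the limiter class are all assumptions for illustration, not a worked-out design.

```python
import re
from collections import defaultdict, deque

# Hypothetical allow-list; a real server would load the actual release names.
KNOWN_RELEASES = {"17.01.0", "snapshot"}
DEVICE_ID_RE = re.compile(r"[0-9a-f]{64}")
REVISION_RE = re.compile(r"r\d+(\+\d+)?(-[0-9a-f]+)?")  # assumed shape


def validate_report(report):
    """Return the names of fields that do not look like expected input."""
    errors = []
    if not DEVICE_ID_RE.fullmatch(str(report.get("device_id", ""))):
        errors.append("device_id")
    release = report.get("release")
    if release is not None and (
        not isinstance(release, str) or release not in KNOWN_RELEASES
    ):
        errors.append("release")
    revision = report.get("revision")
    if revision is not None and not REVISION_RE.fullmatch(str(revision)):
        errors.append("revision")
    return errors


class RateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def allow(self, ip, now):
        hits = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

This only raises the bar. Many devices can legitimately share one address behind NAT, so the per-IP limit should not be too strict.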
Implementation issues
I don't know if this kind of volume justifies using a message broker like RabbitMQ, or if that's overkill. My main concern, which a message broker would solve, is that the database could be overloaded by many simultaneous device reports if writing is a synchronous operation (the script writes to the database as soon as it receives data). The web server could probably handle it, but package lists in particular could lead to complex/large INSERT queries, depending on the data model.
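To show the decoupling idea without committing to a broker, here is a small Python sketch in which an in-process queue stands in for RabbitMQ and a writer drains it in batches. The table layout and function names are hypothetical.

```python
import queue
import sqlite3


def open_db(path=":memory:"):
    """Create a hypothetical one-row-per-(device, package) table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        " device_id TEXT NOT NULL,"
        " package TEXT NOT NULL,"
        " PRIMARY KEY (device_id, package))"
    )
    return conn


def drain(pending, max_items):
    """Take up to max_items queued reports without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    return batch


def write_batch(conn, batch):
    """Write a batch of (device_id, package list) reports in one transaction."""
    rows = [(device_id, pkg) for device_id, pkgs in batch for pkg in pkgs]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO packages (device_id, package) VALUES (?, ?)",
            rows,
        )
    return len(rows)
```

The receiving script would only put reports on the queue, and a separate writer would call `drain()` and `write_batch()` on a timer, so the database sees a few large transactions instead of one per request.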
What I could contribute
- Writing the client part (seems easy)
- Implementing the server-side parts (medium+ - I understand the logic, but will need time to get things right)
- Writing a script to generate simple HTML reports
- Wiki documentation
What I would need help with
- Making the browseable reports pretty and functional (e.g. filtering)
- Making a LuCI application
- Infrastructure - for development and testing, free AWS or other cloud offerings will work, but I don't have infrastructure to provide for long-term deployment