Mauro Santos via arch-general
2017-03-05 21:12:55 UTC
Hi,
I was recently contacted by a Polish researcher asking for a list of AUR
account names. I did not expect this to be controversial but a couple of
Trusted Users raised concerns on IRC, so I decided to move this to the
public mailing list and discuss the whole topic in generality. I would
like to hear more opinions but please read the whole email and give it a
second thought before simply bringing up the usual privacy arguments
mentioned below.
My original question was: Are we fine with sharing the list of AUR
account names (only user names, no real names or email addresses) with
a researcher who seems trustworthy and agrees to not share the data in
any form other than the resulting anonymized statistics?
In this particular case, we are talking about Dorota Celinska [1] from
the University of Warsaw, Faculty of Economic Sciences [2], see [3] for
a list of her publications and [4] for a summary of her research project
funded recently by the Polish National Science Centre. She needs the
list of user names to perform a segmentation analysis, including users
who were active on the older AUR releases but do not show any
activity on AUR 4. She would also like to use the user names as
identifiers to establish connections with other platforms, such as
GitHub.
The next question is: Would it make sense to even make this data
publicly available? Would it make sense to extend our RPC interface such
that one can search for user names? GitHub, for example, already
provides such an interface [5]. Let me quickly summarize some arguments:
* User names are mostly identifiers. It is questionable whether they
can/should be considered personal/private information. Maybe this can
only be answered by a lawyer, though.
* The user names of all accounts with any kind of public activity, like
uploading a package, filing a request, writing a comment, are public
already.
* After logging into the aurweb interface, you can already check whether
an account with a given user name exists because the account details
page URIs have the form https://aur.archlinux.org/account/$username.
This means that for any platform providing a list of user names (such
as GitHub), you can "establish connections" with the AUR already.
* Principle of data economy: We should not share any kind of information
we do not need to share.
* Sharing user names lowers the threshold for sharing other information
which is considered more confidential.
* Users can (and should) already use crawlers to fetch the user names.
For example, the user names of all package maintainers and comment
authors appear on the package details pages. The names of all users
filing package requests appear in the mailing list archives etc.
* We do not have a ToS, so we had better not share anything.
I, personally, find the second-to-last argument a very weak one. Telling
users to build crawlers scraping and brute-forcing our HTML pages makes
life difficult for both them and us. What do you think?
On the other side of the coin, the last argument is a very good one and
it brings me to my last point. Independently of the outcome of this
discussion, I think we should add some ToS that users need to agree to
when registering. It should contain information on liability and on
privacy. Is anybody willing to write a draft? Do we need the support of
a lawyer here?
Thank you for your time and have a nice Sunday!
Regards,
Lukas
[1] http://coin.wne.uw.edu.pl/dcelinska/en/
[2] https://www.wne.uw.edu.pl/index.php/en/
[3] http://coin.wne.uw.edu.pl/dcelinska/en/pages/publications.html
[4] https://ncn.gov.pl/sites/default/files/listy-rankingowe/2016-03-15/streszczenia/337724-en.pdf
[5] https://developer.github.com/v3/users/
I'd say err on the side of caution and don't share, even though the
usernames are public and easy to find by scraping them from the
website, mailing list, etc., handing over the whole database of usernames
on a silver platter is a whole different story, and that is what is being
asked.
Is there any community/website that provides a full list of registered
usernames on request?
There is also the question of how useful that data would be. Without any
other data, such as email addresses, the username list is useless: you
have no guarantee that user foo on GitHub is the same person as user foo
on the AUR/Wiki/Forum or user foo somewhere else. In this case I'd also
have to agree that sharing usernames lowers the threshold for sharing
other information.
It also doesn't fit with their stated research goals: only GitHub and
projects associated with scraping data from GitHub are mentioned. Why
would they want to throw the AUR usernames into the mix?
--
Mauro Santos