Launchpad Entry: https://blueprints.launchpad.net/ubuntu/+spec/reliable-nss-caching
We need a reliable mechanism to dynamically cache NSS entries that scales and remains manageable with a very large number of clients using network directories with a large number of entries (more than 100K). The goal is to improve resilience and performance, and to allow disconnected operation. Various approaches are possible, but traditional NSS caching mechanisms, such as nscd and nss-updatedb, either fail or do not fulfill all aspects of the use case.
Ubuntu is being considered as a desktop operating system in many corporate deployments, including some very large public administrations. Use of a network directory (LDAP, NIS, etc.) for storing users and groups is a given in such organizations.
Inevitably, some of these users will be using laptops and as such are expected to continue working when disconnected. While it is always possible to create local users and groups on these laptops for disconnected operation, it is rightly frowned upon as a management-heavy solution.
Even for fixed workstations, the possibility that the network directory becomes temporarily unavailable (network outage, directory server failure, etc.) has to be accounted for. As such, the notion of disconnected operation can be extended to include resilience to network directory failure.
The NSS framework was written with the assumption that database lookups would be cheap. As such, applications tend to abuse lookups and make liberal use of functions that enumerate an entire database, such as initgroups() and getgrouplist(). When these calls have to walk through or return thousands of entries, the assumption breaks and performance degrades severely. Caching helps alleviate the load on the network directory infrastructure and reduces directory lookup lag on the client.
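To put a number on that lookup lag for a given client, a minimal sketch along the following lines can time a single getgrouplist() call through whatever NSS configuration is active. The helper name time_group_lookup is hypothetical; it only wraps Python's standard pwd and os modules and is meant as a measurement aid, not a benchmark suite.

```python
import os
import pwd
import time


def time_group_lookup(username):
    """Time one getgrouplist() call for `username`.

    Returns (elapsed_seconds, number_of_groups). The call goes through
    the NSS configuration of the machine it runs on, so the result
    reflects the local files, LDAP, NIS, or cache actually in use.
    """
    entry = pwd.getpwnam(username)          # one passwd lookup
    start = time.perf_counter()
    groups = os.getgrouplist(username, entry.pw_gid)
    elapsed = time.perf_counter() - start
    return elapsed, len(groups)


if __name__ == "__main__":
    name = pwd.getpwuid(os.getuid()).pw_name
    seconds, count = time_group_lookup(name)
    print(f"{name}: {count} groups in {seconds * 1000:.1f} ms")
```

Running it with and without a cache in place, against the same directory, gives a rough before/after comparison for one client.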
Microsoft Windows, when used in an Active Directory setting, allows completely seamless disconnected operation for laptops. For most organizations to consider Ubuntu as a viable alternative, we need feature parity.
Any organization with more than 10,000 PCs (usually made up of a combination of desktops and laptops), and a corresponding number of network directory entries (users and groups).
To be discussed. I am looking for insights on how this could be implemented in a reliable, manageable and straightforward fashion.
One of the thorny issues is disconnected operation. When disconnected, lookup and enumeration should not block but instead return immediately with whatever entries have been cached, even if incomplete.
At this point, it appears to me that any solution we adopt needs to be aware of the status of the network directory (reachable, responsive, etc.) to decide whether NSS lookup and enumeration should be served from the cache, the directory, or a combination of both. How to achieve that remains to be determined.
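One possible shape for such status-aware behaviour is sketched below. It is only an illustration of the idea, not a proposed design: the names CachedDirectory, DirectoryUnavailable, backend_lookup and retry_after are all hypothetical, and the backend is assumed to raise DirectoryUnavailable when it cannot answer within its own timeout.

```python
import time


class DirectoryUnavailable(Exception):
    """Raised by the (hypothetical) backend when the directory cannot answer."""


class CachedDirectory:
    """Serve lookups from the directory when it answers, else from the cache."""

    def __init__(self, backend_lookup, retry_after=30.0, clock=time.monotonic):
        self._backend_lookup = backend_lookup  # callable: name -> entry or None
        self._retry_after = retry_after        # seconds to stay offline after a failure
        self._clock = clock
        self._cache = {}
        self._offline_until = 0.0

    def _directory_usable(self):
        return self._clock() >= self._offline_until

    def lookup(self, name):
        if self._directory_usable():
            try:
                entry = self._backend_lookup(name)
            except DirectoryUnavailable:
                # Remember the failure so later calls do not block again.
                self._offline_until = self._clock() + self._retry_after
            else:
                if entry is None:
                    self._cache.pop(name, None)  # entry was removed upstream
                else:
                    self._cache[name] = entry
                return entry
        # Directory unavailable: answer immediately from the cache.
        return self._cache.get(name)

    def enumerate(self):
        """Return whatever has been cached so far, even if incomplete."""
        return dict(self._cache)
```

In this sketch only the first call after a failure pays the backend timeout; every later call within retry_after seconds is answered from the cache without touching the network.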
Current implementations of NSS caching on which we could base a final solution, or from which we could borrow ideas:
- nscd: The traditional transparent caching mechanism. Unfortunately, it is of no help with database enumeration (e.g., getgrouplist()), only entry lookup. Anecdotally, it has also proven somewhat unreliable, especially when used with libnss-ldap. We may also look at the BusyBox implementation of nscd, unscd (http://busybox.net/~vda/unscd/).
- nss-updatedb: Keeps a local NSS cache in DB format. Cache synchronisation can be automated (e.g., from cron once a day). Works fairly well for small NSS databases, but does not cache dynamically and does not scale to large deployments.
- nss-ldapd: An interesting approach that splits the work of NSS lookup between a lightweight library front-end and a daemon back-end that actually manages the connection to the directory server. Not sure how, or if, it handles caching and disconnected operation, but it could probably be extended. LDAP-only. Documented at http://ch.tudelft.nl/~arthur/nss-ldapd/design.html.
- The NSS slapd overlay: Introduced in Intrepid, this is an overlay to slapd more or less based on nss-ldapd above, briefly described at http://firstname.lastname@example.org/msg02792.html. LDAP-only.
- winbind: Does NSS caching (and Likewise Open also does disconnected operation), but sadly only for Active Directory and Windows domains.
Whether the approach could be LDAP-specific or generic to any NSS backend remains to be discussed. A solution that could handle NIS would be preferable, although not at all costs.
Failure modes to test possible solutions against:
- No network connectivity
- Network connectivity, but network directory not responsive (timeout)
- Network directory name lookup failure
- Slow network directory lookup
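As a rough aid for telling these failure modes apart during testing, the sketch below classifies a plain TCP connection attempt to the directory server. It is an assumption-laden illustration: probe_directory, the returned labels, the default port 389 (LDAP) and the slow_after threshold are all hypothetical choices, and a successful TCP connect does not prove that the directory service itself is answering queries.

```python
import socket
import time


def probe_directory(host, port=389, timeout=3.0, slow_after=0.5,
                    connect=socket.create_connection):
    """Classify one connection attempt to the directory server.

    Returns one of: "ok", "slow", "name-lookup-failure",
    "unresponsive", "no-connectivity".
    """
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout=timeout)
    except socket.gaierror:
        # The server name could not be resolved.
        return "name-lookup-failure"
    except socket.timeout:
        # Network path exists but nothing answered within `timeout`.
        return "unresponsive"
    except OSError:
        # No route, interface down, or (in this sketch) connection refused.
        return "no-connectivity"
    conn.close()
    elapsed = time.monotonic() - start
    return "slow" if elapsed > slow_after else "ok"
```

The connect parameter exists only so the classifier can be exercised with a fake connection function, without a real network.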
None at this time. The first step is to build a network directory with a large number of objects (100K users and 10K groups) to help with testing, benchmarking and profiling possible solutions.
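For an LDAP-based test directory, the entries could be generated as LDIF with a short script such as the sketch below. Everything specific in it is an assumption made for illustration: the dc=example,dc=com suffix, the ou=People and ou=Groups containers (which must already exist), the naming scheme, the starting ID of 10000, and the round-robin assignment of users to groups.

```python
import sys

BASE_DN = "dc=example,dc=com"   # assumed directory suffix
FIRST_ID = 10000                # assumed first uidNumber / gidNumber


def user_ldif(index, groups):
    """Return one posixAccount entry; primary group assigned round-robin."""
    uid = f"user{index:06d}"
    return (
        f"dn: uid={uid},ou=People,{BASE_DN}\n"
        "objectClass: inetOrgPerson\n"
        "objectClass: posixAccount\n"
        f"uid: {uid}\n"
        f"cn: {uid}\n"
        f"sn: {uid}\n"
        f"uidNumber: {FIRST_ID + index}\n"
        f"gidNumber: {FIRST_ID + index % groups}\n"
        f"homeDirectory: /home/{uid}\n"
        "\n"
    )


def group_ldif(index, users, groups):
    """Return one posixGroup entry listing every user assigned to it."""
    lines = [
        f"dn: cn=group{index:05d},ou=Groups,{BASE_DN}",
        "objectClass: posixGroup",
        f"cn: group{index:05d}",
        f"gidNumber: {FIRST_ID + index}",
    ]
    lines += [f"memberUid: user{u:06d}" for u in range(index, users, groups)]
    return "\n".join(lines) + "\n\n"


def write_ldif(out, users=100_000, groups=10_000):
    """Write all user entries, then all group entries, to `out`."""
    for i in range(users):
        out.write(user_ldif(i, groups))
    for g in range(groups):
        out.write(group_ldif(g, users, groups))


if __name__ == "__main__":
    write_ldif(sys.stdout)
```

With the default sizes each group ends up with ten members; the output can be redirected to a file and loaded into the test server with the usual LDAP import tools.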
BoF agenda and discussion