Message ID: 20250128011023.55012-1-slava@dubeyko.com
State: New
Series: ceph: is_root_ceph_dentry() cleanup
On Tue, 2025-01-28 at 03:07 +0000, Al Viro wrote:
> On Mon, Jan 27, 2025 at 05:10:23PM -0800, Viacheslav Dubeyko wrote:
> > From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> >
> > This patch introduces CEPH_HIDDEN_DIR_NAME. It
> > declares the name of the hidden directory .ceph in
> > include/linux/ceph/ceph_fs.h instead of hiding
> > it in dir.c. Also, the hardcoded length of the name
> > is changed to strlen(CEPH_HIDDEN_DIR_NAME).
>
> Hmm...
>
> Speaking of that area
> 	* how the hell could ceph_lookup() ever be called with dentry
> that is *NOT* negative? VFS certainly won't do that; I'm not sure about
> ceph_handle_notrace_create(), but it doesn't look like that's possible
> without server being malicious (if it's possible at all).
>
> 	* speaking of malicious servers, what happens if
> it gets CEPH_MDS_OP_LOOKUP and it returns a normal reply to positive
> lookup, but with cpu_to_le32(-ENOENT) shoved into head->result?
> AFAICS, ceph_handle_snapdir() will be called with dentry
> that is already made positive; results will not be pretty...

I assume that you imply this code:

	/* can we conclude ENOENT locally? */
	if (d_really_is_negative(dentry)) {
		struct ceph_inode_info *ci = ceph_inode(dir);
		struct ceph_dentry_info *di = ceph_dentry(dentry);

		spin_lock(&ci->i_ceph_lock);
		doutc(cl, " dir %llx.%llx flags are 0x%lx\n",
		      ceph_vinop(dir), ci->i_ceph_flags);
		if (strncmp(dentry->d_name.name,
			    fsc->mount_options->snapdir_name,
			    dentry->d_name.len) &&
		    !is_root_ceph_dentry(dir, dentry) &&
		    ceph_test_mount_opt(fsc, DCACHE) &&
		    __ceph_dir_is_complete(ci) &&
		    __ceph_caps_issued_mask_metric(ci, CEPH_CAP_FILE_SHARED, 1)) {
			__ceph_touch_fmode(ci, mdsc, CEPH_FILE_MODE_RD);
			spin_unlock(&ci->i_ceph_lock);
			doutc(cl, " dir %llx.%llx complete, -ENOENT\n",
			      ceph_vinop(dir));
			d_add(dentry, NULL);
			di->lease_shared_gen = atomic_read(&ci->i_shared_gen);
			return NULL;
		}
		spin_unlock(&ci->i_ceph_lock);
	}

Am I correct? So, how can we rework this code if it's wrong? What is your
vision? Do you mean that it's dead code? How can we check it?

Thanks,
Slava.
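[For context, a condensed sketch of the reply path in ceph_lookup() that the
second question is about; simplified from fs/ceph/dir.c, not the verbatim
source. err is derived from head->result, which the server controls; if the
same reply also carries a normal positive trace, ceph_fill_trace() has
already made the dentry positive by the time ceph_handle_snapdir() runs.]

	err = ceph_mdsc_do_request(mdsc, NULL, req);	/* err <- le32 head->result */
	if (err == -ENOENT) {
		struct dentry *res;

		/* called on a dentry a malicious server may have made positive */
		res = ceph_handle_snapdir(req, dentry);
		if (IS_ERR(res)) {
			err = PTR_ERR(res);
		} else {
			dentry = res;
			err = 0;
		}
	}
	dentry = ceph_finish_lookup(req, dentry, err);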
On Tue, Feb 11, 2025 at 12:08:21AM +0000, Viacheslav Dubeyko wrote:
> In general, the MDS can issue NULL dentries to clients so that they "know" the
> direntry does not exist without having some capability (or lease) associated
> with it. As far as I can see, if an application repeatedly does stat of a
> file, then the kernel driver isn't repeatedly going out to the MDS to look up
> that file. So, I assume that this is the goal of this check and logic.

Er... On repeated stat(2) you will get ->lookup() called the first time
around; afterwards you'll be getting the dentry from dcache lookup, with
->d_revalidate() each time you find it in dcache, and eviction if
->d_revalidate() says it's stale. In which case a new dentry will be
allocated and fed to ->lookup(), passed to it in negative-unhashed state...
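[For illustration, a minimal userspace loop that exercises exactly this
pattern; the mount path is hypothetical. Only the first stat(2) should reach
ceph_lookup(); later passes are satisfied from the dcache via
->d_revalidate(), with no MDS round trip unless the dentry goes stale.]

	#include <sys/stat.h>
	#include <stdio.h>

	int main(void)
	{
		struct stat st;

		/* pass 0: ->lookup(); passes 1..2: dcache hit + ->d_revalidate() */
		for (int i = 0; i < 3; i++) {
			if (stat("/mnt/cephfs/somefile", &st) == 0)	/* hypothetical CephFS path */
				printf("pass %d: ino=%llu\n", i,
				       (unsigned long long)st.st_ino);
			else
				perror("stat");
		}
		return 0;
	}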
On Tue, Feb 11, 2025 at 06:01:21PM +0000, Viacheslav Dubeyko wrote:
> After some consideration, I believe we can follow such simple logic.
> Correct me if I'm wrong here. The ceph_lookup() method's responsibility is
> to "look up a single dir entry". It sounds to me that if we have a positive
> dentry, then it doesn't make sense to call ceph_lookup(). And if
> ceph_lookup() has been called for a positive dentry, then something wrong is
> happening.

VFS never calls it that way; ceph itself might, if ceph_handle_notrace_create()
is called with a positive dentry.

> But all this logic is not about the negative dentry, it's about a local
> check before sending a request to the MDS server. So, I think we need to
> change the logic in this way:
>
> if (<we can check locally>) {
>     <do check locally>
>     if (-ENOENT)
>         return NULL;
> } else {
>     <send request to MDS server>
> }
>
> Am I right here? :) Let me change the logic in this way and test it.

First of all, returning NULL does *not* mean "it's negative";
d_add(dentry, NULL) does that. What would "we can check locally" be?
AFAICS, the checks in __ceph_dir_is_complete() and near it are doing just
that...

The really unpleasant question is whether ceph_handle_notrace_create() could
end up feeding an already-positive dentry to a direct call of ceph_lookup()...
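[A minimal sketch of the d_add() point above; can_conclude_enoent_locally()
is a hypothetical helper standing in for the snapdir/DCACHE/
__ceph_dir_is_complete()/caps checks quoted earlier, not a proposed patch.]

	if (d_really_is_negative(dentry)) {
		if (can_conclude_enoent_locally(dir, dentry)) {	/* hypothetical helper */
			d_add(dentry, NULL);	/* THIS instantiates the negative dentry */
			return NULL;		/* the NULL return alone does not */
		}
	}
	/* otherwise fall through: build a CEPH_MDS_OP_LOOKUP request
	 * and send it to the MDS */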
On Tue, Feb 11, 2025 at 07:32:16PM +0000, Viacheslav Dubeyko wrote:
> > The really unpleasant question is whether ceph_handle_notrace_create() could
> > end up feeding an already-positive dentry to a direct call of ceph_lookup()...
>
> We have a ceph_handle_notrace_create() call in several methods:
> (1) ceph_mknod()
> (2) ceph_symlink()
> (3) ceph_mkdir()
> (4) ceph_atomic_open()
>
> Every time we create the object first and then call
> ceph_handle_notrace_create() if creation was successful. We have such a
> pattern:
>
> req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS);
>
> <skipped>
>
> err = ceph_mdsc_do_request(mdsc, dir, req);
> if (!err && !req->r_reply_info.head->is_dentry)
>     err = ceph_handle_notrace_create(dir, dentry);
>
> And ceph_lookup() has such logic:
>
> if (d_really_is_negative(dentry)) {
>     <execute logic>
>     if (-ENOENT)
>         return NULL;
> }
>
> req = ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS);
>
> <skipped>
>
> err = ceph_mdsc_do_request(mdsc, NULL, req);
>
> So, we have two different types of requests here:
> (1) ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_MKNOD, USE_AUTH_MDS)
> (2) ceph_mdsc_create_request(mdsc, op, USE_ANY_MDS)
>
> The first request creates an object on the MDS side and the second one
> checks that this object exists on the MDS side by lookup. I assume that the
> first request simply reports success or failure of object creation. And only
> the second one can extract metadata from the MDS side.

If only... The first request may return that metadata, in which case we'll
get splice_dentry() called by ceph_fill_trace() when the reply arrives.

Note that the calls of ceph_handle_notrace_create() are conditional. The
intent is as you've described, and normally that's how it works, but that
assumes a sane reply. Which is not a good assumption to make, when it comes
to memory corruption on the client...

I _think_ the conditions are sufficient. There are 4 callers of
ceph_handle_notrace_create(). All of them require is_dentry in the reply to
be false; the one in ceph_mkdir() requires both is_dentry and is_target to
be false.

ceph_fill_trace() does not call splice_dentry() when both is_dentry and
is_target are false, so we are only interested in the mknod/symlink/
atomic_open callers. Past the handling of the !is_dentry && !is_target case,
ceph_fill_trace() has a large if (is_dentry); not a concern for us. Next
comes if (is_target), where we deal with ceph_fill_inode(); no calls of
splice_dentry() in it. After that we have

	if (is_dentry && ....) {
		... splice_dentry() call possible here ...
	} else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		    req->r_op == CEPH_MDS_OP_MKSNAP) &&
		   test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags) &&
		   !test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags)) {
		... splice_dentry() call possible here ...
	} else if (is_dentry && ...) {
		...
	}

Since we only care about the !is_dentry case, that leaves the middle part.
LOOKUPSNAP can come from ceph_lookup() or ceph_d_revalidate(), neither of
which calls ceph_handle_notrace_create(). MKSNAP comes only from
ceph_mkdir(), which _does_ call ceph_handle_notrace_create(), but only in
the !is_dentry && !is_target case, so that call of splice_dentry() is not
going to happen there.

*IF* the analysis above is correct, we can rely upon ceph_lookup() always
getting a negative dentry, and that if (d_really_is_negative(dentry)) is
always taken. But I could be missing something subtle in there ;-/
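[For reference, a condensed sketch of the caller-side conditions discussed
above, simplified from fs/ceph/dir.c and fs/ceph/file.c rather than quoted
verbatim; it shows why ceph_handle_notrace_create() only runs when the
reply's is_dentry flag is false, and why ceph_mkdir() is stricter.]

	/* ceph_mknod() / ceph_symlink() / ceph_atomic_open() pattern: */
	err = ceph_mdsc_do_request(mdsc, dir, req);
	if (!err && !req->r_reply_info.head->is_dentry)
		err = ceph_handle_notrace_create(dir, dentry);

	/* ceph_mkdir(): both trace flags must be false */
	err = ceph_mdsc_do_request(mdsc, dir, req);
	if (!err &&
	    !req->r_reply_info.head->is_target &&
	    !req->r_reply_info.head->is_dentry)
		err = ceph_handle_notrace_create(dir, dentry);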
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0bf388e07a02..5151c614b5cb 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -782,10 +782,16 @@ struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
 	return dentry;
 }
 
+static inline
+bool is_hidden_ceph_dir(struct dentry *dentry)
+{
+	size_t len = strlen(CEPH_HIDDEN_DIR_NAME);
+	return strncmp(dentry->d_name.name, CEPH_HIDDEN_DIR_NAME, len) == 0;
+}
+
 static bool is_root_ceph_dentry(struct inode *inode, struct dentry *dentry)
 {
-	return ceph_ino(inode) == CEPH_INO_ROOT &&
-	       strncmp(dentry->d_name.name, ".ceph", 5) == 0;
+	return ceph_ino(inode) == CEPH_INO_ROOT && is_hidden_ceph_dir(dentry);
 }
 
 /*
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 2d7d86f0290d..84a1391aab29 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -31,6 +31,8 @@
 #define CEPH_INO_CEPH             2  /* hidden .ceph dir */
 #define CEPH_INO_GLOBAL_SNAPREALM 3  /* global dummy snaprealm */
 
+#define CEPH_HIDDEN_DIR_NAME ".ceph"
+
 /* arbitrary limit on max # of monitors (cluster of 3 is typical) */
 #define CEPH_MAX_MON   31
 
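[A standalone userspace illustration of the comparison the new helper
performs; is_hidden_name() is a hypothetical mirror of is_hidden_ceph_dir()
minus the struct dentry plumbing. Like the previous hardcoded
strncmp(..., ".ceph", 5), it is a prefix match over
strlen(CEPH_HIDDEN_DIR_NAME) bytes, so the patch preserves the old
behaviour exactly.]

	#include <stdio.h>
	#include <string.h>

	#define CEPH_HIDDEN_DIR_NAME ".ceph"

	static int is_hidden_name(const char *name)
	{
		size_t len = strlen(CEPH_HIDDEN_DIR_NAME);

		return strncmp(name, CEPH_HIDDEN_DIR_NAME, len) == 0;
	}

	int main(void)
	{
		printf("%d\n", is_hidden_name(".ceph"));	/* 1 */
		printf("%d\n", is_hidden_name(".cephX"));	/* 1: prefix match, same as old code */
		printf("%d\n", is_hidden_name("ceph"));		/* 0 */
		return 0;
	}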