Message ID | cover.1620641727.git.mchehab+huawei@kernel.org |
---|---|
Series | Get rid of UTF-8 chars that can be mapped as ASCII |
On Mon, 10 May 2021 15:33:47 +0100, Edward Cree <ecree.xilinx@gmail.com> wrote:

> On 10/05/2021 14:59, Matthew Wilcox wrote:
> > Most of these
> > UTF-8 characters come from latex conversions and really aren't
> > necessary (and are being used incorrectly).
> I fully agree with fixing those.
> The cover-letter, however, gave the impression that that was not the
> main purpose of this series; just, perhaps, a happy side-effect.

Sorry for the mess. The main reason why I wrote this series is that there are lots of UTF-8 left-over chars from the ReST conversion. See:

  - https://lore.kernel.org/linux-doc/20210507100435.3095f924@coco.lan/

A large set of the UTF-8 left-over chars were due to my conversion work, so I feel personally responsible for fixing those ;-)

Yet, this series has two positive side effects:

  - it helps people who need to touch the documents using non-UTF-8 locales[1];
  - it makes it easier to grep for text;

[1] There are still some widely used distros nowadays (LTS ones?) that don't set UTF-8 as default. Last time I installed a Debian machine I had to explicitly set the UTF-8 charset after install, as the default was an ASCII encoding (can't remember if it was Debian 10 or an older version).

Unintentionally, I ended up emphasizing the non-UTF-8 locales instead of the conversion left-overs.

FYI, this patch series originated from a discussion at linux-doc, reporting that Sphinx breaks when LANG is not set to utf-8[2]. That's probably why I ended up giving the wrong emphasis in the cover letter.

[2] See https://lore.kernel.org/linux-doc/20210506103913.GE6564@kitsune.suse.cz/ for the original report. I strongly suspect that the VM set up by Michal to build the docs was using a distro that doesn't set UTF-8 as default.

PS.: I intend to prepare afterwards a separate fix to keep the Sphinx logger from crashing during Kernel doc builds when the locale charset is not UTF-8, but I'm not too fluent in python.
So, I need some time to check whether there is a way to just avoid the python log crashes without touching Sphinx code and without needing to trick it into thinking that the machine's locale is UTF-8.

See: while there was just a single document originally stored in the Kernel tree as a LaTeX document at the time we did the conversion (cdrom-standard.tex), there are several other documents stored as text that seem to have been generated by some tool like LaTeX, whose original versions were not preserved.

Also, there were other documents using different markdown dialects that were converted via pandoc (and/or other similar tools). That's not to mention the ones that were converted from DocBook.

Such tools tend to use some logic to produce "neat" versions of some ASCII characters, like what this tool does:

	https://daringfireball.net/projects/smartypants/

(Sphinx itself seemed to use this tool in its early versions)

All tool-converted documents can carry UTF-8 in unexpected places. See, in this series, a large number of patches deal with U+A0 (NO-BREAK SPACE) chars. I can't see why someone writing a plain text document (or a ReST one) would type a NO-BREAK SPACE instead of a normal white space.

The same applies, to some extent, to curly quotes: usually people just write ASCII "quotes" in their documents, and use some tool like LaTeX or a text editor like libreoffice to convert them into “UTF-8 curly quotes”[3].

[3] Sphinx will do such things in the produced output, doing something similar to what smartypants does, nowadays using this:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

    E. g.:
      - Straight quotes (" and ') turned into "curly" quote characters;
      - dashes (-- and ---) turned into en- and em-dash entities;
      - three consecutive dots (... or . . .) turned into an ellipsis char.

> > You seem quite knowledgeable about the various differences.
> > Perhaps you'd be willing to write a document for Documentation/doc-guide/
> > that provides guidance for when to use which kinds of horizontal
> > line?
> I have Opinions about the proper usage of punctuation, but I also know
> that other people have differing opinions. For instance, I place
> spaces around an em dash, which is nonstandard according to most
> style guides. Really this is an individual enough thing that I'm not
> sure we could have a "kernel style guide" that would be more useful
> than general-purpose guidance like the page you linked.
> Moreover, such a guide could make non-native speakers needlessly
> self-conscious about their writing and discourage them from contributing
> documentation at all.

I don't think so. As a matter of fact, as a non-native speaker, I guess this can actually help people willing to write documents.

> I'm not advocating here for trying to push
> kernel developers towards an eats-shoots-and-leaves level of
> linguistic pedantry; rather, I merely think that existing correct
> usages should be left intact (and therefore, excising incorrect usage
> should only be attempted by someone with both the expertise and time
> to check each case).
>
> But if you really want such a doc I wouldn't mind contributing to it.

IMO, a document like that can be helpful. I can help review it.

Thanks,
Mauro
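To make the tool-generated substitutions discussed in this message concrete, here is a minimal Python sketch that maps a few of them back to ASCII. The table `ASCII_MAP` and the function `to_ascii` are hypothetical names chosen for this example; this is not the script used to produce the series. Dashes are left out on purpose, since their replacement is disputed in this thread.

```python
# Hypothetical mapping, for illustration only: a few characters that
# conversion tools tend to emit, and an ASCII equivalent for each.
ASCII_MAP = {
    "\u00a0": " ",   # NO-BREAK SPACE -> regular space
    "\ufeff": "",    # ZERO WIDTH NO-BREAK SPACE (BOM) -> dropped
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(ASCII_MAP)


def to_ascii(text: str) -> str:
    """Replace the characters listed in ASCII_MAP; leave everything else alone."""
    return text.translate(_TABLE)
```

Anything not in the table, including EM DASH and EN DASH, passes through unchanged, so deliberate uses would still have to be reviewed by hand.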
On Mon, 10 May 2021 14:49:44 +0100, David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8.
> >
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> >
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
>
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.
>
> Perhaps stick to those two, and any cases where an emdash or endash has
> been used where U+002D HYPHEN-MINUS *should* have been used.

Ok. I'll rework the series excluding EM/EN DASH chars from it. I'll then manually apply the changes for EM/EN DASH chars (probably in a separate series) where it seems to fit. That should make it easier to discuss such replacements.

> And please fix your cover letter which made no reference to 'grep', and
> only presented a completely bogus argument for the change instead.

OK!

Regards,
Mauro
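To illustrate the grep point: a search string typed with a regular space does not match text that contains NO-BREAK SPACE. The Python sketch below shows the mismatch and reports where the two characters named in the quoted text occur. `find_suspects` and the sample string are hypothetical, made up for this example, and are not part of the series.

```python
import unicodedata

# The two characters named in the quoted text above.
SUSPECTS = {"\u00a0", "\ufeff"}


def find_suspects(text: str):
    """Yield (line_number, column, character_name) for each suspect char."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                yield lineno, col, unicodedata.name(ch)


# Made-up sample: a NO-BREAK SPACE on line 1, an EM DASH on line 2.
sample = "kernel\u00a0docs\nplain text \u2014 with a dash\n"

# A plain substring search, like grep with a typed space, misses line 1.
print("kernel docs" in sample)

# Only the NO-BREAK SPACE is reported; the EM DASH is not in SUSPECTS.
for hit in find_suspects(sample):
    print(hit)
```

Keeping the dash characters out of `SUSPECTS` mirrors the split agreed above: the invisible characters can be handled mechanically, while dashes need a case-by-case look.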