Note: I’m adding information to this question as I discover anything that seems to lead toward a resolution.
The Problem:
When pasting content from an external source into Gutenberg, some HTML/CSS formatting is lost.[1] Gutenberg retains most semantic HTML elements but drops inline CSS (styling, non-semantic) information. This means that properties such as font size, text alignment, text color, etc. are all removed during the paste event.
Not the Problem:
We could discuss the plugins, custom HTML blocks, etc. (e.g., Wordable, Jetpack) available for converting external content sources (e.g., Google Docs) to WP-friendly content, but this question is decidedly not about those solutions. Instead, this question is exclusively focused on how to programmatically alter Gutenberg’s paste handling behavior.
Seeing the Problem in Action
This problem occurs in many circumstances. For example, try pasting the following block of HTML into the paragraph block in Gutenberg:
<p style="color:red">Hello WordPress StackExchange!</p>
Then view the HTML for that paragraph block and you’ll see:
<p>Hello WordPress StackExchange!</p>
The style="color:red" attribute has been stripped out.
Looking at the Paragraph Block
One of the blocks that suffers from this stripping is the paragraph block (/gutenberg/packages/block-library/src/paragraph). This block[2] uses the RichText component (/gutenberg/packages/block-editor/rich-text) to implement its rich text editing functionality.
Looking at the RichText Component
In /rich-text/index.js we find the onPaste method, which the paragraph block picks up by using RichText. This method in turn calls the pasteHandler function (/gutenberg/packages/blocks/src/api/raw-handling/paste-handler.js).
Looking at the Paste Handler
The pasteHandler function “Converts an HTML string to known blocks. Strips everything else.” according to the JSDoc.
This function takes a single options object with five properties:
- HTML = The source content to convert, if in HTML format.
- plainText = The source content to convert, if in text format.
- mode = Whether to paste the content in as blocks or as inline content in the existing block.
- tagName = The tag we are inserting the content into.
- canUserUseUnfilteredHTML = Initially I thought this determined whether one could use any HTML/CSS one desired, but it appears to be more limited. AFAIK it only determines whether the iframeRemover function is run against the pasted content, which is only tangentially relevant.
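To make the role of canUserUseUnfilteredHTML concrete, here is a minimal sketch under stated assumptions. The describePaste function is hypothetical and does not call Gutenberg; the filter names are plain strings, and the default values shown are assumptions rather than something confirmed above. It only models the idea that the flag adds or omits the iframe filter and has no bearing on whether inline styles survive.

```javascript
// Hypothetical stand-in for illustration only; it does not call Gutenberg.
// Filter names are plain strings here, and the default values are assumptions.
function describePaste( {
    HTML = '',
    plainText = '',
    mode = 'AUTO',
    tagName,
    canUserUseUnfilteredHTML = false,
} ) {
    // Filters assumed to run regardless of the user's capabilities.
    const filters = [ 'phrasingContentReducer', 'figureContentReducer' ];
    if ( ! canUserUseUnfilteredHTML ) {
        // Models the `! canUserUseUnfilteredHTML` check in pasteHandler:
        // only the iframe filter depends on this flag.
        filters.unshift( 'iframeRemover' );
    }
    return { content: HTML || plainText, mode, tagName, filters };
}

const pastedHTML = '<p style="color:red">Hello WordPress StackExchange!</p>';
const restricted = describePaste( { HTML: pastedHTML, tagName: 'p' } );
const unrestricted = describePaste( {
    HTML: pastedHTML,
    tagName: 'p',
    canUserUseUnfilteredHTML: true,
} );

console.log( restricted.filters );
console.log( unrestricted.filters );
```

In this model the two filter lists differ only by the iframe entry, which is consistent with the observation that the flag is tangential to the style-stripping problem.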
We can see that pasteHandler is imported (index.js):
import {
    pasteHandler,
    children as childrenSource,
    getBlockTransforms,
    findTransform,
    isUnmodifiedDefaultBlock,
} from '@wordpress/blocks';
pasteHandler is then called from onPaste:
onPaste( { value, onChange, html, plainText, files } ) {
    ...
    if ( files && files.length && ! html ) {
        const content = pasteHandler( {
            HTML: filePasteHandler( files ),
            mode: 'BLOCKS',
            tagName,
        } );
        ...
    const content = pasteHandler( {
        HTML: html,
        plainText,
        mode,
        tagName,
        canUserUseUnfilteredHTML,
    } );
    ...
}
For our purposes, we are interested in only a portion of the pasteHandler function:
const rawTransforms = getRawTransformations();
const phrasingContentSchema = getPhrasingContentSchema( 'paste' );
const blockContentSchema = getBlockContentSchema( rawTransforms, phrasingContentSchema, true );
const blocks = compact( flatMap( pieces, ( piece ) => {
    ...
    if ( ! canUserUseUnfilteredHTML ) {
        // Should run before `figureContentReducer`.
        filters.unshift( iframeRemover );
    }
    const schema = {
        ...blockContentSchema,
        // Keep top-level phrasing content, normalised by `normaliseBlocks`.
        ...phrasingContentSchema,
    };
    piece = deepFilterHTML( piece, filters, blockContentSchema );
    piece = removeInvalidHTML( piece, schema );
    piece = normaliseBlocks( piece );
    piece = deepFilterHTML( piece, [
        htmlFormattingRemover,
        brRemover,
        emptyParagraphRemover,
    ], blockContentSchema );
    ...
    return htmlToBlocks( { html: piece, rawTransforms } );
} ) );
Even here, most of what occurs is not relevant to our current issue. We don’t care, for example, about Google Doc UIDs being removed or Word lists being converted.
Instead we are interested in:
- rawTransforms – contains the results of a call to getRawTransformations, also defined in paste-handler.js.
    - I don’t think this code is involved, but maybe someone can help me understand what it does? 🙂
- phrasingContentSchema – contains the results of calling getPhrasingContentSchema, defined in phrasing-content.js.
    - This appears to remove a few invisible elements (u, abbr, data, etc.), which could be part of this problem, but the more likely issues folks will run into are with the CSS styles, not these elements.
- blockContentSchema – contains the results of a call to getBlockContentSchema, defined in utils.js.
    - Again, I’m not entirely sure I understand what it does, but I don’t think it is involved.
- phrasingContentReducer – one of the filters, defined in phrasing-content-reducer.js.
    - I’m unsure, but I suspect this snippet may be involved:
        if ( node.nodeName === 'SPAN' && node.style ) {
            const {
                fontWeight,
                fontStyle,
                textDecorationLine,
                textDecoration,
                verticalAlign,
            } = node.style;
- deepFilterHTML – defined in utils.js, essentially a wrapper for deepFilterNodeList, also found in utils.js.
    - Again, I’m not sure I understand this segment of code; it could be involved.
- removeInvalidHTML – defined in utils.js, essentially a wrapper for cleanNodeList, also found in utils.js.
    - I believe this is involved; the cleanNodeList JSDoc states, “Given a schema, unwraps or removes nodes, attributes and classes on a node”.
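The cleanNodeList description quoted above can be illustrated with a simplified model. This is a sketch under assumptions, not Gutenberg’s implementation: nodes are plain objects rather than DOM nodes, the schema shape (a tag name mapped to an allowed attributes list) is assumed, and cleanNode, serialize, and exampleSchema are hypothetical names. It shows how an allowlist that does not mention style would produce exactly the result seen in the paste test.

```javascript
// Simplified model, NOT Gutenberg's cleanNodeList. A node is a plain object
// { tag, attributes, children }, where children are nodes or strings.
// The schema shape below is an assumption made for this sketch.
const exampleSchema = {
    p: { attributes: [] }, // no attributes allowed, so `style` is dropped
    a: { attributes: [ 'href' ] },
};

// Returns an array: one cleaned node, or the children of an unwrapped node.
function cleanNode( node, schema ) {
    if ( typeof node === 'string' ) {
        return [ node ];
    }
    const children = ( node.children || [] ).flatMap( ( child ) =>
        cleanNode( child, schema )
    );
    const rule = schema[ node.tag ];
    if ( ! rule ) {
        // Element not in the schema: unwrap it and keep its cleaned children.
        return children;
    }
    const allowed = rule.attributes || [];
    const attributes = Object.fromEntries(
        Object.entries( node.attributes || {} ).filter( ( [ name ] ) =>
            allowed.includes( name )
        )
    );
    return [ { tag: node.tag, attributes, children } ];
}

// Turns a cleaned node back into an HTML-like string for inspection.
function serialize( node ) {
    if ( typeof node === 'string' ) {
        return node;
    }
    const attrs = Object.entries( node.attributes )
        .map( ( [ name, value ] ) => ` ${ name }="${ value }"` )
        .join( '' );
    const inner = node.children.map( serialize ).join( '' );
    return `<${ node.tag }${ attrs }>${ inner }</${ node.tag }>`;
}

const pastedNode = {
    tag: 'p',
    attributes: { style: 'color:red' },
    children: [ 'Hello WordPress StackExchange!' ],
};
const cleanedHTML = cleanNode( pastedNode, exampleSchema ).map( serialize ).join( '' );
console.log( cleanedHTML ); // <p>Hello WordPress StackExchange!</p>
```

In this model, adding 'style' to the attributes list for p is enough to keep the inline color. Whether the real schema can be extended that way is not established here.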
You’ll notice several functions that did not make the list – after reviewing their code, I don’t believe they are involved in the current problem (e.g., normaliseBlocks, brRemover, emptyParagraphRemover).
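For the phrasingContentReducer snippet and the deepFilterHTML wrapper in the list above, the following is a minimal sketch of how such a filter could behave. Everything in it is an assumption for illustration: nodes are plain objects rather than DOM nodes, reduceSpanStyle and deepFilterModel are hypothetical names, and the mapping of bold to strong and italic to em is a guess at the intent of the destructured properties, not confirmed behavior.

```javascript
// Simplified model, not Gutenberg's code: nodes are plain objects of the form
// { tag, style, children }, and children are nodes or strings.
function reduceSpanStyle( node ) {
    if ( node.tag !== 'span' || ! node.style ) {
        return node;
    }
    const { fontWeight, fontStyle } = node.style;
    let children = node.children || [];
    // Assumed mapping: bold weight becomes <strong>, italic becomes <em>.
    if ( fontWeight === 'bold' || fontWeight === '700' ) {
        children = [ { tag: 'strong', children } ];
    }
    if ( fontStyle === 'italic' ) {
        children = [ { tag: 'em', children } ];
    }
    return { ...node, children };
}

// Applies every filter to a node, then recurses into the resulting children.
function deepFilterModel( node, filters ) {
    if ( typeof node === 'string' ) {
        return node;
    }
    const filtered = filters.reduce( ( current, filter ) => filter( current ), node );
    const children = ( filtered.children || [] ).map( ( child ) =>
        deepFilterModel( child, filters )
    );
    return { ...filtered, children };
}

const spanNode = {
    tag: 'p',
    children: [
        {
            tag: 'span',
            style: { fontWeight: 'bold', color: 'red' },
            children: [ 'Hello' ],
        },
    ],
};
const reduced = deepFilterModel( spanNode, [ reduceSpanStyle ] );
console.log( JSON.stringify( reduced, null, 2 ) );
```

In this model the bold weight ends up as a semantic tag, while color has no counterpart and remains only in the style data. If a later cleaning step drops the style attribute, the color would be lost while the bold would survive.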
The Conclusion
I just rewrote most of this question. I’ll try to refine it a bit later and share more on what the specific snippets of code I did not understand do, once I have a chance to look at them. I’m hoping this may be helpful to others, or that someone can explain what I am missing; otherwise I’ll keep slogging away. 🙂
[1] Technically, this isn’t always true. Some blocks may accept most/all content pasted into them – for example the HTML block. But retaining pasted content is the exception, not the rule.
[2] You can find the reference in /paragraph/edit.js in the ParagraphBlock function.