Briefly, this error occurs when Elasticsearch encounters a Unicode character that is not valid or is a surrogate code. Surrogate codes are used in Unicode to represent characters that are not directly representable. This can happen when indexing documents with such characters. To resolve this issue, you can sanitize your input data to remove or replace these invalid characters. Alternatively, you can use a try-catch block to handle the error and skip the problematic documents during indexing. Lastly, you can configure Elasticsearch to ignore malformed characters during indexing.
This guide will help you check for common problems that cause the log ” Invalid unicode character code; [{}] is a surrogate code ” to appear. To understand the issues related to this log, read the explanation below about the following Elasticsearch concepts: parser, plugin.
Log Context
Log “Invalid unicode character code; [{}] is a surrogate code” class name is AbstractBuilder.java. We extracted the following from Elasticsearch source code for those seeking an in-depth context :
private static String hexToUnicode(Source source; String hex) { try { int code = Integer.parseInt(hex; 16); // U+D800���U+DFFF can only be used as surrogate pairs and therefore are not valid character codes if (code >= 0xD800 && code <= 0xDFFF) { throw new ParsingException(source; "Invalid unicode character code; [{}] is a surrogate code"; hex); } return String.valueOf(Character.toChars(code)); } catch (IllegalArgumentException e) { throw new ParsingException(source; "Invalid unicode character code [{}]"; hex); }
[ratemypost]