Comments (3)

dadoonet commented on June 8, 2024

Thanks @thomasm82.

You are right indeed. With the fix for #37, the mapping is no longer overwritten, but it is still not usable.

May I ask how you fixed it in the mapper attachment plugin?
It may be worth adding a hack before this gets fixed in elasticsearch core.

Let me know.

from elasticsearch-mapper-attachments.

thomasm82 commented on June 8, 2024

Hi there,
thanks for your response.

Well, I described part of it at the end of my post on the elasticsearch GH: elastic/elasticsearch#5402

Part 1 - My own ParseContext

In detail: I implemented my own ParseContext, which wraps the original ParseContext and delegates to it:

/* 
 * Created: Mar 12, 2014 11:40:41 AM
 */
package org.elasticsearch.index.mapper.attachment;

import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Field;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.lucene.all.AllEntries;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.analysis.AnalysisService;
import org.elasticsearch.index.mapper.ContentPath;
import org.elasticsearch.index.mapper.DocumentMapper;
import org.elasticsearch.index.mapper.DocumentMapper.ParseListener;
import org.elasticsearch.index.mapper.DocumentMapperParser;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.index.mapper.ParseContext;
import org.elasticsearch.index.mapper.SourceToParse;
import org.elasticsearch.index.mapper.object.RootObjectMapper;

public class AttachmentParseContext extends ParseContext {

    private ParseContext delegate;

    public AttachmentParseContext(ParseContext context) {
        /*
         * needs to be there as no default constructor is available in
         * superclass; we are delegating most calls to the original context
         * anyways
         */
        super(context.index(), context.indexSettings(), context
                .docMapperParser(), context.docMapper(), context.path());
        this.delegate = context;
    }

    /*
     * ................... START: OVERRIDING CODE ...................
     */

    /**
     * Unlike {@link ParseContext#externalValueSet()}, this method also returns
     * <code>true</code> if the external value is set without the boolean flag
     * indicating it. This is needed in order to support multi fields within an
     * attachment's field.
     * 
     * @return <code>true</code> if either the delegate's
     *         <code>externalValueSet()</code> returns <code>true</code> or the
     *         delegate's <code>externalValue()</code> is not <code>null</code>.
     */
    @Override
    public boolean externalValueSet() {
        return delegate.externalValueSet() || delegate.externalValue() != null;
    }

    /*
     * ................... END: OVERRIDING CODE ...................
     */

    /*
     * ................... START: DELEGATE METHODS ...................
     */

    public void addDoc(Document doc) {
        delegate.addDoc(doc);
    }

    public AllEntries allEntries() {
        return delegate.allEntries();
    }

    public AnalysisService analysisService() {
        return delegate.analysisService();
    }

    public Analyzer analyzer() {
        return delegate.analyzer();
    }

    public void analyzer(Analyzer analyzer) {
        delegate.analyzer(analyzer);
    }

    public void clearWithinCopyTo() {
        delegate.clearWithinCopyTo();
    }

    public void clearWithinNewMapper() {
        delegate.clearWithinNewMapper();
    }

    public Document doc() {
        return delegate.doc();
    }

    public float docBoost() {
        return delegate.docBoost();
    }

    public void docBoost(float docBoost) {
        delegate.docBoost(docBoost);
    }

    public DocumentMapper docMapper() {
        return delegate.docMapper();
    }

    public DocumentMapperParser docMapperParser() {
        return delegate.docMapperParser();
    }

    public List<Document> docs() {
        return delegate.docs();
    }

    public boolean equals(Object obj) {
        return delegate.equals(obj);
    }

    public Object externalValue() {
        return delegate.externalValue();
    }

    public void externalValue(Object externalValue) {
        delegate.externalValue(externalValue);
    }

    public boolean flyweight() {
        return delegate.flyweight();
    }

    public int hashCode() {
        return delegate.hashCode();
    }

    public String id() {
        return delegate.id();
    }

    public void id(String id) {
        delegate.id(id);
    }

    public String ignoredValue(String indexName) {
        return delegate.ignoredValue(indexName);
    }

    public void ignoredValue(String indexName, String value) {
        delegate.ignoredValue(indexName, value);
    }

    public boolean includeInAll(Boolean includeInAll, FieldMapper mapper) {
        return delegate.includeInAll(includeInAll, mapper);
    }

    public String index() {
        return delegate.index();
    }

    public Settings indexSettings() {
        return delegate.indexSettings();
    }

    public boolean isWithinCopyTo() {
        return delegate.isWithinCopyTo();
    }

    public boolean isWithinNewMapper() {
        return delegate.isWithinNewMapper();
    }

    public ParseListener listener() {
        return delegate.listener();
    }

    public boolean mappingsModified() {
        return delegate.mappingsModified();
    }

    public XContentParser parser() {
        return delegate.parser();
    }

    public ContentPath path() {
        return delegate.path();
    }

    public void reset(XContentParser parser, Document document,
            SourceToParse source, ParseListener listener) {
        delegate.reset(parser, document, source, listener);
    }

    public RootObjectMapper root() {
        return delegate.root();
    }

    public Document rootDoc() {
        return delegate.rootDoc();
    }

    public void setMappingsModified() {
        delegate.setMappingsModified();
    }

    public void setWithinCopyTo() {
        delegate.setWithinCopyTo();
    }

    public void setWithinNewMapper() {
        delegate.setWithinNewMapper();
    }

    public BytesReference source() {
        return delegate.source();
    }

    public void source(BytesReference source) {
        delegate.source(source);
    }

    public SourceToParse sourceToParse() {
        return delegate.sourceToParse();
    }

    public StringBuilder stringBuilder() {
        return delegate.stringBuilder();
    }

    public Document switchDoc(Document doc) {
        return delegate.switchDoc(doc);
    }

    public String toString() {
        return delegate.toString();
    }

    public String type() {
        return delegate.type();
    }

    public Field uid() {
        return delegate.uid();
    }

    public void uid(Field uid) {
        delegate.uid(uid);
    }

    public Field version() {
        return delegate.version();
    }

    public void version(Field version) {
        delegate.version(version);
    }

    /*
     * ................... END: DELEGATE METHODS ...................
     */
}

Part 2 - Integration with the plugin

To get my context used, I simply changed one line (as far as I can remember) in the org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(ParseContext context) method: I wrap the original context by calling context = new AttachmentParseContext(context); just before the content mapper is invoked:

    @Override
    public void parse(ParseContext context) throws IOException {
        byte[] content = null;
        String contentType = null;
        int indexedChars = defaultIndexedChars;
        String name = null;

        XContentParser parser = context.parser();
        XContentParser.Token token = parser.currentToken();
        if (token == XContentParser.Token.VALUE_STRING) {
            content = parser.binaryValue();
        } else {
            String currentFieldName = null;
            while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
                if (token == XContentParser.Token.FIELD_NAME) {
                    currentFieldName = parser.currentName();
                } else if (token == XContentParser.Token.VALUE_STRING) {
                    if ("content".equals(currentFieldName)) {
                        content = parser.binaryValue();
                    } else if ("_content_type".equals(currentFieldName)) {
                        contentType = parser.text();
                    } else if ("_name".equals(currentFieldName)) {
                        name = parser.text();
                    }
                } else if (token == XContentParser.Token.VALUE_NUMBER) {
                    if ("_indexed_chars".equals(currentFieldName) || "_indexedChars".equals(currentFieldName)) {
                        indexedChars = parser.intValue();
                    }
                }
            }
        }

        // Throw clean exception when no content is provided Fix #23
        if (content == null) {
            throw new MapperParsingException("No content is provided.");
        }

        Metadata metadata = new Metadata();
        if (contentType != null) {
            metadata.add(Metadata.CONTENT_TYPE, contentType);
        }
        if (name != null) {
            metadata.add(Metadata.RESOURCE_NAME_KEY, name);
        }

        String parsedContent;
        try {
            // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
            parsedContent = tika().parseToString(new BytesStreamInput(content, false), metadata, indexedChars);
            LanguageIdentifier languageIdentifier = new LanguageIdentifier(parsedContent);
            String language = "en";
            if (languageIdentifier.isReasonablyCertain()) {
                language = languageIdentifier.getLanguage();
            }
            context.externalValue(language);
            languageMapper.parse(context);
        } catch (Throwable e) {
            // #18: we could ignore errors when Tika does not parse data
            if (!ignoreErrors) throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
            return;
        }

        context = new AttachmentParseContext(context);
        context.externalValue(parsedContent);
        contentMapper.parse(context);


        try {
            context.externalValue(name);
            nameMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing name: {}", e.getMessage());
        }

        try {
            context.externalValue(metadata.get(Metadata.DATE));
            dateMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing date: {}: {}", e.getMessage(), context.externalValue());
        }

        try {
            context.externalValue(metadata.get(Metadata.TITLE));
            titleMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing title: {}: {}", e.getMessage(), context.externalValue());
        }

        try {
            context.externalValue(metadata.get(Metadata.AUTHOR));
            authorMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing author: {}: {}", e.getMessage(), context.externalValue());
        }

        try {
            context.externalValue(metadata.get(Metadata.KEYWORDS));
            keywordsMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing keywords: {}: {}", e.getMessage(), context.externalValue());
        }

        try {
            context.externalValue(metadata.get(Metadata.CONTENT_TYPE));
            contentTypeMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing content_type: {}: {}", e.getMessage(), context.externalValue());
        }

        try {
            if (metadata.get(Metadata.CONTENT_LENGTH) != null) {
                // We try to get CONTENT_LENGTH from Tika first
                context.externalValue(metadata.get(Metadata.CONTENT_LENGTH));
            } else {
                // Otherwise, we use our byte[] length
                context.externalValue(content.length);
            }
            contentLengthMapper.parse(context);
        } catch(MapperParsingException e){
            if (!ignoreErrors) throw e;
            if (logger.isDebugEnabled()) logger.debug("Ignoring MapperParsingException catch while parsing content_length: {}: {}", e.getMessage(), context.externalValue());
        }
    }

This way the external value is still reported as set whenever it is present, even if it has already been consumed (i.e. read).
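To make the effect easier to see in isolation, here is a minimal, self-contained sketch. It is not the Elasticsearch code: SimpleContext, StickyContext and ExternalValueDemo are hypothetical stand-ins, and the assumption (based on the behaviour described above) is that reading the external value clears the "set" flag. Each "reader" plays the role of one sub-field mapper that only picks up the value if the context reports it as set.

```java
// Hypothetical demo driver; it must come first so single-file launch finds main().
class ExternalValueDemo {

    // Simulates several sub-field mappers that each check the flag and,
    // if it is set, read (and thereby consume) the external value.
    static int readersThatSeeValue(SimpleContext context, int readers) {
        int seen = 0;
        for (int i = 0; i < readers; i++) {
            if (context.externalValueSet()) {
                context.externalValue(); // reading consumes the value
                seen++;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        SimpleContext plain = new SimpleContext();
        plain.externalValue("parsed text");
        System.out.println("plain context:   " + readersThatSeeValue(plain, 3));

        SimpleContext wrapped = new StickyContext(new SimpleContext());
        wrapped.externalValue("parsed text");
        System.out.println("wrapped context: " + readersThatSeeValue(wrapped, 3));
    }
}

// Hypothetical stand-in for a parse context (assumption: the flag is
// cleared whenever the value is read).
class SimpleContext {
    private Object externalValue;
    private boolean externalValueSet;

    public void externalValue(Object value) {
        this.externalValue = value;
        this.externalValueSet = true;
    }

    public boolean externalValueSet() {
        return externalValueSet;
    }

    public Object externalValue() {
        externalValueSet = false; // consumed on read
        return externalValue;
    }
}

// Wrapper in the spirit of AttachmentParseContext: it delegates everything,
// but reports the value as set as long as the delegate holds a non-null value.
class StickyContext extends SimpleContext {
    private final SimpleContext delegate;

    StickyContext(SimpleContext delegate) {
        this.delegate = delegate;
    }

    @Override
    public void externalValue(Object value) {
        delegate.externalValue(value);
    }

    @Override
    public Object externalValue() {
        return delegate.externalValue();
    }

    @Override
    public boolean externalValueSet() {
        return delegate.externalValueSet() || delegate.externalValue() != null;
    }
}
```

With the plain context only the first of the three readers sees the value; with the wrapper all three do, which is what multi fields inside an attachment field need. Note that the wrapper's check itself reads the delegate's value, so it also clears the delegate's flag as a side effect, just like the override in Part 1.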

Hope this helps,
Tom

dadoonet commented on June 8, 2024

Thanks for sharing this! I really appreciate it.
I created a PR in the core code based on that.

We'll see whether or not we need to patch the plugin in the meantime...

Keeping this open for the moment.
